Kevin Akselrod’s Blog

Just another WordPress.com weblog

Deduplication of Data in SQL Server (2005+) Using PARTITION BY

with 2 comments

The SQL PARTITION BY clause allows you to partition your result set based on pretty much anything you want.  Rather than expounding too much, I’m going to dive right into a simple toy example.

Let’s say that you have a table that holds your shopping list:

  CREATE TABLE [#DuplicatedData]
  (
	  [ID]		INT,
	  [MyData]	NVARCHAR(255),
	  [Date]	DATETIME
  )

You and your girlfriend both have an application on your phones that lets you enter data into this shopping list.  However, you don’t coordinate every single thing you enter, so you’re bound to have some duplication sometimes as each of you realizes that, hey, you’re out of potatoes.  So, on January 1, you both enter that you need potatoes, and you further enter that you need celery:

  INSERT INTO [#DuplicatedData] VALUES (1, 'Potatoes', '1/1/2010 5:35:04 PM')
  INSERT INTO [#DuplicatedData] VALUES (2, 'Celery', '1/1/2010 5:36:12 PM')
  INSERT INTO [#DuplicatedData] VALUES (3, 'Potatoes', '1/1/2010 6:02:38 PM')

A few days later, you’ve used up your potatoes, so you need more:

  INSERT INTO [#DuplicatedData] VALUES (4, 'Potatoes', '1/4/2010 1:20:09 PM')

When you went to print out your list on January 2, you didn’t want to see Potatoes twice.. and when you look back on your shopping habits, you also don’t want to see Potatoes twice for January 1.  So you decide that, while maybe it’s possible once in a while, it’s probably pretty rate that you will need potatoes more than once in a day, so you’re going to remove any entries that are for the same item on the same day.  You’ll only take the latest entry in a day for any particular item.  But how?

That’s where the beauty of PARTITION BY comes into play.  Let’s take a look at this query:

  SELECT
	  ROW_NUMBER() OVER(PARTITION BY [MyData], CONVERT(VARCHAR, [Date], 101) ORDER BY [ID] DESC) AS [RowNum],
	  *
  FROM
	  [#DuplicatedData]

We are breaking up the data by item and then by date (but not time) and ordering to get the latest entries showing up first, then we are getting a row number to identify the order in each partition of the items.  The results from this query look like this:

  1	  2	  Celery	2010-01-01 17:36:12.000
  1	  3	  Potatoes	2010-01-01 18:02:38.000
  2	  1	  Potatoes	2010-01-01 17:35:04.000
  1	  4	  Potatoes	2010-01-04 13:20:09.000

The Potatoes entries on January 1 have been grouped together and numbered 1 and 2, whereas the others are each in their own partitions and are numbered 1.  Once we have this, it’s trivial to get the data that we want:

  SELECT [MyData], [Date] FROM
	  (
		  SELECT
			  ROW_NUMBER() OVER(PARTITION BY [MyData], CONVERT(VARCHAR, [Date], 101) ORDER BY [ID] DESC) AS [RowNum],
			  *
		  FROM
			  [#DuplicatedData]
	  ) AS [Sub]
  WHERE
	  [RowNum] = 1
  ORDER BY
	  [Date] ASC

Results:

  Celery	2010-01-01 17:36:12.000
  Potatoes	2010-01-01 18:02:38.000
  Potatoes	2010-01-04 13:20:09.000

Thanks to Sam Grose for teaching me pretty much everything in this post.

Advertisement

Written by Kevin Akselrod

May 19, 2010 at 12:14 am

Posted in SQL

Tagged with , ,

2 Responses

Subscribe to comments with RSS.

  1. Here’s an alternate method using the Group By clause:

    SELECT MyData, Max([Date]) Date
    FROM [#DuplicatedData]
    GROUP BY CONVERT(VARCHAR, [Date], 101), MyData

    … which seems like a better choice, at least for your example.

    Brian Foote

    August 25, 2010 at 1:58 pm

  2. Brian -

    True, for the simple example that I gave, that would be better. However, If the fields you are using for deduplication are a subset of the fields in which you are interested, the method presented in this post would be better. Also, if you have any interest in the ID field (such as if you want to mark the duplicate entries as duplicates to ignore), then the method presented would also be more appropriate.

    Kevin Akselrod

    August 28, 2010 at 2:50 pm


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.