Deduplication of Data in SQL Server (2005+) Using PARTITION BY
The SQL PARTITION BY clause allows you to partition your result set based on pretty much anything you want. Rather than expounding too much, I’m going to dive right into a simple toy example.
Let’s say that you have a table that holds your shopping list:
CREATE TABLE [#DuplicatedData] ( [ID] INT, [MyData] NVARCHAR(255), [Date] DATETIME )
You and your girlfriend both have an application on your phones that lets you enter data into this shopping list. However, you don’t coordinate every single thing you enter, so you’re bound to have some duplication sometimes as each of you realizes that, hey, you’re out of potatoes. So, on January 1, you both enter that you need potatoes, and you further enter that you need celery:
INSERT INTO [#DuplicatedData] VALUES (1, 'Potatoes', '1/1/2010 5:35:04 PM') INSERT INTO [#DuplicatedData] VALUES (2, 'Celery', '1/1/2010 5:36:12 PM') INSERT INTO [#DuplicatedData] VALUES (3, 'Potatoes', '1/1/2010 6:02:38 PM')
A few days later, you’ve used up your potatoes, so you need more:
INSERT INTO [#DuplicatedData] VALUES (4, 'Potatoes', '1/4/2010 1:20:09 PM')
When you went to print out your list on January 2, you didn’t want to see Potatoes twice.. and when you look back on your shopping habits, you also don’t want to see Potatoes twice for January 1. So you decide that, while maybe it’s possible once in a while, it’s probably pretty rate that you will need potatoes more than once in a day, so you’re going to remove any entries that are for the same item on the same day. You’ll only take the latest entry in a day for any particular item. But how?
That’s where the beauty of PARTITION BY comes into play. Let’s take a look at this query:
SELECT ROW_NUMBER() OVER(PARTITION BY [MyData], CONVERT(VARCHAR, [Date], 101) ORDER BY [ID] DESC) AS [RowNum], * FROM [#DuplicatedData]
We are breaking up the data by item and then by date (but not time) and ordering to get the latest entries showing up first, then we are getting a row number to identify the order in each partition of the items. The results from this query look like this:
1 2 Celery 2010-01-01 17:36:12.000 1 3 Potatoes 2010-01-01 18:02:38.000 2 1 Potatoes 2010-01-01 17:35:04.000 1 4 Potatoes 2010-01-04 13:20:09.000
The Potatoes entries on January 1 have been grouped together and numbered 1 and 2, whereas the others are each in their own partitions and are numbered 1. Once we have this, it’s trivial to get the data that we want:
SELECT [MyData], [Date] FROM ( SELECT ROW_NUMBER() OVER(PARTITION BY [MyData], CONVERT(VARCHAR, [Date], 101) ORDER BY [ID] DESC) AS [RowNum], * FROM [#DuplicatedData] ) AS [Sub] WHERE [RowNum] = 1 ORDER BY [Date] ASC
Results:
Celery 2010-01-01 17:36:12.000 Potatoes 2010-01-01 18:02:38.000 Potatoes 2010-01-04 13:20:09.000
Thanks to Sam Grose for teaching me pretty much everything in this post.
Here’s an alternate method using the Group By clause:
SELECT MyData, Max([Date]) Date
FROM [#DuplicatedData]
GROUP BY CONVERT(VARCHAR, [Date], 101), MyData
… which seems like a better choice, at least for your example.
Brian Foote
August 25, 2010 at 1:58 pm
Brian -
True, for the simple example that I gave, that would be better. However, If the fields you are using for deduplication are a subset of the fields in which you are interested, the method presented in this post would be better. Also, if you have any interest in the ID field (such as if you want to mark the duplicate entries as duplicates to ignore), then the method presented would also be more appropriate.
Kevin Akselrod
August 28, 2010 at 2:50 pm