How to Remove Duplicate Records with SQL
- Confirm that you do have duplicate records
- Viewing the duplicates full records
- Removing Duplicates: The Straightforward Way
- Removing Duplicates: The Fancy Way
There are certain situations where duplicate records might get into your database. For example, if you are importing data. I came across this problem working on my project Mana Wisdom, when I was importing orders into my database. I needed to check if any duplicates had been imported by mistake.
Another potential scenario would be an application that has multiple routes to onboarding new users - you can end up with duplicate records in your User table.
In this guide, I walk through how to check a table for duplicates, and how to remove them.
Confirm that you do have duplicate records
Before removing duplicates, you need to check if you have any duplicates in the first place. The columns you want to check for depends on the type of duplication you suspect. The easiest place to start is to check for primary key duplicates.
Check for duplicate primary keys
To do this, query the table and aggregate them by their primary key, in this case
id. Then use the
HAVING clause to only return
ids with a count greater than 1:
SELECT id, COUNT(*) as Count FROM Orders GROUP BY id HAVING COUNT(*) > 1
Here we see that there are a few records with duplicate records, and one (11457) with triplicate records:
Check for duplicate names
If you want to check for duplicate users, you’ll want to check for their name and some other identifier like their address, or in this case email address:
SELECT firstname, lastname, email, COUNT(*) as Count FROM Persons GROUP BY firstname, lastname HAVING COUNT(*) > 1
|Cthulhu||Destroyer of Worldsemail@example.com||3|
Viewing the duplicates full records
Before you delete the duplicate records, you might want to view other columns on the records. There may be information that indicates which is the canonical record. Alternatively, you might want to merge information on two records. Or you may want to find out how the duplicate records were created, to prevent it from happening again.
To do this, use the query above, but now encase it in a common table expression:
;WITH Duplicates AS (SELECT asset_id, COUNT(*) AS count FROM publications GROUP BY asset_id HAVING COUNT(*) > 1)
Then you can join it to the table you are querying, to get all of the columns:
SELECT p.*, d.count FROM Duplicates d INNER JOIN Publications p ON p.asset_id = d.asset_id
Now you can decide which record to keep, and whether to merge any fields:
|Mad Dog||Jones||Mad.Dog.Jones@gmail.com||01/02/2017||Online Subscription||2|
|Mad Dog||Jones||Mad.Dog.Jones@gmail.com||04/02/2018||Company Membership||2|
Removing Duplicates: The Straightforward Way
Ok, so you’ve decided you want to delete the duplicate records.
It’s easier to create a new table with just one row for each distinct value, rather than trying to delete the extra records so that only one valid record remains.
To do this, create a common table expression with a list of distinct records that you want to keep. Then
INNER JOIN the resulting rowset to the table you are working on and save it as a temporary table.
;WITH Duplicates AS (SELECT distinct(asset_id) FROM Orders) SELECT p.* INTO #TempTable FROM Duplicates d INNER JOIN Orders p ON p.asset_id = d.asset_id
You can then
DELETE your table and
INSERT the contents of the temporary table.
DELETE Orders INSERT Orders SELECT * FROM #TempTable DROP TABLE #TempTable
Removing Duplicates: The Fancy Way
There is a faster way to remove duplicates: by using the
ROW_NUMBER function in SQL Server. The
ROW_NUMBER function assigns an ascending row number based on the criteria you specify in the
OVER clause. For example, you can assign a row number that will count the number of name duplications:
SELECT *, ROW_NUMBER() OVER (PARTITION BY firstname, lastname ORDER BY firstname, lastname) AS 'row number' FROM [dbo].[Users]
The row number count is informed by the
ORDER BY statement in the
OVER clause. In this case, it is firstname. So the first James record get a row number of one; the next James get’s a row number of two, and so on:
|James||Goodman||2007-05-08 12:35:29||Goodman@gmail.com||Paper Subscription||1|
|Mad Dog||Jones||2007-05-08 12:35:29||Mad.Dog.Jones@gmail.com||Online Subscription||1|
|Mad Dog||Jones||2019-07-02 19:46:34||Mad.Dog.Jones@gmail.com||Online Subscription||2|
If you’re happy to just delete any record that duplicates the columns you specified, you can encase the query in a common table expression and directly delete records with a row number greater than 1:
WITH CTE AS( SELECT *, ROW_NUMBER() OVER (PARTITION BY firstname, lastname ORDER BY firstname) AS 'row number' FROM [dbo].[Users] ) DELETE FROM CTE WHERE 'row number' > 1