Remove Duplicate Rows in SQL – Keep Multiple

Learn how to remove duplicate rows from an SQL table while keeping multiple entries. Step-by-step SQL queries and best practices.
  • ⚠️ Duplicate rows in SQL often occur due to missing unique constraints, incorrect query logic, or data imports.
  • 💡 ROW_NUMBER() with PARTITION BY helps delete excess duplicates while preserving a specified number of occurrences.
  • 🚀 DELETE with INNER JOIN (MySQL and SQL Server syntax) removes excess duplicates by joining the table to a ranked copy of itself instead of filtering with IN.
  • 🔄 CTEs (Common Table Expressions) provide a structured and efficient way to manage duplicate removal.
  • 📊 Indexing and batched deletes improve performance when handling large datasets with duplicates.

Duplicate rows in SQL tables can create inconsistencies, lead to incorrect query results, and cause inefficiencies in database performance. While completely purging duplicates is a standard approach, there are cases where retaining a set number of duplicate entries is necessary. In this comprehensive guide, we’ll explore why duplicates occur, strategies to remove them while retaining multiple instances, and the best SQL techniques for handling duplicates across different database systems, including SQL Server, MySQL, and PostgreSQL.


Why Do Duplicate Rows Occur?

Duplicate rows typically result from various factors, including:

1. Lack of Unique Constraints

  • Tables that are missing primary keys or unique indexes allow identical rows to be inserted multiple times.
  • This is common in databases that have not been designed with strict constraints.

2. Incorrect Query Design

  • Poorly structured SQL queries can unintentionally produce duplicate results, such as improper JOIN conditions or selecting redundant rows.
  • Omitting DISTINCT where it is needed repeats rows in the result set; if that result feeds an INSERT, the repeats end up stored in the table.

3. Data Import and Synchronization Issues

  • Merging multiple datasets or importing data from external sources can introduce duplicates.
  • Batch insert processes without deduplication mechanisms often lead to duplicate records in tables.

Understanding the root cause of duplicates can help prevent them before they occur. However, when duplicates are already present, structured SQL operations are necessary to clean up the data efficiently.
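Before deleting anything, it helps to see which groups are actually duplicated. The sketch below is one way to do that with GROUP BY and HAVING. It runs against an in-memory SQLite database through Python's sqlite3 module purely so it can be executed as-is; the table layout, sample rows, and the find_duplicate_groups helper are assumptions made for illustration.

```python
import sqlite3


def find_duplicate_groups(conn):
    """Return (column_a, column_b, occurrences) for each group seen more than once."""
    return conn.execute(
        """
        SELECT column_a, column_b, COUNT(*) AS occurrences
        FROM your_table
        GROUP BY column_a, column_b
        HAVING COUNT(*) > 1
        ORDER BY column_a, column_b
        """
    ).fetchall()


# Hypothetical sample data: a table with no unique constraint on (column_a, column_b).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (id INTEGER PRIMARY KEY, column_a TEXT, column_b TEXT)"
)
conn.executemany(
    "INSERT INTO your_table (column_a, column_b) VALUES (?, ?)",
    [("x", "1"), ("x", "1"), ("x", "1"), ("y", "2"), ("z", "3"), ("z", "3")],
)

print(find_duplicate_groups(conn))  # [('x', '1', 3), ('z', '3', 2)]
```

The same SELECT should work unchanged on SQL Server, MySQL, and PostgreSQL, since it uses only standard grouping.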


Understanding the Approach to Removing Duplicates While Keeping Some

Duplicate handling strategies in SQL generally fall into two categories:

  1. Removing all duplicates – Ensures every record in a table appears only once.
  2. Keeping a specific number of duplicates – Ensures only a defined number of duplicate rows remain, typically based on business logic (e.g., latest transactions).

To determine duplicates, we must define which columns make a row "identical" and use SQL operations like ROW_NUMBER(), RANK(), and DISTINCT.
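ROW_NUMBER() and RANK() behave differently when rows tie on the ordering column, which matters when counting how many rows to keep. The following sketch (in-memory SQLite via Python's sqlite3, with made-up sample rows and a created_at column assumed for illustration) puts the two side by side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (id INTEGER PRIMARY KEY, column_a TEXT, created_at TEXT)"
)
# Rows 1 and 2 share the same created_at value, so they tie in the ordering.
conn.executemany(
    "INSERT INTO your_table (column_a, created_at) VALUES (?, ?)",
    [("x", "2024-01-01"), ("x", "2024-01-01"), ("x", "2024-01-02")],
)

rows = conn.execute(
    """
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY column_a ORDER BY created_at) AS row_num,
           RANK()       OVER (PARTITION BY column_a ORDER BY created_at) AS rnk
    FROM your_table
    ORDER BY id
    """
).fetchall()

for row_id, row_num, rnk in rows:
    print(row_id, row_num, rnk)
```

ROW_NUMBER() always yields 1, 2, 3 within the group (the order among tied rows is not guaranteed), while RANK() gives both tied rows a rank of 1 and then skips to 3. A filter such as rnk > 1 would therefore keep both tied rows, whereas row_num > 1 keeps exactly one.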


Using ROW_NUMBER() to Remove Duplicates (SQL Server, PostgreSQL, MySQL 8+)

The ROW_NUMBER() function is widely used for handling duplicates because it assigns a unique row number to each occurrence within a defined partition.

Example Query (SQL Server, PostgreSQL, MySQL 8+)

WITH CTE AS (
    SELECT 
        id, 
        ROW_NUMBER() OVER (PARTITION BY column_a, column_b ORDER BY id) AS row_num
    FROM your_table
)
DELETE FROM your_table
WHERE id IN (SELECT id FROM CTE WHERE row_num > 2);

How This Works

  • The ROW_NUMBER() function assigns a sequential number to each record within a duplicate group (PARTITION BY column_a, column_b).
  • The ORDER BY id determines which rows receive lower numbers, keeping the first two occurrences.
  • The DELETE statement removes all rows where row_num > 2, keeping only two duplicates per group.

This method works on any database that supports window functions, and the number of rows kept per group is a single literal in the WHERE clause, which makes it a good default choice.
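To see the query above in action, here is a small runnable sketch. It uses an in-memory SQLite database (which also supports ROW_NUMBER() and CTEs in front of DELETE) through Python's sqlite3 module; the sample rows are invented for the demonstration.

```python
import sqlite3

KEEP_TWO_SQL = """
WITH CTE AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY column_a, column_b ORDER BY id) AS row_num
    FROM your_table
)
DELETE FROM your_table
WHERE id IN (SELECT id FROM CTE WHERE row_num > 2)
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (id INTEGER PRIMARY KEY, column_a TEXT, column_b TEXT)"
)
# Four copies of ('x', '1') get ids 1-4; the single ('y', '2') row gets id 5.
conn.executemany(
    "INSERT INTO your_table (column_a, column_b) VALUES (?, ?)",
    [("x", "1")] * 4 + [("y", "2")],
)

changes_before = conn.total_changes
conn.execute(KEEP_TWO_SQL)
deleted = conn.total_changes - changes_before
remaining = [row[0] for row in conn.execute("SELECT id FROM your_table ORDER BY id")]

print(deleted, remaining)  # 2 [1, 2, 5]
```

The two lowest ids in the duplicated group survive, and the group that had only one row is untouched.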


Using Common Table Expressions (CTEs) for Duplicate Removal

CTEs provide an easier way to structure removal logic, making queries more readable.

Example CTE for Keeping N Occurrences

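WITH CTE AS (
    SELECT 
        id, 
        -- id DESC breaks ties so rows with equal created_at are numbered deterministically
        ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY created_at DESC, id DESC) AS row_num
    FROM your_table
)
DELETE FROM your_table 
WHERE id IN (SELECT id FROM CTE WHERE row_num > 3);

This approach retains the three most recent occurrences in each group, based on the created_at timestamp. The secondary sort on id makes the choice repeatable when two rows share the same timestamp.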
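The number of rows to keep does not have to be hard-coded. A minimal sketch, assuming an in-memory SQLite database driven from Python's sqlite3 module, passes N as a bound parameter; keep_n_most_recent and the sample data are hypothetical names and values chosen for illustration.

```python
import sqlite3


def keep_n_most_recent(conn, n):
    """Delete all but the n newest rows in each column_name group."""
    conn.execute(
        """
        WITH CTE AS (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY column_name
                       ORDER BY created_at DESC, id DESC
                   ) AS row_num
            FROM your_table
        )
        DELETE FROM your_table
        WHERE id IN (SELECT id FROM CTE WHERE row_num > ?)
        """,
        (n,),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (id INTEGER PRIMARY KEY, column_name TEXT, created_at TEXT)"
)
# Group 'a' has five rows dated 2024-01-01 .. 2024-01-05 (ids 1-5); group 'b' has one (id 6).
rows = [("a", f"2024-01-0{day}") for day in range(1, 6)] + [("b", "2024-01-01")]
conn.executemany("INSERT INTO your_table (column_name, created_at) VALUES (?, ?)", rows)

keep_n_most_recent(conn, 3)
remaining = [row[0] for row in conn.execute("SELECT id FROM your_table ORDER BY id")]
print(remaining)  # [3, 4, 5, 6]
```

Binding N as a parameter keeps the statement reusable for different retention rules without string concatenation.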


Removing Duplicates With DISTINCT and GROUP BY

For simpler deduplication requirements, DISTINCT and GROUP BY can eliminate duplicates, but they collapse each group to a single row, so they cannot retain N occurrences the way ROW_NUMBER() can.

Using DISTINCT

SELECT DISTINCT column_a, column_b FROM your_table;

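This fetches unique combinations of column_a and column_b, but a SELECT only reads data and deletes nothing. To deduplicate with it, copy the distinct rows into a new table using INSERT INTO ... SELECT DISTINCT and then swap that table in for the original.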
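One way to turn a DISTINCT query into an actual cleanup is to copy its result into a fresh table. The sketch below does this against an in-memory SQLite database from Python; your_table_dedup is a hypothetical name for the clean copy, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (column_a TEXT, column_b TEXT)")
conn.executemany(
    "INSERT INTO your_table (column_a, column_b) VALUES (?, ?)",
    [("x", "1"), ("x", "1"), ("y", "2"), ("y", "2"), ("y", "3")],
)

# your_table_dedup is an assumed name for the deduplicated copy.
conn.execute("CREATE TABLE your_table_dedup (column_a TEXT, column_b TEXT)")
conn.execute(
    """
    INSERT INTO your_table_dedup (column_a, column_b)
    SELECT DISTINCT column_a, column_b FROM your_table
    """
)

unique_rows = conn.execute(
    "SELECT column_a, column_b FROM your_table_dedup ORDER BY column_a, column_b"
).fetchall()
print(unique_rows)  # [('x', '1'), ('y', '2'), ('y', '3')]
```

The original table is left as it was, so the copy can be checked before it replaces anything.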

Using GROUP BY to Keep Preferred Records

SELECT column_a, column_b, MIN(id) AS id
FROM your_table
GROUP BY column_a, column_b;

This query returns the lowest ID in each duplicate set. On its own it deletes nothing; to keep only those rows, delete every row whose id is not in this result.
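The GROUP BY query can drive a DELETE by excluding the ids it returns. This is a sketch on an in-memory SQLite database via Python's sqlite3, with invented sample rows. Note that MySQL typically rejects a subquery that reads the table being deleted from (error 1093) unless it is wrapped in a derived table, so treat this exact form as an assumption that holds for SQLite, PostgreSQL, and SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (id INTEGER PRIMARY KEY, column_a TEXT, column_b TEXT)"
)
# ('x', '1') receives ids 1-3 and ('y', '2') receives ids 4-5.
conn.executemany(
    "INSERT INTO your_table (column_a, column_b) VALUES (?, ?)",
    [("x", "1")] * 3 + [("y", "2")] * 2,
)

# Keep the lowest id per (column_a, column_b) group and delete the rest.
conn.execute(
    """
    DELETE FROM your_table
    WHERE id NOT IN (
        SELECT MIN(id)
        FROM your_table
        GROUP BY column_a, column_b
    )
    """
)

remaining = [row[0] for row in conn.execute("SELECT id FROM your_table ORDER BY id")]
print(remaining)  # [1, 4]
```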


Using DELETE with INNER JOIN for Efficient Duplicate Removal

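Joining the table to a ranked copy of itself is an alternative to filtering with IN, and some optimizers handle the join form better. The DELETE ... JOIN syntax below works in MySQL 8+ and SQL Server; PostgreSQL uses DELETE ... USING instead.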

DELETE t1 FROM your_table t1
INNER JOIN (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY created_at DESC) AS row_num
    FROM your_table
) t2
ON t1.id = t2.id
WHERE t2.row_num > 2;

Because the ranked rows are joined back on id, an index on id (typically the primary key) lets the engine locate each row to delete directly.


Keeping Most Recent or Least Recent Records (Using ORDER BY)

For scenarios where the latest or earliest records should be retained, the ORDER BY inside the window function decides which rows survive: DESC numbers the newest rows first, ASC the oldest.

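WITH CTE AS (
    SELECT 
        id, 
        ROW_NUMBER() OVER (PARTITION BY column_name ORDER BY created_at DESC) AS row_num
    FROM your_table
)
-- row_num exists only in the CTE, so filter through it rather than on the base table
DELETE FROM your_table
WHERE id IN (SELECT id FROM CTE WHERE row_num > 3);

This ensures that only the three most recent records per duplicate group remain. Change the ordering to created_at ASC to keep the three earliest instead. SQL Server also accepts DELETE FROM CTE WHERE row_num > 3 directly, because a CTE over a single table is updatable there.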
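Swapping the sort direction changes which end of the timeline survives. The sketch below makes that a flag; it runs on in-memory SQLite through Python's sqlite3, and trim_groups, make_table, and the sample dates are hypothetical helpers and data for illustration only.

```python
import sqlite3


def trim_groups(conn, keep, newest_first=True):
    """Keep `keep` rows per column_name group: the newest ones, or the oldest ones."""
    # Only one of two fixed keywords is interpolated, so no user input reaches the SQL text.
    direction = "DESC" if newest_first else "ASC"
    conn.execute(
        f"""
        WITH CTE AS (
            SELECT id,
                   ROW_NUMBER() OVER (
                       PARTITION BY column_name
                       ORDER BY created_at {direction}, id {direction}
                   ) AS row_num
            FROM your_table
        )
        DELETE FROM your_table
        WHERE id IN (SELECT id FROM CTE WHERE row_num > ?)
        """,
        (keep,),
    )


def make_table():
    """Build a table with one group of five rows dated 2024-01-01 .. 2024-01-05 (ids 1-5)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE your_table (id INTEGER PRIMARY KEY, column_name TEXT, created_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO your_table (column_name, created_at) VALUES (?, ?)",
        [("a", f"2024-01-0{day}") for day in range(1, 6)],
    )
    return conn


def remaining_ids(conn):
    return [row[0] for row in conn.execute("SELECT id FROM your_table ORDER BY id")]


newest = make_table()
trim_groups(newest, keep=3, newest_first=True)
print(remaining_ids(newest))  # [3, 4, 5]

oldest = make_table()
trim_groups(oldest, keep=3, newest_first=False)
print(remaining_ids(oldest))  # [1, 2, 3]
```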


Real-World Use Cases for Keeping Multiple Instances

Many industries require retaining limited duplicate records for historical purposes. Examples include:

  • Sales transactions – Keeping the last N transactions per customer for record tracking.
  • System logs – Maintaining only the most recent log entries for debugging purposes.
  • Data warehousing – Retaining only relevant records while clearing outdated ones.

Applying SQL deduplication methods ensures database integrity while preserving necessary historical data.


Performance Considerations When Removing Duplicates

Efficient duplicate removal is crucial when handling large datasets. Consider the following:

  • Index key columns – A composite index on the PARTITION BY columns followed by the ORDER BY column can let the engine number rows without a separate sort.
  • Batch deletion – Deleting data in chunks reduces locking and prevents excessive resource consumption.
  • Optimize query execution plans – Inspect plans with EXPLAIN (MySQL/PostgreSQL) or SET SHOWPLAN_ALL ON (SQL Server); SET STATISTICS IO ON reports I/O counts rather than the plan itself.

Efficient indexing and query structuring significantly enhance database performance.
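Batch deletion and indexing can be combined in a simple loop. The following is a sketch under stated assumptions: it targets in-memory SQLite from Python's sqlite3, the index name and the delete_excess_in_batches helper are made up, and the tiny batch size exists only to make the batching visible.

```python
import sqlite3

BATCH_SQL = """
DELETE FROM your_table
WHERE id IN (
    SELECT id FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY column_a, column_b ORDER BY id) AS row_num
        FROM your_table
    )
    WHERE row_num > ?
    LIMIT ?
)
"""


def delete_excess_in_batches(conn, keep=2, batch_size=1000):
    """Delete rows beyond `keep` per group, at most `batch_size` rows per transaction."""
    total_deleted = 0
    while True:
        changes_before = conn.total_changes
        conn.execute(BATCH_SQL, (keep, batch_size))
        conn.commit()  # end the transaction after each chunk so locks are released
        deleted = conn.total_changes - changes_before
        if deleted == 0:
            return total_deleted
        total_deleted += deleted


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (id INTEGER PRIMARY KEY, column_a TEXT, column_b TEXT)"
)
# Index the partition columns, then the ordering column (index name is hypothetical).
conn.execute("CREATE INDEX idx_dupe_key ON your_table (column_a, column_b, id)")
conn.executemany(
    "INSERT INTO your_table (column_a, column_b) VALUES (?, ?)",
    [("x", "1")] * 10 + [("y", "2")],
)
conn.commit()

total = delete_excess_in_batches(conn, keep=2, batch_size=3)
remaining = [row[0] for row in conn.execute("SELECT id FROM your_table ORDER BY id")]
print(total, remaining)  # 8 [1, 2, 11]
```

Because rows are numbered by id and only rows numbered above the keep threshold are removed, the survivors stay the same no matter how the work is split into batches.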


Alternative Approaches: Stored Procedures & Triggers

To automate duplication handling, consider:

  • Stored Procedures – Wrap the cleanup query in a procedure that a scheduler (for example SQL Server Agent or the MySQL Event Scheduler) runs at predefined intervals.
  • Triggers – Prevent duplicate rows from being inserted by enforcing checks at the database level.

For high-transaction systems, proactive constraints provide better long-term efficiency than reactive cleanup queries.
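A unique constraint cannot express "at most N copies", but a trigger can. The sketch below uses SQLite trigger syntax through Python's sqlite3, so it is an illustration rather than portable DDL: SQL Server, MySQL, and PostgreSQL each have their own trigger dialects. The trigger name, the limit of two, and the try_insert helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE your_table (id INTEGER PRIMARY KEY, column_a TEXT, column_b TEXT)"
)
# Reject an insert when two rows with the same (column_a, column_b) already exist.
conn.execute(
    """
    CREATE TRIGGER limit_duplicates
    BEFORE INSERT ON your_table
    WHEN (SELECT COUNT(*) FROM your_table
          WHERE column_a = NEW.column_a AND column_b = NEW.column_b) >= 2
    BEGIN
        SELECT RAISE(ABORT, 'duplicate limit reached');
    END
    """
)


def try_insert(conn, column_a, column_b):
    """Return True if the row was stored, False if the trigger rejected it."""
    try:
        conn.execute(
            "INSERT INTO your_table (column_a, column_b) VALUES (?, ?)",
            (column_a, column_b),
        )
        return True
    except sqlite3.DatabaseError:
        return False


results = [try_insert(conn, "x", "1") for _ in range(3)]
stored = conn.execute("SELECT COUNT(*) FROM your_table").fetchone()[0]
print(results, stored)  # [True, True, False] 2
```

Under concurrent writers a count-based check like this may need additional locking to be reliable, which is worth verifying on the target database.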


Final Thoughts

Managing duplicate rows in SQL while keeping multiple instances requires a structured approach using ROW_NUMBER(), CTEs, and DELETE strategies. By optimizing performance with indexes and batching, developers can maintain clean, efficient databases. Choosing the right deduplication strategy based on business rules ensures the integrity of data while avoiding unnecessary deletion.
