Remove Database Duplicates Using SQL Server Common Table Expressions (CTEs)

Update a database table to remove duplicates with data from another table

Introduction

In this article, we will explore how to remove duplicate records from one table while updating the rows in a second table that reference them. We will use SQL Server as an example. The concepts carry over to other relational databases, but the UPDATE ... FROM form used here is not standard SQL and its syntax varies between systems.

The problem statement involves two tables: Table1 and Table2. Table1 has a unique code generated by combining Val1, Val2, and Val3 columns, which is then linked to ItemIds from another table. Meanwhile, Table2 references both the unique codes from Table1 and the corresponding ItemIds. The goal is to remove duplicate records from Table1 while updating the corresponding records in Table2.

Understanding the tables and relationships

Let’s examine the two tables:

Table1

ID  ItemId  Val1  Val2  Val3
1   2       aaa   bbb   100
2   2       aaa   bbb   100
3   2       ccc   ddd   222
4   2       ccc   ddd   222
5   3       ggg   hhh   100

Table2

ID   ItemId  Table1_Id
100  2       1
101  2       2
102  2       3
103  2       4
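
To experiment with the queries in this article, the sample data above can be recreated with a short script. This is only a sketch: the column types and constraints are assumptions (the article shows values, not a schema), and the reference column is named Table1_Id to match the UPDATE statement used later.

```sql
-- Sample schema. Data types and constraints are assumed for illustration.
CREATE TABLE Table1 (
    ID     INT         NOT NULL PRIMARY KEY,
    ItemId INT         NOT NULL,
    Val1   VARCHAR(10) NOT NULL,
    Val2   VARCHAR(10) NOT NULL,
    Val3   INT         NOT NULL
);

CREATE TABLE Table2 (
    ID        INT NOT NULL PRIMARY KEY,
    ItemId    INT NOT NULL,
    Table1_Id INT NOT NULL REFERENCES Table1 (ID)
);

-- Rows copied from the tables shown above.
INSERT INTO Table1 (ID, ItemId, Val1, Val2, Val3) VALUES
    (1, 2, 'aaa', 'bbb', 100),
    (2, 2, 'aaa', 'bbb', 100),
    (3, 2, 'ccc', 'ddd', 222),
    (4, 2, 'ccc', 'ddd', 222),
    (5, 3, 'ggg', 'hhh', 100);

INSERT INTO Table2 (ID, ItemId, Table1_Id) VALUES
    (100, 2, 1),
    (101, 2, 2),
    (102, 2, 3),
    (103, 2, 4);
```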

The challenge

The initial query attempts to remove duplicates from Table1 by updating the ID column in Table2 based on the maximum ID for each group of ItemId, Val1, Val2, and Val3. However, this approach has limitations when dealing with multiple IDs and cases.
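
Before changing anything, it can help to see which groups actually contain duplicates. The query below is a minimal sketch of that check; the aliases DuplicateCount and LowestID are illustrative names, not part of the original question.

```sql
-- List every (ItemId, Val1, Val2, Val3) combination that occurs more than once.
SELECT ItemId,
       Val1,
       Val2,
       Val3,
       COUNT(*) AS DuplicateCount,  -- how many rows share this combination
       MIN(ID)  AS LowestID         -- the smallest ID in the group
FROM Table1
GROUP BY ItemId, Val1, Val2, Val3
HAVING COUNT(*) > 1;
```

With the sample rows, this returns two groups: (2, aaa, bbb, 100) with LowestID 1 and (2, ccc, ddd, 222) with LowestID 3. The row with ID 5 has no duplicate and is not listed.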

Solution using Common Table Expressions (CTEs)

The provided answer uses a Common Table Expression (CTE) to simplify the query and achieve the desired result in a single UPDATE statement.

CTE for updating Table2

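WITH MinIDCTE AS (
    -- For every row, find the lowest ID among the rows that share the same
    -- ItemId, Val1, Val2 and Val3 (that is, within its duplicate group).
    SELECT ID,
           ItemId,
           Val1,
           Val2,
           Val3,
           MIN(ID) OVER (PARTITION BY ItemId, Val1, Val2, Val3) AS MinID
    FROM Table1
)
-- Repoint each Table2 row at the lowest Table1 ID of its group.
UPDATE t2
SET t2.Table1_Id = m.MinID
FROM Table2 t2
JOIN MinIDCTE m ON t2.Table1_Id = m.ID;

Explanation of the CTE

The CTE MinIDCTE is a named result set that exists only for the duration of the statement. For every row of Table1 it returns the row's own ID together with MinID, the smallest ID among all rows that share the same ItemId, Val1, Val2, and Val3. The OVER clause partitions the data by these four columns, and MIN() is evaluated as a window function, so no rows are collapsed: each duplicate keeps its own row and is paired with the ID of the row that will survive. Val3 must be part of the partition, because the unique code is built from all three Val columns; leaving it out would merge rows that differ only in Val3.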
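
To see what the CTE produces before running any UPDATE, it can be selected from directly. This is a small sketch that partitions by all four columns named in the problem statement (ItemId, Val1, Val2, Val3) and only reads data.

```sql
-- Preview the ID -> MinID mapping without modifying anything.
WITH MinIDCTE AS (
    SELECT ID,
           MIN(ID) OVER (PARTITION BY ItemId, Val1, Val2, Val3) AS MinID
    FROM Table1
)
SELECT ID, MinID
FROM MinIDCTE
ORDER BY ID;
-- With the sample rows: 1 -> 1, 2 -> 1, 3 -> 3, 4 -> 3, 5 -> 5
```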

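In the UPDATE statement, we join Table2 to the CTE on Table1_Id = ID and set Table1_Id to the MinID of the matching row. With the sample data, rows 101 and 103 in Table2 are repointed from Table1 IDs 2 and 4 to IDs 1 and 3, while rows 100 and 102 already reference the lowest ID and keep their values. Note that this statement only repoints Table2: the duplicate rows in Table1 still exist and have to be deleted in a separate step once nothing references them.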
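
The UPDATE leaves the duplicate rows in Table1 in place. One possible way to finish the cleanup, sketched here and not taken from the original answer, is to number the rows within each group and delete everything except the first. The CTE name Ranked and the column rn are illustrative, and the sketch assumes that Table2 is the only table referencing Table1.

```sql
-- Run this only after Table2 has been repointed to the lowest ID of each group.
WITH Ranked AS (
    SELECT ID,
           ROW_NUMBER() OVER (
               PARTITION BY ItemId, Val1, Val2, Val3
               ORDER BY ID
           ) AS rn          -- 1 for the lowest ID in the group, 2+ for duplicates
    FROM Table1
)
DELETE FROM Ranked          -- SQL Server allows deleting through a single-table CTE
WHERE rn > 1;
```

With the sample rows, this removes IDs 2 and 4 and keeps 1, 3, and 5. If Table1_Id is declared as a foreign key, running the DELETE before the UPDATE fails with a constraint error instead of leaving orphaned rows in Table2, so running both statements inside one transaction is a reasonable precaution.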

Benefits of using a CTE

The CTE approach offers several advantages:

  • Computes the surviving ID for every row with a single window function, instead of a correlated subquery or a self-join against a GROUP BY
  • Keeps the UPDATE statement short, because the join only has to match each referenced ID to its precomputed MinID
  • Updates all affected rows in Table2 in one set-based statement, with no loop or cursor

Conclusion

In this article, we explored how to update a database table to remove duplicate records while maintaining relationships with another table. We introduced Common Table Expressions (CTEs) as a powerful tool for simplifying complex queries and achieving efficient updates.

By using a CTE, we can repoint every reference in Table2 with a single UPDATE statement, after which the unreferenced duplicates in Table1 can be deleted safely. The approach works the same way no matter how many duplicates each group contains, which makes it a practical option for real-world data cleanup tasks.


Last modified on 2023-08-09