SELECT DISTINCT to Return at Most One Row

Introduction

The problem statement is as follows:

Given two tables, Regions (shown first) and Customers (shown second), with the following structure and sample data:

+----+-------+
| id | name  |
+----+-------+
| 1  | EU    |
| 2  | US    |
| 3  | SEA   |
+----+-------+

+----+-------+--------+
| id | name  | region |
+----+-------+--------+
| 1  | peter | 1      |
| 2  | henry | 1      |
| 3  | john  | 2      |
+----+-------+--------+

We want to write a query that takes two customer IDs, senderCustomerId and receiverCustomerId, as input and returns the region ID the two customers share if they are in the same region. The query should return at most one row.

The solution uses a Common Table Expression (CTE) together with aggregate functions, and looks at why a windowing function is not a drop-in alternative.

Why SQL Doesn’t Have a Single Row Aggregation Function

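SQL does not have a built-in “single row” aggregation function, that is, one that returns a group's value only when the group contains exactly one distinct value and NULL otherwise, unlike some other programming environments. However, we can build an equivalent by wrapping the MIN (or MAX) function in a CASE WHEN COUNT() expression inside a CTE or derived table: once the count confirms that only one distinct value is present, MIN returns that value.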
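As a minimal sketch of that MIN plus CASE WHEN COUNT() idea, the snippet below runs it against the sample Customers data. It assumes SQLite driven from Python's sqlite3 module; the helper names make_db and single_region are hypothetical, and the ? placeholders stand in for the two customer ID parameters.

```python
import sqlite3

# Hypothetical helper: in-memory copy of the article's Customers table.
def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Customers (id INTEGER PRIMARY KEY, name TEXT, region INTEGER)"
    )
    conn.executemany(
        "INSERT INTO Customers (id, name, region) VALUES (?, ?, ?)",
        [(1, "peter", 1), (2, "henry", 1), (3, "john", 2)],
    )
    return conn

# MIN(region) is only "the" region when exactly one distinct value exists,
# so the CASE guards it with COUNT(DISTINCT region).
SINGLE_REGION_SQL = """
SELECT CASE WHEN COUNT(DISTINCT region) = 1 THEN MIN(region) END AS SingleRegion
FROM Customers
WHERE id IN (?, ?)
"""

def single_region(conn, id_a, id_b):
    return conn.execute(SINGLE_REGION_SQL, (id_a, id_b)).fetchone()[0]

conn = make_db()
print(single_region(conn, 1, 2))  # 1: both customers are in region 1
print(single_region(conn, 1, 3))  # None: two distinct regions, so the CASE yields NULL
```

This guard alone does not notice a missing customer: if only one of the two IDs exists, there is still exactly one distinct region. That is why the row count needs to be checked as well.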

Windowing Functions and GROUP BY

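Unfortunately, windowing functions cannot simply be mixed with aggregates, despite being similar in purpose. A windowing function is evaluated after grouping and aggregation, so in an aggregate query it can reference only grouped columns and aggregate results, not the underlying rows. This follows from the order of evaluation that ISO SQL defines.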

However, we can still use ordinary aggregation functions, like MIN or MAX, in a SELECT statement without grouping by any columns. In that case the whole filtered result is treated as a single group, and the query returns exactly one row.
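The single-group behaviour can be checked directly. The sketch below again assumes SQLite through Python's sqlite3 module and the sample data from the introduction; aggregate_rows is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers (id INTEGER PRIMARY KEY, name TEXT, region INTEGER)"
)
conn.executemany(
    "INSERT INTO Customers (id, name, region) VALUES (?, ?, ?)",
    [(1, "peter", 1), (2, "henry", 1), (3, "john", 2)],
)

# Aggregates with no GROUP BY: the filtered rows form one implicit group.
AGG_SQL = """
SELECT COUNT(*), COUNT(DISTINCT region), MIN(region)
FROM Customers
WHERE id IN (?, ?)
"""

def aggregate_rows(id_a, id_b):
    return conn.execute(AGG_SQL, (id_a, id_b)).fetchall()

print(aggregate_rows(1, 3))    # [(2, 2, 1)]
print(aggregate_rows(98, 99))  # [(0, 0, None)]: still one row when nothing matches
```

Because the result is always a single row, even when no customer matches, a query built this way cannot return more than one row.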

Solving the Problem

To solve this problem, we need to query the Customers table for both the sender and receiver IDs and verify that the two rows have the same region ID. We can use a CTE to count the matching customers (there should be two) and the distinct regions among them (there should be one).

Here’s an example query that accomplishes this:

WITH q AS (
    SELECT
        COUNT(*) AS CountCustomers,                      -- 2 when both customers exist
        COUNT(DISTINCT region) AS CountDistinctRegions,  -- 1 when they share a region
        MIN(region) AS SingleRegion                      -- the shared region when only one is present
--      FIRST_VALUE(region) OVER (ORDER BY region) cannot be combined with the aggregates above
    FROM Customers c
    WHERE c.id = $senderCustomerId OR c.id = $receiverCustomerId
)
SELECT
    CASE WHEN q.CountCustomers = 2 AND q.CountDistinctRegions = 1 THEN 'OK' ELSE 'BAD' END AS "Status",
    CASE WHEN q.CountCustomers = 2 AND q.CountDistinctRegions = 1 THEN q.SingleRegion ELSE NULL END AS SingleRegion
FROM q

This query uses a CTE to count the customers that match the two IDs and the distinct regions among them. If exactly two customers are found and they share a single region, it returns OK; otherwise it returns BAD. The SingleRegion column holds the shared region under the same condition and is NULL otherwise.
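To try the approach end to end, the sketch below runs a version of the query against the sample data. It assumes SQLite through Python's sqlite3 module, that the key column of Customers is id as in the sample table, and that a match means two rows sharing exactly one distinct region. The named parameters :senderCustomerId and :receiverCustomerId stand in for the $ placeholders, and check_regions is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers (id INTEGER PRIMARY KEY, name TEXT, region INTEGER)"
)
conn.executemany(
    "INSERT INTO Customers (id, name, region) VALUES (?, ?, ?)",
    [(1, "peter", 1), (2, "henry", 1), (3, "john", 2)],
)

CHECK_SQL = """
WITH q AS (
    SELECT
        COUNT(*) AS CountCustomers,
        COUNT(DISTINCT region) AS CountDistinctRegions,
        MIN(region) AS SingleRegion
    FROM Customers c
    WHERE c.id = :senderCustomerId OR c.id = :receiverCustomerId
)
SELECT
    CASE WHEN q.CountCustomers = 2 AND q.CountDistinctRegions = 1
         THEN 'OK' ELSE 'BAD' END AS Status,
    CASE WHEN q.CountCustomers = 2 AND q.CountDistinctRegions = 1
         THEN q.SingleRegion END AS SingleRegion
FROM q
"""

def check_regions(sender_id, receiver_id):
    params = {"senderCustomerId": sender_id, "receiverCustomerId": receiver_id}
    return conn.execute(CHECK_SQL, params).fetchone()

print(check_regions(1, 2))   # ('OK', 1): peter and henry are both in region 1
print(check_regions(1, 3))   # ('BAD', None): regions 1 and 2 differ
print(check_regions(1, 99))  # ('BAD', None): customer 99 does not exist
```

Passing the same ID twice also yields BAD, because only one row matches and CountCustomers is 1.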

Explanation

Let’s break down the query step by step:

We create a CTE named q that reduces the Customers rows matching either ID to a single row of summary values.
In the CTE, we use COUNT(*) to count the rows that match either ID. The count is 2 only when both customers exist and the two IDs differ.
We use COUNT(DISTINCT region) to count the distinct regions among those rows. The count is 1 only when every matched customer is in the same region.
We use MIN(region) to get the region value itself: when only one distinct region is present, the minimum is that region. FIRST_VALUE(region) OVER (ORDER BY region) would pick the same value, but a windowing function cannot read the un-aggregated region column in a query that also uses aggregates, so stricter database engines reject it.
In the outer query, we use a CASE expression to check that there are exactly two customers and exactly one distinct region. If so, it returns OK, otherwise it returns BAD.
We also use a second CASE expression to get the single region value, which is returned only under the same condition and is NULL otherwise.
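The problem statement asks for at most one row, while a status-style query always returns exactly one. If an empty result is preferred when the customers do not match, one possible variant (an assumption, not part of the original query) keeps the CTE and moves the condition into a WHERE clause on it. The sketch assumes SQLite through Python's sqlite3 module; shared_region_rows is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers (id INTEGER PRIMARY KEY, name TEXT, region INTEGER)"
)
conn.executemany(
    "INSERT INTO Customers (id, name, region) VALUES (?, ?, ?)",
    [(1, "peter", 1), (2, "henry", 1), (3, "john", 2)],
)

# The CTE always yields one summary row; the outer WHERE keeps it only
# when both customers exist and share a single region.
MATCH_SQL = """
WITH q AS (
    SELECT
        COUNT(*) AS CountCustomers,
        COUNT(DISTINCT region) AS CountDistinctRegions,
        MIN(region) AS SingleRegion
    FROM Customers c
    WHERE c.id = :senderCustomerId OR c.id = :receiverCustomerId
)
SELECT q.SingleRegion
FROM q
WHERE q.CountCustomers = 2 AND q.CountDistinctRegions = 1
"""

def shared_region_rows(sender_id, receiver_id):
    params = {"senderCustomerId": sender_id, "receiverCustomerId": receiver_id}
    return conn.execute(MATCH_SQL, params).fetchall()

print(shared_region_rows(1, 2))  # [(1,)]: one row with the shared region
print(shared_region_rows(1, 3))  # []: no row when the regions differ
```

Which form is preferable depends on the caller: the status column makes the failure explicit, while the filtered form lets the caller test only for the presence of a row.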

Conclusion

The query combines a CTE with ordinary aggregate functions to solve the problem. Because an aggregate query without GROUP BY yields a single row, it never returns more than one row, and that row carries the region ID of both customers only if they both exist and are in the same region.

Last modified on 2024-08-08