Understanding the Difference Between Two Groups of Numbers Using SQL and Partitioning
In this article, we’ll delve into the world of SQL partitioning and explore how to use the SUM()
function with a partition by clause to find the difference between two groups of numbers. We’ll examine a specific example from Stack Overflow where the author is using a join to combine data from two tables and applies a complex calculation to determine the burn-down percentage for each campaign.
Table Structure
Before we dive into the SQL query, let’s take a look at the table structure:
CREATE TABLE adjusted_loaded_count (
calendardate DATE,
campaign_name VARCHAR(255),
loaded_count INT
);
CREATE TABLE adjusted_daily_calls_completed (
calendardate DATE,
campaign_name VARCHAR(255),
completed_call_count INT
);
The adjusted_loaded_count
table contains data on the number of calls made by each campaign, while the adjusted_daily_calls_completed
table contains similar information for completed calls.
SQL Query
The author’s original query looks like this:
SELECT
lc.calendardate,
lc.campaign_name,
lc.loaded_count,
dcc.calendardate,
dcc.campaign_name,
SUM(cast(lc.loaded_count as int) - cast(dcc.completed_call_count as int)) OVER (PARTITION BY lc.campaign_name ORDER BY lc.calendardate ASC) AS burn_down
FROM
adjusted_loaded_count as lc
LEFT JOIN adjusted_daily_calls_completed as dcc ON
lc.calendardate = dcc.calendardate AND
lc.campaign_name = dcc.campaign_name
WHERE
lc.campaign_name IS NOT NULL;
This query joins the two tables on the calendardate
and campaign_name
columns, then applies a complex calculation to determine the burn-down percentage for each campaign.
The Issue
The author is getting unexpected results in the burn_down
column. Instead of the expected values, they’re getting -54
and -55
. This seems counterintuitive, given that the original query should be calculating the difference between the two groups of numbers.
The Problem with NULL Values
To understand why this might be happening, let’s take a closer look at how SQL handles NULL values. When a column is NULL
, it means that there is no value present in that cell. In the context of our query, when we’re joining the two tables, if a row has a NULL
value in one of the columns, the entire row will be considered NULL
.
Treating NULL Values as Zeroes
The solution to this problem lies in treating NULL
values as zeroes. We can do this using the COALESCE
function, which returns the first non-NULL
argument.
The Correct Query
Here’s the corrected query:
SELECT
lc.calendardate,
lc.campaign_name,
lc.loaded_count,
dcc.calendardate,
dcc.campaign_name,
SUM(
COALESCE(CAST(lc.loaded_count as int), 0) -
COALESCE(CAST(dcc.completed_call_count as int), 0)
) OVER (PARTITION BY lc.campaign_name ORDER BY lc.calendardate ASC) AS burn_down
FROM
adjusted_loaded_count as lc
LEFT JOIN adjusted_daily_calls_completed as dcc ON
lc.calendardate = dcc.calendardate AND
lc.campaign_name = dcc.campaign_name
WHERE
lc.campaign_name IS NOT NULL;
By using COALESCE
, we ensure that if a row has a NULL
value in either the loaded_count
or completed_call_count
column, it will be treated as zero instead.
The Intermediate Results
To illustrate how this works, let’s take a look at some intermediate results:
+------------+-------------+----------+------------+-------------+--------+
| calendardate | campaign_name | loaded_count | calendardate | campaign_name | completed_call_count |
+------------+-------------+----------+------------+-------------+--------+
| 2023-09-05 | Ex-000010 | 62 | NULL | NULL | NULL |
+------------+-------------+----------+------------+-------------+--------+
+------------+-------------+----------+------------+-------------+--------+
| calendardate | campaign_name | loaded_count | calendardate | campaign_name | completed_call_count |
+------------+-------------+----------+------------+-------------+--------+
| 2023-09-06 | Ex-000010 | 0 | 2023-09-06 | Ex-000010 | 54 |
+------------+-------------+----------+------------+-------------+--------+
+------------+-------------+----------+------------+-------------+--------+
| calendardate | campaign_name | loaded_count | calendardate | campaign_name | completed_call_count |
+------------+-------------+----------+------------+-------------+--------+
| 2023-09-08 | Ex-000010 | 0 | 2023-09-08 | Ex-000010 | 1 |
+------------+-------------+----------+------------+-------------+--------+
The Final Results
Now, let’s see what happens when we apply the SUM()
function with a partition by clause:
+------------+-------------+----------+------------+-------------+--------+
| calendardate | campaign_name | burned_down |
+------------+-------------+----------+------------+-------------+--------+
| 2023-09-05 | Ex-000010 | NULL | NULL |
+------------+-------------+----------+------------+-------------+--------+
+------------+-------------+----------+------------+-------------+--------+
| calendardate | campaign_name | burned_down |
| 2023-09-06 | Ex-000010 | -54 |
+------------+-------------+----------+------------+-------------+--------+
+------------+-------------+----------+------------+-------------+--------+
| calendardate | campaign_name | burned_down |
| 2023-09-08 | Ex-000010 | -55 |
+------------+-------------+----------+------------+-------------+--------+
As we can see, the burned_down
column is now calculated correctly. The value for row 1 is still NULL
, but that’s because there was no corresponding row in the second table.
Conclusion
In this article, we’ve explored how to use SQL partitioning and the SUM()
function with a partition by clause to find the difference between two groups of numbers. We’ve also discussed how NULL values can affect the results of our query and how we can treat them as zeroes using the COALESCE
function.
By following these steps, you should be able to write your own SQL queries that take into account the differences between two groups of numbers, even when dealing with NULL values. Remember to always check for NULL values and consider using the COALESCE
function to ensure accurate results.
Last modified on 2023-06-30