SQL Count with Grouped Aggregates: A Solution for Line Charting Historical Data
This article shows how to use SQL to count users by build and day, and how to turn those counts into per-day percentages that can be plotted as a line chart of usage over time. The queries use ClickHouse functions such as visitParamExtractString and groupArray.
Understanding the Problem
The scenario is this: historical usage data is available, and the goal is a line chart with the date on the X-axis and the percentage of usage on the Y-axis. Each line represents one build of an application, and its value on a given day is that build's share of the day's users. The shares for any one day must add up to 100%.
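The arithmetic behind the requirement can be sketched in a few lines of Python. The sample counts and build labels below are hypothetical and only illustrate the rule that each count is divided by the total for its own day.

```python
from collections import defaultdict

# Hypothetical sample: (day, build) -> distinct users seen that day.
users_per_build = {
    ("2019-01-01", "1.0 (100)"): 30,
    ("2019-01-01", "1.1 (110)"): 70,
    ("2019-01-02", "1.0 (100)"): 20,
    ("2019-01-02", "1.1 (110)"): 60,
    ("2019-01-02", "1.2 (120)"): 120,
}


def daily_shares(counts):
    """Return {(day, build): percent}; each day's percents sum to 100."""
    totals = defaultdict(int)
    for (day, _build), users in counts.items():
        totals[day] += users
    return {
        (day, build): users / totals[day] * 100
        for (day, build), users in counts.items()
    }


shares = daily_shares(users_per_build)
```

Because the denominator is the sum of the per-build counts for that day, the shares add up to 100% even if a user appears under two builds on the same day.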
Initial SQL Query Attempt
The original query attempts to solve this problem by selecting versions and builds from the snowplow_enricher_good table. It uses a scalar subquery that is meant to supply the total number of users and then divides each group's user count by that total to obtain the percentage.
SELECT DISTINCT
    sub.Version,
    sub.Build,
    sub.app_id,
    sub.Users,
    sub.`day`,
    (
        SELECT COUNT(DISTINCT user_id)
        FROM snowplow_enricher_good seg
    ) AS Total,
    (sub.Users/Total) * 100 AS Percent
FROM
(
    SELECT
        visitParamExtractString(seg.contexts, 'version') AS Version,
        visitParamExtractString(seg.contexts, 'build') AS Build,
        seg.app_id,
        seg.`day`,
        CONCAT(
            Version,
            ' (',
            Build,
            ')'
        ) AS AppBuildVersion,
        COUNT(DISTINCT seg.user_id) AS Users
    FROM snowplow_enricher_good seg
    GROUP BY Version, Build, app_id, `day`
    ORDER BY Users DESC
) AS sub
WHERE sub.app_id = 'APPID';
Limitations of the Initial Query
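The initial query has several limitations that prevent it from producing the desired results. First, the scalar subquery that computes Total counts distinct users across the entire table, with no filter on day or app_id, so every row is divided by the same global figure and the percentages for a given day do not sum to 100%. Second, SELECT DISTINCT adds nothing, because the inner query already returns one row per version, build, app_id, and day. Third, the AppBuildVersion column is computed but never selected, and the ORDER BY inside the subquery does not guarantee the order of the final result.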
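The effect of the wrong denominator is easy to reproduce outside the database. The following Python sketch uses made-up events (the day, build, and user identifiers are hypothetical) and compares a table-wide total with a per-day total.

```python
# Hypothetical (day, build, user_id) events.
events = [
    ("d1", "A", "u1"), ("d1", "A", "u2"), ("d1", "B", "u3"),
    ("d2", "A", "u4"), ("d2", "B", "u5"), ("d2", "B", "u6"),
]

# Distinct users per (day, build), like COUNT(DISTINCT user_id) with GROUP BY.
groups = {}
for day, build, user in events:
    groups.setdefault((day, build), set()).add(user)
counts = {key: len(users) for key, users in groups.items()}

# Denominator used by the initial query: distinct users in the whole table.
global_total = len({user for _, _, user in events})

# Denominator the chart needs: sum of the per-build counts for one day.
d1_total = sum(n for (day, _), n in counts.items() if day == "d1")

d1_sum_global = sum(
    n / global_total * 100 for (day, _), n in counts.items() if day == "d1"
)
d1_sum_per_day = sum(
    n / d1_total * 100 for (day, _), n in counts.items() if day == "d1"
)

print(round(d1_sum_global, 3), round(d1_sum_per_day, 3))
```

With the table-wide total, the lines for day d1 add up to only half of the chart height; with the per-day total they add up to the full 100%.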
Solution Using the groupArray Function
To overcome these limitations, we can use the groupArray function. The idea is to compute the total once, keep the individual rows packed in an array next to it, and then unpack that array so that each row can be divided by the total. The following query demonstrates the pattern on a small set of sample data:
SELECT
    totalCnt,
    totalSum,
    ga.1 AS tag,
    ga.2 AS value,
    (value / totalSum) * 100 AS percent
FROM
(
    SELECT
        count() AS totalCnt,
        sum(value) AS totalSum,
        groupArray((tag, value)) AS ga
    FROM
    (
        SELECT
            tag,
            value
        FROM
        (
            SELECT
                [1, 2, 3, 4, 5] AS tag,
                [10, 100, 50, 100, 40] AS value
        )
        ARRAY JOIN
            tag,
            value
    )
)
ARRAY JOIN ga
Understanding the Solution
The groupArray query works in three stages. The innermost query uses ARRAY JOIN to turn the two sample arrays into one row per (tag, value) pair. The middle query collapses those rows into a single row that holds the row count, the sum of all values, and, via groupArray, an array of (tag, value) tuples. The outer ARRAY JOIN ga then expands that array back into one row per tag, so every row carries totalSum and the percentage is simply value / totalSum * 100. Running the middle stage on its own makes the intermediate result visible:
-- Middle stage only: returns a single row with totalCnt = 5, totalSum = 300
-- and ga holding the five (tag, value) tuples.
SELECT
    count() AS totalCnt,
    sum(value) AS totalSum,
    groupArray((tag, value)) AS ga
FROM
(
    SELECT
        tag,
        value
    FROM
    (
        SELECT
            [1, 2, 3, 4, 5] AS tag,
            [10, 100, 50, 100, 40] AS value
    )
    ARRAY JOIN
        tag,
        value
)
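For readers less familiar with ARRAY JOIN and groupArray, the three stages of the groupArray query can be mirrored in plain Python. This is only an analogy using the same sample arrays, not a description of how ClickHouse executes the query.

```python
tags = [1, 2, 3, 4, 5]
values = [10, 100, 50, 100, 40]

# Stage 1 (ARRAY JOIN tag, value): one (tag, value) row per array position.
rows = list(zip(tags, values))

# Stage 2 (count, sum, groupArray): collapse to totals plus the packed rows.
total_cnt = len(rows)
total_sum = sum(value for _, value in rows)
ga = list(rows)

# Stage 3 (ARRAY JOIN ga): expand again, each row now carrying the totals.
result = [
    (total_cnt, total_sum, tag, value, value / total_sum * 100)
    for tag, value in ga
]

for row in result:
    print(row)
```

The point of packing the rows into an array is that the totals and the individual rows end up in the same result row, so no second scan of the source data is needed to compute the percentage.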
Implementation and Example
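To apply this pattern to the real data, replace the sample arrays with the per-build user counts from snowplow_enricher_good and group the middle stage by day, so that each day gets its own total. The query below is a sketch of that adaptation; it reuses the column names from the initial query and has not been run against the actual table.
SELECT
    `day`,
    ga.1 AS AppBuildVersion,
    ga.2 AS Users,
    (Users / totalUsers) * 100 AS Percent
FROM
(
    -- One row per day: the day's total plus its (build, users) pairs.
    SELECT
        `day`,
        sum(Users) AS totalUsers,
        groupArray((AppBuildVersion, Users)) AS ga
    FROM
    (
        -- Distinct users per build and day for a single application.
        SELECT
            `day`,
            CONCAT(
                visitParamExtractString(contexts, 'version'),
                ' (',
                visitParamExtractString(contexts, 'build'),
                ')'
            ) AS AppBuildVersion,
            COUNT(DISTINCT user_id) AS Users
        FROM snowplow_enricher_good
        WHERE app_id = 'APPID'
        GROUP BY `day`, AppBuildVersion
    )
    GROUP BY `day`
)
ARRAY JOIN ga
ORDER BY `day`, AppBuildVersion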
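Most charting tools expect one series per line rather than a flat list of rows. As a rough sketch, (day, build, percent) rows of the kind this article is aiming for could be pivoted as follows; the function name and the sample rows are hypothetical.

```python
def pivot_for_chart(rows):
    """Pivot a list of (day, build, percent) rows into chart series.

    Returns (days, series): days is the sorted X-axis, and series maps
    each build to a list of percentages aligned with days.
    """
    days = sorted({day for day, _, _ in rows})
    builds = sorted({build for _, build, _ in rows})
    lookup = {(day, build): pct for day, build, pct in rows}
    series = {
        # A build with no users on a given day contributes 0%.
        build: [lookup.get((day, build), 0.0) for day in days]
        for build in builds
    }
    return days, series


sample_rows = [
    ("2019-01-01", "1.0 (100)", 30.0),
    ("2019-01-01", "1.1 (110)", 70.0),
    ("2019-01-02", "1.1 (110)", 100.0),
]
days, series = pivot_for_chart(sample_rows)
```

Filling missing combinations with 0.0 keeps every series the same length and preserves the property that the lines sum to 100% on each day.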
Example Output
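Run against the sample arrays used in the groupArray query, the solution produces one row per tag. Each row carries the total count, the sum of all values, and that tag's percentage of the sum (rounded to three decimals here):
+----------+----------+-----+-------+---------+
| totalCnt | totalSum | tag | value | percent |
+----------+----------+-----+-------+---------+
|        5 |      300 |   1 |    10 |   3.333 |
|        5 |      300 |   2 |   100 |  33.333 |
|        5 |      300 |   3 |    50 |  16.667 |
|        5 |      300 |   4 |   100 |  33.333 |
|        5 |      300 |   5 |    40 |  13.333 |
+----------+----------+-----+-------+---------+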
Conclusion
In this article, we explored how to use SQL to count users by build and day so that the result can be drawn as a line chart of percentage usage over time. We discussed the limitations of the initial query, chiefly that it divides by a table-wide total instead of a per-day total, and implemented a solution that uses the groupArray function to keep each group's rows alongside their total count and sum. Because each row is divided by the total of its own group, the percentages within a group add up to 100%, which is the property a stacked view of historical usage requires.
Last modified on 2025-01-03