Counting Users by Build and Day Using SQL and Grouped Aggregates: A Solution for Line Charting Historical Data

Charting how usage shifts between application builds over time is a common reporting task, and most of the work can be done in the database. In this article, we will use ClickHouse SQL to count users by build and day and to turn those counts into percentages, producing the data for a line chart that shows each build's share of usage over time.

Understanding the Problem

The scenario is this: historical event data is available, and the goal is a line chart with the date on the X-axis and the percentage of usage on the Y-axis, with one line per build of the application. For each day, the chart should show what share of users ran each build. The shares for a given day must add up to 100%, so that the lines sum to 100% at every point in time.

Initial SQL Query Attempt

The original query aggregates the snowplow_enricher_good table by version, build, app_id, and day in a subquery, counting distinct users per group. The outer query then adds a scalar subquery that counts distinct users over the whole table as Total, and divides each group's user count by that number to obtain a percentage. The functions used here, such as visitParamExtractString, are specific to ClickHouse.

SELECT DISTINCT 
    sub.Version, 
    sub.Build,     
    sub.app_id, 
    sub.Users, 
    sub.`day`,
    (
        SELECT COUNT(DISTINCT user_id)
        FROM snowplow_enricher_good seg
    ) AS Total,
    (sub.Users/Total) * 100 AS Percent
FROM 
(
    SELECT
        visitParamExtractString(seg.contexts, 'version') AS Version,
        visitParamExtractString(seg.contexts, 'build') AS Build,
        seg.app_id,
        seg.`day`,
        CONCAT(
            Version, 
            ' (', 
            Build, 
            ')'
        ) AS AppBuildVersion,
        COUNT(DISTINCT seg.user_id) AS Users
    FROM snowplow_enricher_good seg
    GROUP BY Version, Build, app_id, `day`
    ORDER BY Users DESC
) AS sub
WHERE sub.app_id = 'APPID';

Limitations of the Initial Query

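The initial query has three problems that keep it from producing the desired result. First, the scalar subquery that computes Total counts distinct users across the entire table, with no restriction to a day or to the app_id being charted, so every row is divided by the same table-wide number and the percentages for a given day do not add up to 100%. Second, even a per-day COUNT(DISTINCT user_id) would not be the right denominator: a user who runs two builds on the same day is counted once in the total but once in each build's Users, so the shares could exceed 100%. The denominator has to be the sum of the per-build counts for that day. Third, SELECT DISTINCT and the ORDER BY inside the subquery do no useful work, since GROUP BY already yields one row per group and the outer query is not guaranteed to preserve the inner ordering.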
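
The effect of the denominator can be checked with a few lines of Python. This is only a sketch of the arithmetic, not of the query itself: the rows are made-up per-day, per-build counts, and the single grand total stands in for the table-wide Total of the initial query.

```python
from collections import defaultdict

# Hypothetical (day, build, users) rows; the numbers are illustrative only.
rows = [
    ("2024-01-01", "1.0 (100)", 30),
    ("2024-01-01", "1.1 (110)", 10),
    ("2024-01-02", "1.0 (100)", 20),
    ("2024-01-02", "1.1 (110)", 40),
]


def shares_global(rows):
    """Divide every count by one grand total (stand-in for the table-wide Total)."""
    total = sum(users for _, _, users in rows)
    return [(day, build, users / total * 100) for day, build, users in rows]


def shares_per_day(rows):
    """Divide every count by the sum of the counts for the same day."""
    totals = defaultdict(int)
    for day, _, users in rows:
        totals[day] += users
    return [(day, build, users / totals[day] * 100) for day, build, users in rows]


def sum_by_day(shares):
    """Add up the percentages of all builds for each day."""
    sums = defaultdict(float)
    for day, _, pct in shares:
        sums[day] += pct
    return dict(sums)


print(sum_by_day(shares_global(rows)))   # per-day sums are well below 100
print(sum_by_day(shares_per_day(rows)))  # per-day sums are 100 (up to rounding)
```

With a single grand total, each day's lines add up to that day's share of all traffic, not to 100%. With a per-day total they add up to 100% by construction.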

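Solution Using the groupArray Function

To overcome these limitations, we can use ClickHouse's groupArray aggregate function together with ARRAY JOIN. An aggregating query computes the total with sum() while groupArray collects the individual rows into an array. ARRAY JOIN then expands that array back into rows, each carrying the total, so that every value can be divided by the total sum to obtain its percentage. The example below uses five sample (tag, value) pairs in place of real data.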

SELECT
    totalCnt,
    totalSum,
    ga.1 AS tag,
    ga.2 AS value,
    (value / totalSum) * 100 AS percent
FROM
(
    SELECT
        count() AS totalCnt,
        sum(value) AS totalSum,
        groupArray((tag, value)) AS ga
    FROM
    (
        SELECT
            tag,
            value
        FROM
        (
            SELECT
                [1, 2, 3, 4, 5] AS tag,
                [10, 100, 50, 100, 40] AS value
        )
        ARRAY JOIN
            tag,
            value
    )
)
ARRAY JOIN ga
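
For readers less familiar with ARRAY JOIN and groupArray, the same data flow can be mirrored in plain Python. This is a rough analogy under the assumption that each query layer maps to one step below; it is not how ClickHouse executes the query.

```python
# Sample arrays from the query above.
tags = [1, 2, 3, 4, 5]
values = [10, 100, 50, 100, 40]

# Innermost ARRAY JOIN: one (tag, value) row per array position.
rows = list(zip(tags, values))

# Middle query: collapse to a single row, keeping the originals
# in an array of tuples (the role groupArray plays).
total_cnt = len(rows)
total_sum = sum(value for _, value in rows)
ga = list(rows)

# Outer ARRAY JOIN ga: expand the array back into rows,
# each of which now carries the totals.
result = [
    {
        "totalCnt": total_cnt,
        "totalSum": total_sum,
        "tag": tag,
        "value": value,
        "percent": value / total_sum * 100,
    }
    for tag, value in ga
]

for row in result:
    print(row)
```

The point of the round trip is that the total is computed once, in the same pass that collects the rows, and is then available on every row without a join or a second scan.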

Understanding the Solution

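The query works in three layers. The innermost query uses ARRAY JOIN to unfold the sample arrays into one row per (tag, value) pair. The middle query aggregates those rows into a single row: count() gives the number of rows, sum(value) gives the total, and groupArray((tag, value)) keeps the original rows as an array of tuples so that they survive the aggregation. The outer ARRAY JOIN ga unfolds that array back into rows, each of which now carries totalSum, and the outer SELECT divides value by totalSum to obtain the percentage.

The same result can be obtained with scalar subqueries that compute the totals separately, at the cost of repeating the source expression:

SELECT
    (
        -- number of sample rows
        SELECT count()
        FROM (SELECT arrayJoin([10, 100, 50, 100, 40]) AS value)
    ) AS totalCnt,
    (
        -- sum of all sample values
        SELECT sum(value)
        FROM (SELECT arrayJoin([10, 100, 50, 100, 40]) AS value)
    ) AS totalSum,
    tag,
    value,
    (value / totalSum) * 100 AS percent
FROM
(
    SELECT
        [1, 2, 3, 4, 5] AS tag,
        [10, 100, 50, 100, 40] AS value
)
ARRAY JOIN
    tag,
    value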

Implementation and Example

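To apply the pattern to the real table, replace the sample arrays with the per-day, per-build aggregation and group the middle query by day, so that each day gets its own total. The query below is a sketch against the snowplow_enricher_good table and the columns used in the initial query; it has not been run against real data and may need adjusting to your schema.

SELECT
    `day`,
    ga.1 AS AppBuildVersion,
    ga.2 AS Users,
    totalUsers,
    (Users / totalUsers) * 100 AS Percent
FROM
(
    -- One row per day: the day's total plus its per-build rows as an array
    SELECT
        `day`,
        sum(Users) AS totalUsers,
        groupArray((AppBuildVersion, Users)) AS ga
    FROM
    (
        -- Distinct users per day and build for a single application
        SELECT
            `day`,
            concat(
                visitParamExtractString(contexts, 'version'),
                ' (',
                visitParamExtractString(contexts, 'build'),
                ')'
            ) AS AppBuildVersion,
            COUNT(DISTINCT user_id) AS Users
        FROM snowplow_enricher_good
        WHERE app_id = 'APPID'
        GROUP BY `day`, AppBuildVersion
    )
    GROUP BY `day`
)
ARRAY JOIN ga
ORDER BY `day`, AppBuildVersion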
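
Most charting tools expect one series per line, so rows of the form (day, build, percent) usually need to be reshaped before plotting. The following is a minimal sketch of that step; the function name to_series and the sample rows are hypothetical, and filling missing days with 0 is an assumption about how gaps should be drawn.

```python
from collections import defaultdict


def to_series(rows):
    """Reshape (day, build, percent) rows into one list of percentages per build.

    Days on which a build has no row are filled with 0.0.
    """
    days = sorted({day for day, _, _ in rows})
    index = {day: i for i, day in enumerate(days)}
    series = defaultdict(lambda: [0.0] * len(days))
    for day, build, pct in rows:
        series[build][index[day]] = pct
    return days, dict(series)


# Hypothetical query output; ISO dates sort correctly as strings.
sample_rows = [
    ("2024-01-02", "1.1 (110)", 100.0),
    ("2024-01-01", "1.0 (100)", 75.0),
    ("2024-01-01", "1.1 (110)", 25.0),
]

days, series = to_series(sample_rows)
print(days)
for build, points in series.items():
    print(build, points)
```

Each list in series lines up with days position by position, which is the shape a line chart with one line per build typically takes as input.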

Example Output

For the five sample pairs, the groupArray query returns one row per tag. The values sum to 300, and the percentages (rounded here to two decimal places) add up to 100%.

+----------+----------+-----+-------+---------+
| totalCnt | totalSum | tag | value | percent |
+----------+----------+-----+-------+---------+
| 5        | 300      | 1   | 10    | 3.33    |
| 5        | 300      | 2   | 100   | 33.33   |
| 5        | 300      | 3   | 50    | 16.67   |
| 5        | 300      | 4   | 100   | 33.33   |
| 5        | 300      | 5   | 40    | 13.33   |
+----------+----------+-----+-------+---------+

Conclusion

In this article, we explored how to use ClickHouse SQL to count users by build and day and to express those counts as percentages of usage over time. The initial query failed because it divided every count by a single table-wide total. The solution computes the total alongside the rows with sum() and groupArray, then uses ARRAY JOIN to expand the rows again so that each one can be divided by its group's total. Grouping that step by day yields percentages that add up to 100% for every day, which is the shape of data a line chart of historical usage needs.


Last modified on 2025-01-03