Optimizing BigQuery Queries: Extracting the Last Amount Value by Stage Using an Array Trick

Understanding the Problem and Current Solution

The problem is a SQL query against a BigQuery table that records a customer’s lifecycle. The goal is to return the last amount value for each “island”, or stage, within that lifecycle, meaning the amount on the most recently created row of the stage.

Current Attempt and Issues

The original attempt uses several techniques, including:

  • Using ROW_NUMBER() with partitioning by ID and Stage
  • Calculating Start Date using MIN(CreatedDate) OVER (PARTITION BY WindowId, ReverseWindowId)
  • Calculating End Date using NULLIF(MAX(IFNULL(EndDate, '9999-12-31')) OVER(PARTITION BY WindowId, ReverseWindowId), '9999-12-31')
  • Using SELECT DISTINCT instead of GROUP BY

These techniques establish where each stage starts and ends, but on their own they do not return the amount from the most recent row of the stage. What is needed is a compact way to pick the last amount value for each stage in the same pass.

Proposed Solution

To solve this problem, we will use an array trick in BigQuery. It returns the most recent amount value for each stage in the same aggregation that computes the dates, so no separate ranking step or self-join is needed.

Array Trick for Extracting Last Value

The array trick aggregates the amounts of a stage into an array ordered from newest to oldest, keeps only the first element with LIMIT 1, and then reads that element out of the array. The result is the last amount value for each stage.

-- Assumes BQ_TABLE has the columns ID, Stage, CreatedDate, EndDate and Amount,
-- and that each Stage occurs as a single run of rows per ID.
WITH CTE AS (
    SELECT t.ID,
           t.Stage,
           MIN(t.CreatedDate) AS StartDate,
           -- MAX ignores NULLs, so map an open-ended row to a sentinel date first
           MAX(IFNULL(t.EndDate, '9999-12-31')) AS EndDate,
           -- Newest amount first; LIMIT 1 keeps only that element
           ARRAY_AGG(t.Amount ORDER BY t.CreatedDate DESC LIMIT 1) AS AmountArray
    FROM `BQ_TABLE` t
    GROUP BY t.ID, t.Stage
)
SELECT ID,
       Stage,
       StartDate,
       -- Turn the sentinel back into NULL for stages that are still open
       NULLIF(EndDate, '9999-12-31') AS EndDate,
       -- SAFE_ORDINAL is 1-based and yields NULL if the position is out of range
       AmountArray[SAFE_ORDINAL(1)] AS LastAmount
FROM CTE
ORDER BY ID, StartDate;
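
To try the array trick without a real table, the sketch below inlines a few rows with UNNEST. The Sample CTE, its values, and the LastAmount alias are hypothetical and only serve the illustration; the ARRAY_AGG expression is the part that carries over to a real query.

```sql
-- Hypothetical sample rows standing in for the real table
WITH Sample AS (
    SELECT *
    FROM UNNEST([
        STRUCT(1 AS ID, 'Lead' AS Stage, DATE '2024-01-01' AS CreatedDate, 100 AS Amount),
        (1, 'Lead',  DATE '2024-01-05', 150),
        (1, 'Trial', DATE '2024-01-10', 200),
        (1, 'Trial', DATE '2024-01-20', 250)
    ])
)
SELECT ID,
       Stage,
       -- Order newest first, keep one element, then read it out of the array
       ARRAY_AGG(Amount ORDER BY CreatedDate DESC LIMIT 1)[SAFE_ORDINAL(1)] AS LastAmount
FROM Sample
GROUP BY ID, Stage
ORDER BY ID, Stage;
-- Lead  -> 150 (row created 2024-01-05)
-- Trial -> 250 (row created 2024-01-20)
```

Because the ordering happens inside the aggregate, the latest amount comes back in the same GROUP BY that would compute the start and end dates.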

How the Solution Works

  • We first create a Common Table Expression (CTE) that groups the rows by ID and Stage and calculates the StartDate of each group as MIN(CreatedDate).
  • The EndDate needs a trick to handle null values. MAX ignores NULLs, so a row whose EndDate is still empty would be hidden by any earlier, non-null date. We substitute '9999-12-31' as a default value for a missing EndDate before taking the maximum, and convert that sentinel back to NULL with NULLIF afterwards.
  • Next, we collect the amount values for each stage into an array, ordered by CreatedDate descending, using ARRAY_AGG(t.Amount ORDER BY t.CreatedDate DESC LIMIT 1). Because of LIMIT 1, the array holds only the most recent amount.
  • We then read that single element with AmountArray[SAFE_ORDINAL(1)]. SAFE_ORDINAL(1) is a 1-based array subscript that returns NULL instead of raising an error when the position is out of range.
  • Finally, we select the required columns from the CTE and order the results by ID and StartDate.
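
If a customer can return to a stage they have already left, grouping by ID and Stage alone would merge both visits into one row. A common way to keep them apart is the row-number-difference technique for gaps and islands, sketched below on hypothetical inline rows. The Sample and Marked CTEs and the IslandId column are assumed names for illustration, not part of the original query.

```sql
-- Hypothetical rows: customer 1 returns to 'Lead' after 'Trial'
WITH Sample AS (
    SELECT *
    FROM UNNEST([
        STRUCT(1 AS ID, 'Lead' AS Stage, DATE '2024-01-01' AS CreatedDate, 100 AS Amount),
        (1, 'Lead',  DATE '2024-01-05', 150),
        (1, 'Trial', DATE '2024-01-10', 200),
        (1, 'Lead',  DATE '2024-02-01', 300)
    ])
),
Marked AS (
    SELECT *,
           -- The difference stays constant within one unbroken run of a stage
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY CreatedDate)
         - ROW_NUMBER() OVER (PARTITION BY ID, Stage ORDER BY CreatedDate) AS IslandId
    FROM Sample
)
SELECT ID,
       Stage,
       MIN(CreatedDate) AS StartDate,
       ARRAY_AGG(Amount ORDER BY CreatedDate DESC LIMIT 1)[SAFE_ORDINAL(1)] AS LastAmount
FROM Marked
GROUP BY ID, Stage, IslandId
ORDER BY ID, StartDate;
-- Lead,  2024-01-01 -> 150
-- Trial, 2024-01-10 -> 200
-- Lead,  2024-02-01 -> 300
```

The sketch assumes CreatedDate is unique per ID; with ties, ROW_NUMBER assigns positions arbitrarily and the island boundaries may shift.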

Conclusion

The proposed solution uses an array trick to extract the last amount value for each stage in the customer lifecycle. ARRAY_AGG with ORDER BY and LIMIT 1 picks the most recent amount inside the same aggregation that computes the start and end dates, so the query needs no extra ranking subquery or self-join.


Last modified on 2024-05-15