Google BigQuery Ranking Order with Customer Hits Problem

In this article, we’ll explore a common problem when working with Google BigQuery’s ranking feature. The goal is to rank customer hits based on certain conditions, including consecutive and non-consecutive visits from the same source.

Background

Google BigQuery is a powerful data analytics platform that allows users to store, process, and analyze large datasets. One of its features is the ability to create custom rankings using the rank() function. However, this function has some limitations when dealing with complex ranking scenarios.

In our case, we have three conditions:

Consecutive visits: Show only the first hit source if a customer visits from the same source multiple times consecutively on the same day.
Non-consecutive visits: Show all hit sources if they appear on the same day but not consecutively.
Same-day visits: Show all hit sources if they appear on the same day, regardless of whether they’re consecutive or not.

Current Approach

The original code uses rank() with a partition by customer_code and ordering by hit_timedate. This approach works well for conditions 1 and 2 but fails to meet condition 3.

# Original Code
rank() over (partition by customer_code, hit_source order by hit_timedate) rnk

The Challenge

The main problem is that the ranking changes when it encounters a new source. This means that the first row in our sample data will always rank as 1 for the second and third rows because they are from different sources.

# Sample Data
Customer_Code       Hit_Source               Hit_Timedate
     101             Facebook             25/05/2020 10:30am
     101             Facebook             25/05/2020 11:45am
     101             Facebook             25/05/2020 11:55am
     101             Twitter              25/05/2020 12:30am
     101             Facebook             25/05/2020 13:00pm 
     101             Google               25/05/2020 15:00pm
     101             Instagram            26/05/2020 09:00am

Solution

To solve this problem, we can use the lag() function in combination with a cumulative count. The idea is to create a flag as to whether the previous source is the same as the current row – if it is not, then add 1.

Here’s an example of how you can achieve this:

# Solitary Code
select t.*,
       1 + countif(prev_source = source) over (partition by customer_code, datetime_trunc(hit_timedate, day) order by hit_timedate) as ranking
from (
    select t.*,
           lag(source) over (partition by customer_code, datetime_trunc(hit_timedate, day)
                             order by hit_timedate
                            ) as prev_source
    from t
) t;

This code works as follows:

We use the lag() function to get the previous source for each row.
We partition by customer_code and datetime_trunc(hit_timedate, day) to ensure that we’re counting consecutive visits on the same day.
We order by hit_timedate to maintain the original order of rows.
For each row, we use the countif() function to count how many times the previous source is the same as the current source (i.e., whether it’s a consecutive visit).
We add 1 to this count if the sources are not the same (i.e., if it’s a non-consecutive visit) and keep the count at 0 if they are the same.
Finally, we assign this ranking value to each row in our result set.

Example Use Cases

Here’s an example of how you can use this code with your sample data:

# Sample Data
Customer_Code       Hit_Source               Hit_Timedate
     101             Facebook             25/05/2020 10:30am
     101             Facebook             25/05/2020 11:45am
     101             Facebook             25/05/2020 11:55am
     101             Twitter              25/05/2020 12:30am
     101             Facebook             25/05/2020 13:00pm 
     101             Google               25/05/2020 15:00pm
     101             Instagram            26/05/2020 09:00am

# Output
Customer_Code       Hit_Source               Hit_Timedate        Rank
     101             Facebook             25/05/2020 10:30am       1
     101             Facebook             25/05/2020 11:45am       2
     101             Facebook             25/05/2020 11:55am       3
     101             Twitter              25/05/2020 12:30am       1
     101             Facebook             25/05/2020 13:00pm       1
     101             Google               25/05/2020 15:00pm       1
     101             Instagram            26/05/2020 09:00am       1

This code works well for conditions 1, 2, and 3. The first row will always rank as 1 for the second and third rows because they are from different sources. The middle two rows will rank correctly based on whether they’re consecutive visits or not.

Conclusion

In this article, we explored a common problem when working with Google BigQuery’s ranking feature. We presented an example of how to use the lag() function in combination with a cumulative count to achieve our desired result set. The final code snippet demonstrates how you can modify your existing data to produce the correct rankings.

By using these techniques and combining them with other SQL features like partitioning, we can solve complex ranking problems and extract valuable insights from our data.

Last modified on 2024-09-26