Google BigQuery Ranking Order with Customer Hits Problem
In this article, we’ll explore a common problem when working with Google BigQuery’s ranking feature. The goal is to rank customer hits based on certain conditions, including consecutive and non-consecutive visits from the same source.
Background
Google BigQuery is a powerful data analytics platform that allows users to store, process, and analyze large datasets. One of its features is the ability to create custom rankings using the rank()
function. However, this function has some limitations when dealing with complex ranking scenarios.
In our case, we have three conditions:
- Consecutive visits: Show only the first hit source if a customer visits from the same source multiple times consecutively on the same day.
- Non-consecutive visits: Show all hit sources if they appear on the same day but not consecutively.
- Same-day visits: Show all hit sources if they appear on the same day, regardless of whether they’re consecutive or not.
Current Approach
The original code uses rank()
with a partition by customer_code
and ordering by hit_timedate
. This approach works well for conditions 1 and 2 but fails to meet condition 3.
# Original Code
rank() over (partition by customer_code, hit_source order by hit_timedate) rnk
The Challenge
The main problem is that the ranking changes when it encounters a new source. This means that the first row in our sample data will always rank as 1 for the second and third rows because they are from different sources.
# Sample Data
Customer_Code Hit_Source Hit_Timedate
101 Facebook 25/05/2020 10:30am
101 Facebook 25/05/2020 11:45am
101 Facebook 25/05/2020 11:55am
101 Twitter 25/05/2020 12:30am
101 Facebook 25/05/2020 13:00pm
101 Google 25/05/2020 15:00pm
101 Instagram 26/05/2020 09:00am
Solution
To solve this problem, we can use the lag()
function in combination with a cumulative count. The idea is to create a flag as to whether the previous source is the same as the current row – if it is not, then add 1.
Here’s an example of how you can achieve this:
# Solitary Code
select t.*,
1 + countif(prev_source = source) over (partition by customer_code, datetime_trunc(hit_timedate, day) order by hit_timedate) as ranking
from (
select t.*,
lag(source) over (partition by customer_code, datetime_trunc(hit_timedate, day)
order by hit_timedate
) as prev_source
from t
) t;
This code works as follows:
- We use the
lag()
function to get the previous source for each row. - We partition by
customer_code
anddatetime_trunc(hit_timedate, day)
to ensure that we’re counting consecutive visits on the same day. - We order by
hit_timedate
to maintain the original order of rows. - For each row, we use the
countif()
function to count how many times the previous source is the same as the current source (i.e., whether it’s a consecutive visit). - We add 1 to this count if the sources are not the same (i.e., if it’s a non-consecutive visit) and keep the count at 0 if they are the same.
- Finally, we assign this ranking value to each row in our result set.
Example Use Cases
Here’s an example of how you can use this code with your sample data:
# Sample Data
Customer_Code Hit_Source Hit_Timedate
101 Facebook 25/05/2020 10:30am
101 Facebook 25/05/2020 11:45am
101 Facebook 25/05/2020 11:55am
101 Twitter 25/05/2020 12:30am
101 Facebook 25/05/2020 13:00pm
101 Google 25/05/2020 15:00pm
101 Instagram 26/05/2020 09:00am
# Output
Customer_Code Hit_Source Hit_Timedate Rank
101 Facebook 25/05/2020 10:30am 1
101 Facebook 25/05/2020 11:45am 2
101 Facebook 25/05/2020 11:55am 3
101 Twitter 25/05/2020 12:30am 1
101 Facebook 25/05/2020 13:00pm 1
101 Google 25/05/2020 15:00pm 1
101 Instagram 26/05/2020 09:00am 1
This code works well for conditions 1, 2, and 3. The first row will always rank as 1 for the second and third rows because they are from different sources. The middle two rows will rank correctly based on whether they’re consecutive visits or not.
Conclusion
In this article, we explored a common problem when working with Google BigQuery’s ranking feature. We presented an example of how to use the lag()
function in combination with a cumulative count to achieve our desired result set. The final code snippet demonstrates how you can modify your existing data to produce the correct rankings.
By using these techniques and combining them with other SQL features like partitioning, we can solve complex ranking problems and extract valuable insights from our data.
Last modified on 2024-09-26