Conditional Groupby on Dates Using Pandas
Introduction
In this article, we will discuss how to perform a conditional groupby operation on dates using pandas. We’ll explore how to filter interactions based on their timestamps relative to customer segmentation data.
Data Preparation
Let’s assume we have two dataframes: customers
and interactions
. The customers
dataframe contains information about customers, including their ID, customer segmentation (e.g., “happy,” “sad”), and timestamp of the survey. The interactions
dataframe contains interactions with customers, such as service visits and phone calls.
import pandas as pd
# Generic example
customers = pd.DataFrame({"customerID":[1,1,1,2,2,3,4,4],"customerSeg":[1,2,2,1,2,3,3,3],"timestamp":['1999-01-01','2000-01-01','2000-06-01','2001-01-01','2003-01-01','1999-01-01','2005-01-01','2008-01-01']})
interactions = pd.DataFrame({"customerID":[1,1,1,1,2,2,2,2,4,4,4],"timestamp":['1999-07-01','1999-11-01','2000-03-01','2001-04-01','2000-12-01','2002-01-01','2004-03-01','2004-05-01','2000-01-01','2004-01-01','2009-01-01'],"service":[1,0,1,0,1,0,1,1,0,1,1],"phonecall":[0,1,1,1,1,1,0,1,1,0,1]})
Converting Timestamps to Datetime
To perform date-based operations, we need to convert the timestamp
columns in both dataframes to datetime objects.
customers['timestamp'] = pd.to_datetime(customers['timestamp'])
interactions['timestamp'] = pd.to_datetime(interactions['timestamp'])
Creating a Start Date Column
We create a new column, start_date
, which contains the start date (5 years before the timestamp).
customers['start_date'] = customers['timestamp'] - pd.DateOffset(years=5)
Joining Dataframes on Customer ID
We join the customers
dataframe with the interactions
dataframe on the customerID
.
result = customers.merge(interactions, on='customerID', how='outer')
Filtering Interactions
To filter interactions based on their timestamps relative to customer segmentation data, we use the apply
method.
result['service'] = result.apply(lambda x: x.service if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)
result['phonecall'] = result.apply(lambda x: x.phonecall if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)
Grouping and Summing
Finally, we group the dataframe by customerID
, timestamp_x
, and customerSeg
and sum up the service
and phonecall
interactions.
result = result.groupby(['customerID', 'timestamp_x', 'customerSeg'])[['service', 'phonecall']].sum()
Result
The resulting dataframe contains the filtered interactions, grouped by customer ID, timestamp, and customer segmentation.
service phonecall
customerID timestamp_x customerSeg
1 1999-01-01 1 0.0 0.0
2000-01-01 2 1.0 1.0
2000-06-01 2 2.0 2.0
2 2001-01-01 1 1.0 1.0
2003-01-01 2 1.0 2.0
3 1999-01-01 3 0.0 0.0
4 2005-01-01 3 1.0 1.0
2008-01-01 3 1.0 0.0
Conclusion
In this article, we demonstrated how to perform a conditional groupby operation on dates using pandas. We covered the following steps:
- Converting timestamps to datetime objects
- Creating a start date column based on customer segmentation data
- Joining dataframes on customer ID
- Filtering interactions based on their timestamps relative to customer segmentation data
- Grouping and summing filtered interactions
By following these steps, you can perform complex date-based operations in pandas.
Last modified on 2024-07-11