Conditional Groupby on Dates Using Pandas

Introduction

In this article, we will discuss how to perform a conditional groupby operation on dates using pandas. We’ll explore how to filter interactions based on their timestamps relative to customer segmentation data.

Data Preparation

Let’s assume we have two dataframes: customers and interactions. The customers dataframe contains information about customers, including their ID, customer segmentation (e.g., “happy,” “sad”), and timestamp of the survey. The interactions dataframe contains interactions with customers, such as service visits and phone calls.

import pandas as pd

# Generic example
customers = pd.DataFrame({"customerID":[1,1,1,2,2,3,4,4],"customerSeg":[1,2,2,1,2,3,3,3],"timestamp":['1999-01-01','2000-01-01','2000-06-01','2001-01-01','2003-01-01','1999-01-01','2005-01-01','2008-01-01']})

interactions = pd.DataFrame({"customerID":[1,1,1,1,2,2,2,2,4,4,4],"timestamp":['1999-07-01','1999-11-01','2000-03-01','2001-04-01','2000-12-01','2002-01-01','2004-03-01','2004-05-01','2000-01-01','2004-01-01','2009-01-01'],"service":[1,0,1,0,1,0,1,1,0,1,1],"phonecall":[0,1,1,1,1,1,0,1,1,0,1]})

Converting Timestamps to Datetime

To perform date-based operations, we need to convert the timestamp columns in both dataframes to datetime objects.

customers['timestamp'] = pd.to_datetime(customers['timestamp'])
interactions['timestamp'] = pd.to_datetime(interactions['timestamp'])

Creating a Start Date Column

We create a new column, start_date, which contains the start date (5 years before the timestamp).

customers['start_date'] = customers['timestamp'] - pd.DateOffset(years=5)

Joining Dataframes on Customer ID

We join the customers dataframe with the interactions dataframe on the customerID.

result = customers.merge(interactions, on='customerID', how='outer')

Filtering Interactions

To filter interactions based on their timestamps relative to customer segmentation data, we use the apply method.

result['service'] = result.apply(lambda x: x.service if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)
result['phonecall'] = result.apply(lambda x: x.phonecall if (x.timestamp_y >= x.start_date) and (x.timestamp_y <= x.timestamp_x) else 0, axis=1)

Grouping and Summing

Finally, we group the dataframe by customerID, timestamp_x, and customerSeg and sum up the service and phonecall interactions.

result = result.groupby(['customerID', 'timestamp_x', 'customerSeg'])[['service', 'phonecall']].sum()

Result

The resulting dataframe contains the filtered interactions, grouped by customer ID, timestamp, and customer segmentation.

                                    service  phonecall
customerID timestamp_x customerSeg                    
1          1999-01-01  1                0.0        0.0
           2000-01-01  2                1.0        1.0
           2000-06-01  2                2.0        2.0
2          2001-01-01  1                1.0        1.0
           2003-01-01  2                1.0        2.0
3          1999-01-01  3                0.0        0.0
4          2005-01-01  3                1.0        1.0
           2008-01-01  3                1.0        0.0

Conclusion

In this article, we demonstrated how to perform a conditional groupby operation on dates using pandas. We covered the following steps:

Converting timestamps to datetime objects
Creating a start date column based on customer segmentation data
Joining dataframes on customer ID
Filtering interactions based on their timestamps relative to customer segmentation data
Grouping and summing filtered interactions

By following these steps, you can perform complex date-based operations in pandas.

Last modified on 2024-07-11