Handling Missing Years in Pandas: A Step-by-Step Guide to Determining Churn

Pandas - Determine if Churn occurs with missing years

Overview

In this article, we will discuss a common problem when working with time-series data in pandas: some ids have no rows at all for certain years. We’ll explore the challenges of determining if churn occurs when some years are missing and provide a solution using the complete function from pyjanitor and NumPy’s np.select.

Problem Statement

You have a large pandas DataFrame containing ids, years, spend values, and other columns. Some ids have no rows for certain years. You want to create a new column that classifies each year based on the next year’s spend: “increase”, “decrease”, or “churn” when there is no spend in the next year. However, an initial approach using the diff function fails because it doesn’t account for the missing years.

Initial Approach

Your initial code attempts to calculate the difference between consecutive rows using the diff function and then categorize the years based on this difference. However, this method has two issues:

  • diff compares each row with the adjacent row, not the adjacent year, so a jump from 2015 to 2018 is treated as if the two years were consecutive.
  • Because the gap is invisible, a year followed by missing years is labelled “increase” or “decrease” instead of “churn”.
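
The problem can be reproduced with a minimal sketch. The sample values below are an assumption for illustration (they reuse the ids and spend figures that appear in the outputs later in this article), and the column names naive_diff and year_gap are hypothetical helpers, not part of the original code.

```python
import pandas as pd

# Assumed sample data: id 2 has no rows for 2016 and 2017
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2],
    "year":  [2015, 2016, 2017, 2018, 2015, 2018],
    "spend": [321, 342, 843, 483, 234, 321],
})

# diff(-1) subtracts the next *row* in each group, whatever year it belongs to
df["naive_diff"] = df.groupby("id")["spend"].diff(-1)

# Distance in years between each row and the next row of the same id
df["year_gap"] = df.groupby("id")["year"].shift(-1) - df["year"]

print(df)
```

For id 2 in 2015, naive_diff is computed against the 2018 row (year_gap is 3), so the row would be read as an increase even though there is no spend in 2016.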

Solution Using pyjanitor

To address these challenges, we can use the complete function from pyjanitor. It turns implicitly missing rows (id/year combinations that are absent from the data) into explicit rows whose other columns are NaN, so that later operations can see the gaps.

# Import necessary libraries
import pandas as pd
import numpy as np
import janitor  # pyjanitor; importing it registers DataFrame.complete

# Create a sample DataFrame in which id 2 has no rows for 2016 and 2017
data = {
    'id': [1, 1, 1, 1, 2, 2],
    'year': [2015, 2016, 2017, 2018, 2015, 2018],
    'spend': [321, 342, 843, 483, 234, 321]
}
df = pd.DataFrame(data)

# Add a row for every id/year combination; spend is NaN in the added rows
temp_df = df.complete("id", "year")

# Sort so that consecutive rows within an id are consecutive years
temp_df = temp_df.sort_values(["id", "year"], ignore_index=True)

# Print the resulting DataFrame
print(temp_df)

Output:

| id | year | spend |
| ---: | ---: | ---: |
| 1 | 2015 | 321.0 |
| 1 | 2016 | 342.0 |
| 1 | 2017 | 843.0 |
| 1 | 2018 | 483.0 |
| 2 | 2015 | 234.0 |
| 2 | 2016 | NaN |
| 2 | 2017 | NaN |
| 2 | 2018 | 321.0 |
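
If pyjanitor is not available, the same row-completion step can be approximated with plain pandas. The following is a sketch under assumptions rather than a drop-in replacement: the sample values are assumed, and it builds the year list from the full min-to-max range, whereas complete("id", "year") uses the years that actually occur in the data.

```python
import pandas as pd

# Assumed sample data: id 2 has no rows for 2016 and 2017
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2],
    "year":  [2015, 2016, 2017, 2018, 2015, 2018],
    "spend": [321, 342, 843, 483, 234, 321],
})

# Every id paired with every year between the first and last year in the data
full_index = pd.MultiIndex.from_product(
    [df["id"].unique(), range(df["year"].min(), df["year"].max() + 1)],
    names=["id", "year"],
)

# Reindexing adds the absent id/year pairs as rows with NaN spend
completed = (
    df.set_index(["id", "year"])
      .reindex(full_index)
      .reset_index()
)

print(completed)
```

Using the full range matters when a year is missing for every id: a completion based only on observed years would not add it, while the range-based index does.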

Categorizing Years Using np.select

After adding the missing rows, we can use np.select to categorize each year by comparing its spend with the next year’s spend for the same id.

# Spend in the following year for the same id (NaN if there is none)
next_spend = temp_df.groupby('id')['spend'].shift(-1)

# Difference between this year's spend and next year's spend
temp_df['diff'] = temp_df.groupby('id')['spend'].diff(-1)

# Create conditions for each category
cond1 = next_spend.notna() & temp_df['diff'].lt(0)    # next year is higher
cond2 = next_spend.notna() & temp_df['diff'].ge(0)    # next year is lower or equal
cond3 = temp_df['spend'].notna() & next_spend.isna()  # no spend next year

# Categorize the years using np.select; rows matching no condition get None
temp_df['cat'] = np.select([cond1, cond2, cond3],
                           ["increase", "decrease", "churn"],
                           default=None)

print(temp_df)

Output:

| id | year | spend | diff | cat |
| ---: | ---: | ---: | ---: | ---: |
| 1 | 2015 | 321.0 | -21.0 | increase |
| 1 | 2016 | 342.0 | -501.0 | increase |
| 1 | 2017 | 843.0 | 360.0 | decrease |
| 1 | 2018 | 483.0 | NaN | churn |
| 2 | 2015 | 234.0 | NaN | churn |
| 2 | 2016 | NaN | NaN | None |
| 2 | 2017 | NaN | NaN | None |
| 2 | 2018 | 321.0 | NaN | churn |

Note that the last year in the data (2018) is labelled “churn” only because there is no later year to compare against.

Filtering Out Null Rows

Finally, we can filter out the null rows in the spend column.

# Keep only the rows with a spend value and drop the helper column
filtered_df = temp_df.dropna(subset=["spend"]).drop(columns=["diff"])

print(filtered_df)

Output:

| id | year | spend | cat |
| ---: | ---: | ---: | ---: |
| 1 | 2015 | 321.0 | increase |
| 1 | 2016 | 342.0 | increase |
| 1 | 2017 | 843.0 | decrease |
| 1 | 2018 | 483.0 | churn |
| 2 | 2015 | 234.0 | churn |
| 2 | 2018 | 321.0 | churn |
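
For comparison, the same labels can be produced without adding any rows, by checking whether the next row of each id is exactly one year later. This is an alternative sketch, not the approach used above; the function name label_next_year and the sample values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def label_next_year(df: pd.DataFrame) -> pd.DataFrame:
    """Label each row by what happens to spend in the following calendar year."""
    out = df.sort_values(["id", "year"]).copy()
    grouped = out.groupby("id")
    next_year = grouped["year"].shift(-1)
    next_spend = grouped["spend"].shift(-1)

    # The next row only counts if it is exactly one year later
    has_next = next_year.eq(out["year"] + 1)

    out["cat"] = np.select(
        [has_next & next_spend.gt(out["spend"]),
         has_next & next_spend.le(out["spend"])],
        ["increase", "decrease"],
        default="churn",  # no row for the following year
    )
    return out


# Assumed sample data: id 2 has no rows for 2016 and 2017
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 2, 2],
    "year":  [2015, 2016, 2017, 2018, 2015, 2018],
    "spend": [321, 342, 843, 483, 234, 321],
})

labelled = label_next_year(df)
print(labelled)
```

As with the completed-rows version, the final year of the data is labelled “churn” because no following year exists, so it may be worth excluding that year before reporting churn figures.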

Conclusion

In this article, we’ve discussed the challenges of handling missing years in time-series data using pandas and pyjanitor. The solution has three steps: use the complete function from pyjanitor to add a row for every missing id/year combination, use np.select to categorize each year by comparing it with the next year, and drop the added rows again. By following these steps, you can create a new column that classifies each year as “increase”, “decrease”, or “churn”, even when some years are missing.


Last modified on 2024-01-18