Understanding Missing Values in Pandas DataFrames: Filling with Conditional Mean

In this article, we’ll explore a common problem in data analysis using Python and the popular Pandas library. We have a DataFrame where some values are missing (NaN), and we want to fill them with the mean of the nearest non-missing values before and after them in the same column. We first do this only for runs of exactly two consecutive NaNs, and then for every missing value.

Setting Up the Problem

First, let’s set up our problem by creating a sample DataFrame with missing values:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col': [1, 15.6, np.nan, np.nan, 15.8, 5,
                           np.nan, 4, 10, np.nan, np.nan, np.nan, 7]})

This DataFrame has a single column, ‘col’, with six missing values arranged in runs of two, one, and three consecutive NaNs.
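
Before filling anything, it can help to confirm where the gaps sit. The following is a small sketch that rebuilds the same sample data and reports the count and positions of the NaNs; the names n_missing and missing_positions are illustrative and not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1, 15.6, np.nan, np.nan, 15.8, 5,
                           np.nan, 4, 10, np.nan, np.nan, np.nan, 7]})

# Total number of missing values in the column
n_missing = df['col'].isna().sum()

# Index labels of the rows that hold NaN
missing_positions = df.index[df['col'].isna()].tolist()

print(n_missing)          # 6
print(missing_positions)  # [2, 3, 6, 9, 10, 11]
```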

Identifying Consecutive Missing Values

We first need to identify the rows that belong to a run of exactly two consecutive NaNs. We do this by flagging the non-missing values, using their cumulative sum to label each run of NaNs, and keeping the runs whose size is 2. We then expand the mask by one row in each direction so that it also covers the values immediately before and after each run, which are needed to compute the mean.

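# Flag non-missing values
m = df['col'].notna()

# Flag NaNs that belong to a run of exactly two consecutive NaNs
mask = df.groupby(m.cumsum()[~m])['col'].transform('size').eq(2)

# Expand the mask to include the value before and after each run
mask = mask.shift(fill_value=False) | mask.shift(-1, fill_value=False) | mask

print(mask)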

Output:

0     False
1      True
2      True
3      True
4      True
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
Name: col, dtype: bool
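
The one-line grouping above can be hard to read at a glance. The sketch below spells out the same idea in separate steps, using a slightly different formulation: it groups on the full run id rather than on the NaN-only subset, and counts the NaNs in each run. The names is_nan, run_id, run_len, core, and expanded are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 15.6, np.nan, np.nan, 15.8, 5,
               np.nan, 4, 10, np.nan, np.nan, np.nan, 7], name='col')

is_nan = s.isna()

# Each non-missing value starts a new run id; the NaNs that follow inherit it
run_id = s.notna().cumsum()

# Number of NaNs in each run, broadcast to its rows (0 on non-missing rows)
run_len = is_nan.astype(int).groupby(run_id).transform('sum').where(is_nan, 0)
print(run_len.tolist())  # [0, 0, 2, 2, 0, 0, 1, 0, 0, 3, 3, 3, 0]

# NaNs that sit in a run of exactly two
core = run_len.eq(2)

# Add the row before and the row after each such run
expanded = core | core.shift(fill_value=False) | core.shift(-1, fill_value=False)
print(expanded[expanded].index.tolist())  # [1, 2, 3, 4]
```

Changing the 2 in run_len.eq(2) to another length selects runs of that size instead.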

Calculating Conditional Mean

Now that we have identified the rows with two consecutive NaNs and their neighbors, we can fill each gap with the average of the surrounding values. bfill fills each missing value with the next non-missing value, and ffill fills it with the previous one; adding the two results and dividing by 2 gives the mean. Non-missing rows inside the mask are unchanged, because both fills leave them as they are.
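
As a quick illustration of the two fills, here is a minimal sketch on the four values covered by the mask (15.6, NaN, NaN, 15.8). The series s and the names next_valid and prev_valid are introduced here only for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([15.6, np.nan, np.nan, 15.8])

# bfill pulls the next valid value backward into each gap
next_valid = s.bfill()   # 15.6, 15.8, 15.8, 15.8

# ffill pushes the previous valid value forward into each gap
prev_valid = s.ffill()   # 15.6, 15.6, 15.6, 15.8

# Mean of the two neighbors; rounded only to hide floating-point noise
filled = next_valid.add(prev_valid).div(2)
print(filled.round(2).tolist())  # [15.6, 15.7, 15.7, 15.8]
```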

# Fill the masked rows with the mean of the next and previous valid values
df.loc[mask, 'col'] = df.loc[mask, 'col'].bfill().add(df.loc[mask, 'col'].ffill()).div(2)

print(df)

Output:

     col
0    1.0
1   15.6
2   15.7
3   15.7
4   15.8
5    5.0
6    NaN
7    4.0
8   10.0
9    NaN
10   NaN
11   NaN
12   7.0

Filling All Missing Values Without the Mask

If we want to fill every missing value, regardless of how many NaNs appear in a row, we can drop the mask and apply the same expression to the whole column.
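
One edge case is worth checking when the mask is removed: a NaN at the very start has no previous value, and a NaN at the very end has no next value, so one of the two fills stays NaN and the mean does too. The sketch below uses a hypothetical five-value series (not the article's data) to show this, together with one possible fallback that is an assumption of this example rather than part of the original approach.

```python
import numpy as np
import pandas as pd

# Hypothetical series with NaNs at both ends, for illustration only
s = pd.Series([np.nan, 2.0, np.nan, 4.0, np.nan])

# Same expression as in the article: mean of next and previous valid values
filled = s.bfill().add(s.ffill()).div(2)
print(filled.tolist())  # [nan, 2.0, 3.0, 4.0, nan]

# Optional fallback (an assumption, not the article's method):
# let edge gaps take the single neighbor that does exist
edge_filled = filled.fillna(s.bfill()).fillna(s.ffill())
print(edge_filled.tolist())  # [2.0, 2.0, 3.0, 4.0, 4.0]
```

The sample DataFrame has no NaN at either end, so the plain expression is enough for it: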

# To fill every missing value with the mean, drop the mask
df['col'] = df['col'].bfill().add(df['col'].ffill()).div(2)

print(df)

Output:

     col
0    1.0
1   15.6
2   15.7
3   15.7
4   15.8
5    5.0
6    4.5
7    4.0
8   10.0
9    8.5
10   8.5
11   8.5
12   7.0

Conclusion

In this article, we filled missing values in a Pandas column with the mean of the nearest non-missing values on either side. We built a mask for runs of exactly two consecutive NaNs, expanded it to include the neighboring values, and averaged the results of bfill and ffill to fill the gaps. We also showed how to apply the same expression to every missing value by dropping the mask.