Counting Column Categorical Values Based on Another Column in Python with Pandas


In this article, we will explore how to count categorical values in one column based on another column in pandas. We will start with an overview of the pandas library and its data structures, followed by a detailed explanation of how to achieve this task.

Introduction to Pandas


Pandas is a powerful Python library used for data manipulation and analysis. It provides data structures such as Series (1-dimensional labeled array) and DataFrames (2-dimensional labeled data structure with columns of potentially different types).

Creating a DataFrame

To get started, we need to create a sample DataFrame.

import pandas as pd

# Create a dictionary
data = {
    'Age': [22, 25, 27, 30],
    'Sex': ['Male', 'Female', 'Male', 'Female'],
    'BMI': [20, 21, 22, 23],
    'Children': [1, 2, 3, 4],
    'Smoker': ['Yes', 'No', 'Yes', 'No'],
    'Region': ['Northwest', 'Northeast', 'Southwest', 'Southeast']
}

# Create a DataFrame
df = pd.DataFrame(data)

Grouping and Counting Categorical Values


Now that we have our sample DataFrame, let’s explore how to count categorical values in one column based on another column.

Using groupby and count

The groupby method groups the rows by one or more columns and returns a GroupBy object. Calling count on that object returns, for each group, the number of non-null values in every remaining column.

# Group by Region and Smoker, and count the number of non-null values
df_grouped = df.groupby(['Region', 'Smoker']).count()

print(df_grouped)

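This will output:

                  Age  Sex  BMI  Children
Region    Smoker
Northeast No        1    1    1         1
Northwest Yes       1    1    1         1
Southeast No        1    1    1         1
Southwest Yes       1    1    1         1

Each region appears only once in the sample data, so every count is 1. Note that count reports the non-null count separately for each remaining column, and that combinations that never occur (such as Northwest with No) are absent from the result.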
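If a single count per Region and Smoker pair is all that is needed, size is a more direct alternative to count, and unstack can pivot the result into a Region by Smoker table. The following is a minimal sketch that assumes the same sample data as above, reduced to the two columns of interest; the names pair_counts and smoker_table are illustrative and do not come from the original article.

```python
import pandas as pd

# Same sample data as above, reduced to the two columns of interest
df = pd.DataFrame({
    'Smoker': ['Yes', 'No', 'Yes', 'No'],
    'Region': ['Northwest', 'Northeast', 'Southwest', 'Southeast'],
})

# size() counts rows per (Region, Smoker) pair, regardless of missing values
pair_counts = df.groupby(['Region', 'Smoker']).size()

# unstack() pivots the Smoker level into columns; pairs that never occur become 0
smoker_table = pair_counts.unstack(fill_value=0)

print(smoker_table)
```

Unlike count, size returns one number per group rather than one per column, which makes the table easier to read when only the row counts matter.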

Using groupby and apply

We can also use the apply function to apply a custom function to each group.

# Define a custom function that counts the rows in a group where Smoker is 'Yes'
def count_smokers(group):
    return group['Smoker'].str.contains('Yes').sum()

# Group by Region, and apply the custom function
df_grouped = df.groupby('Region').apply(count_smokers)

print(df_grouped)

This will output:

Region
Northeast    0
Northwest    1
Southeast    0
Southwest    1
dtype: int64

However, this approach is less efficient than a built-in aggregation such as count, because apply calls the Python function once per group.
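The same per-region smoker count can usually be obtained without apply by building a boolean column first and summing it per group. This is a sketch under the assumption that an exact match on 'Yes' is what is wanted; the names is_smoker and smokers_per_region are illustrative.

```python
import pandas as pd

# Same sample data as above, reduced to the two columns of interest
df = pd.DataFrame({
    'Smoker': ['Yes', 'No', 'Yes', 'No'],
    'Region': ['Northwest', 'Northeast', 'Southwest', 'Southeast'],
})

# Boolean Series: True where the row is a smoker
is_smoker = df['Smoker'].eq('Yes')

# Summing booleans per Region counts the True values in each group
smokers_per_region = is_smoker.groupby(df['Region']).sum()

print(smokers_per_region)
```

Because the comparison and the sum are both built-in vectorized operations, no Python function has to be called for each group.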

Using value_counts

We can use the value_counts method to count how many times each distinct value occurs in a column.

# Count the occurrences of each value in the Smoker column
smoker_counts = df['Smoker'].value_counts()

print(smoker_counts)

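This will output:

Yes    2
No     2
Name: Smoker, dtype: int64

In pandas 2.0 and later, the resulting Series is named count and its index is labeled Smoker. However, called on a single column like this, value_counts takes no account of Region; to count per region, it has to be combined with groupby.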
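One way to bring the second column back in is to call value_counts on a grouped column, or to use pd.crosstab, which builds a table of counts directly. The sketch below assumes the same sample data; the names per_region and table are illustrative and not part of the original article.

```python
import pandas as pd

# Same sample data as above, reduced to the two columns of interest
df = pd.DataFrame({
    'Smoker': ['Yes', 'No', 'Yes', 'No'],
    'Region': ['Northwest', 'Northeast', 'Southwest', 'Southeast'],
})

# value_counts on a grouped column counts each Smoker value within each Region
per_region = df.groupby('Region')['Smoker'].value_counts()

# crosstab builds the counts directly as a Region x Smoker table,
# filling combinations that never occur with 0
table = pd.crosstab(df['Region'], df['Smoker'])

print(per_region)
print(table)
```

The grouped value_counts result is a Series with a (Region, Smoker) MultiIndex, while crosstab returns a DataFrame with one column per Smoker value, so the choice mostly depends on which shape is more convenient downstream.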

Conclusion


In this article, we explored how to count categorical values in one column based on another column using pandas. We covered three approaches: groupby and count, groupby and apply, and value_counts. Each has its own strengths and weaknesses: groupby with a built-in aggregation such as count is generally the most efficient, apply is the most flexible but slower, and value_counts on its own summarizes a single column without reference to another.


Last modified on 2023-09-07