Creating Histograms with Percentage of Type Column

In this article, we will explore how to create histograms where the y-axis represents the percentage of each type in a given bin.

The Problem

A common task when working with data is to visualize how different types are distributed. A histogram is an effective way to do this. Sometimes, however, you want to show not the count of each type but its proportion within each bin. For example, given a dataset with two types, A and B, we may want the percentage of A among all observations in a given bin.

Given Data

Let’s start with an example dataset:

import pandas as pd

df = pd.DataFrame([['A', 4], ['B', 12], ['B', 50], ['B', 19], ['A', 39], ['B', 12], ['A', 22], ['B', 33], ['B', 14], ['B', 43], ['A', 50], ['B', 34], ['A', 22],  ['B', 60],
              ['A', 14], ['B', 31], ['B', 40], ['B', 38], ['A', 21], ['B', 41], ['A', 23], ['B', 45], ['B', 25], ['B', 32], ['A', 10], ['B', 31], ['A', 21],  ['B', 51]])
df.columns = ['Type', 'Distance']

The goal is to create a histogram with bins of 10 units, where the y-axis represents the percentage of type A within each bin.

Step 1: Data Preparation

Before we can create the histogram, we need to prepare the data. This involves reshaping it so that each distinct distance becomes a row and each type becomes a column.

# Pivot table to reshape the data
df2 = df.pivot_table(index='Distance', columns='Type', aggfunc='size', fill_value=0)

This step creates a pivot table where each row represents a unique distance and each column represents a type. Setting aggfunc to 'size' makes each cell hold the number of rows with that distance and type, and fill_value=0 puts a 0 in cells for combinations that never occur.
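To make the shape of the pivot table concrete, here is a minimal sketch on a tiny made-up sample (the `sample` frame is an assumption for illustration, not the article's full dataset):

```python
import pandas as pd

# Tiny illustrative sample; these five rows are hypothetical.
sample = pd.DataFrame(
    [["A", 14], ["B", 14], ["B", 12], ["B", 12], ["A", 22]],
    columns=["Type", "Distance"],
)

# Count how many rows of each Type occur at each Distance.
counts = sample.pivot_table(
    index="Distance", columns="Type", aggfunc="size", fill_value=0
)
print(counts)
# Distance 12 has 0 rows of A and 2 of B,
# Distance 14 has 1 of each, Distance 22 has 1 of A and 0 of B.
```

Each distinct distance gets exactly one row, so repeated distances are collapsed into counts rather than kept as separate rows.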

Step 2: Bin Creation

Next, we need to define the bin edges for our data. We will apply them with the cut function in pandas in the next step.

# Define bins
bins = range(0, int(df2.index.max())+1, 10)

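This step creates the bin edges 0, 10, 20, ..., 60 in increments of 10 units. The maximum value in the dataset determines the upper limit of the last bin. This works here because the maximum distance (60) is a multiple of 10; with a maximum such as 65, the last edge would still be 60 and larger values would fall outside every bin.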
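As a rough illustration of how these edges behave, the sketch below builds the same edges from a hard-coded maximum of 60 (taken from the example data) and passes a few hand-picked values through `pd.cut`. The variable names `edges` and `labels` are assumptions for this example.

```python
import pandas as pd

max_distance = 60  # largest Distance in the example dataset
edges = list(range(0, max_distance + 1, 10))
print(edges)  # [0, 10, 20, 30, 40, 50, 60]

# pd.cut assigns each value to a right-closed interval such as (10, 20].
labels = pd.cut([4, 10, 12, 60], bins=edges)
print(labels.codes)  # bin positions 0, 0, 1, 5

# With the default settings the first interval is open on the left,
# so a value of exactly 0 belongs to no bin (code -1).
print(pd.cut([0], bins=edges).codes)
```

Note that 10 lands in the first bin, not the second, because the intervals include their right edge. If a dataset contains a distance of 0, passing `include_lowest=True` to `pd.cut` is one way to keep it.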

Step 3: Binning and Grouping

Now that we have our bins, we can group our data by these bins and sum the sizes of each type within them.

# Assign each distance to a bin and sum the counts per bin.
# observed=False keeps bins that contain no data.
df3 = df2.groupby(pd.cut(df2.index, bins=bins), observed=False).sum()

This step creates a new dataframe where each row represents a bin, such as (0, 10] or (10, 20], and each column represents a type. Each cell holds the total count of that type within that bin.
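The same per-bin counts can also be obtained in one call with `pd.crosstab`. This is an alternative route, not the method used in this article, and is shown only as a sketch for cross-checking the result:

```python
import pandas as pd

df = pd.DataFrame(
    [["A", 4], ["B", 12], ["B", 50], ["B", 19], ["A", 39], ["B", 12], ["A", 22],
     ["B", 33], ["B", 14], ["B", 43], ["A", 50], ["B", 34], ["A", 22], ["B", 60],
     ["A", 14], ["B", 31], ["B", 40], ["B", 38], ["A", 21], ["B", 41], ["A", 23],
     ["B", 45], ["B", 25], ["B", 32], ["A", 10], ["B", 31], ["A", 21], ["B", 51]],
    columns=["Type", "Distance"],
)

bins = range(0, int(df["Distance"].max()) + 1, 10)

# Rows are the distance bins, columns are the types, cells are counts.
counts = pd.crosstab(pd.cut(df["Distance"], bins=bins), df["Type"])
print(counts)
# Counts of A per bin: 2, 1, 5, 1, 1, 0
# Counts of B per bin: 0, 4, 1, 7, 4, 2
```

The counts add up to the 28 rows of the dataset, which is a quick sanity check that no value fell outside the bins.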

Step 4: Calculating Percentages

To calculate the percentages of type A within each bin, we need to divide the size of type A by the total size of all types within that bin and multiply by 100.

# Calculate the percentage of type A per bin and plot it
df3['A'].div(df3.sum(axis=1)).mul(100).plot.bar(width=1)

This step calculates the percentage of type A for each bin and plots it as a bar chart. Setting the width parameter to 1 makes each bar fill the full width of its bin, so adjacent bars touch as they do in a histogram.
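The arithmetic can be checked by hand on a few bins. The sketch below hard-codes the counts for the first three bins of the example dataset (the `binned` frame and its string labels are assumptions for illustration) and applies the same division:

```python
import pandas as pd

# Counts of each type in the first three bins of the example data.
binned = pd.DataFrame(
    {"A": [2, 1, 5], "B": [0, 4, 1]},
    index=["(0, 10]", "(10, 20]", "(20, 30]"],
)

# Share of A = count of A / total count in the bin, times 100.
pct_a = binned["A"].div(binned.sum(axis=1)).mul(100)
print(pct_a.round(1))
# (0, 10]: 100.0, (10, 20]: 20.0, (20, 30]: 83.3
```

For the bin (10, 20], for instance, there is 1 row of type A out of 5 rows in total, which gives 100 * 1 / 5 = 20 percent.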

Example Use Case

Here’s the complete script that puts these steps together:

import pandas as pd

# Create the dataset
df = pd.DataFrame([['A', 4], ['B', 12], ['B', 50], ['B', 19], ['A', 39], ['B', 12], ['A', 22], ['B', 33], ['B', 14], ['B', 43], ['A', 50], ['B', 34], ['A', 22],  ['B', 60],
              ['A', 14], ['B', 31], ['B', 40], ['B', 38], ['A', 21], ['B', 41], ['A', 23], ['B', 45], ['B', 25], ['B', 32], ['A', 10], ['B', 31], ['A', 21],  ['B', 51]])
df.columns = ['Type', 'Distance']

# Create the pivot table
df2 = df.pivot_table(index='Distance', columns='Type', aggfunc='size', fill_value=0)

# Define bins
bins = range(0, int(df2.index.max())+1, 10)

# Assign each distance to a bin and sum the counts per bin.
# observed=False keeps bins that contain no data.
df3 = df2.groupby(pd.cut(df2.index, bins=bins), observed=False).sum()

# Calculate the percentage of type A per bin and plot it
df3['A'].div(df3.sum(axis=1)).mul(100).plot.bar(width=1)

This code creates a histogram where the y-axis represents the percentage of type A within each bin. The x-axis shows the distance bins.
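A shorter variant skips the pivot table: mark each row as "is type A", group the marks by distance bin, and take the mean. The mean of a True/False column is the share of True values. This is a sketch of an alternative, not the approach used in this article, and the names `is_a` and `pct_a` are assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    [["A", 4], ["B", 12], ["B", 50], ["B", 19], ["A", 39], ["B", 12], ["A", 22],
     ["B", 33], ["B", 14], ["B", 43], ["A", 50], ["B", 34], ["A", 22], ["B", 60],
     ["A", 14], ["B", 31], ["B", 40], ["B", 38], ["A", 21], ["B", 41], ["A", 23],
     ["B", 45], ["B", 25], ["B", 32], ["A", 10], ["B", 31], ["A", 21], ["B", 51]],
    columns=["Type", "Distance"],
)

bins = range(0, int(df["Distance"].max()) + 1, 10)

# True where the row is type A, False otherwise.
is_a = df["Type"].eq("A")

# Mean of the boolean marks per bin = fraction of A; times 100 = percent.
pct_a = (
    is_a.groupby(pd.cut(df["Distance"], bins=bins), observed=False)
    .mean()
    .mul(100)
)
print(pct_a.round(1))
# Percent of A per bin: 100.0, 20.0, 83.3, 12.5, 20.0, 0.0
```

Chaining `.plot.bar(width=1)` onto `pct_a` should give the same kind of chart as the script above, provided matplotlib is installed.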

Conclusion

In this article, we explored how to create histograms whose y-axis shows percentages rather than counts. We used pandas to count each type per distance, group the counts into bins, and calculate the percentage of type A in each bin. The complete script demonstrates these steps on a small example dataset.

Last modified on 2023-07-09