Understanding Seaborn’s Pair Plot and Its Requirements
Seaborn is a powerful data visualization library built on top of matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. One of its most useful tools for visualizing relationships between variables in a dataset is the pair plot.
A pair plot draws a grid of panels in which each numeric column of the input dataset is plotted against every other column. The off-diagonal panels are scatter plots of one column against another, and the diagonal panels show the distribution of each column on its own. This visualization technique can be particularly helpful for exploratory data analysis, allowing users to quickly identify correlations and patterns within their dataset.
However, pair plots need numeric data, so non-numeric columns in a pandas DataFrame call for some preparation before plotting. In this article, we’ll look at numerical and categorical data and at how to prepare a DataFrame for a successful pair plot.
Setting Up Our Example
To illustrate our points, let’s set up a simple example. We’ll create a pandas DataFrame with three columns representing Price, Mileage, and Age:
import pandas as pd
# Creating the dataset
dataset = pd.DataFrame({
'Price': [4250, 6500, 26950, 1295, 5999],
'Mileage': [71000, 43100, 10000, 78000, 61600],
'Age': [8, 6, 3, 17, 8]
})
print(dataset)
Output:
   Price  Mileage  Age
0   4250    71000    8
1   6500    43100    6
2  26950    10000    3
3   1295    78000   17
4   5999    61600    8
This DataFrame contains our example data and will serve as the foundation for exploring how to create a pair plot.
Understanding Seaborn’s Pair Plot Requirements
Seaborn’s pair plot function, sns.pairplot(), plots numeric columns. Each point in a panel needs two coordinates (x and y), taken from the values of two columns in the same row of the DataFrame, so both columns must hold numbers. When no vars argument is given, seaborn uses every column with a numeric data type.
However, what if we want to include non-numeric columns? Can we still use seaborn’s pair plot?
The Problem with Non-Numeric Columns
A non-numeric column cannot be placed on the x or y axis of a scatter plot, because its values have no numerical coordinates. Depending on the seaborn version, such a column is either left out of the grid or causes an error. Recent versions also raise an error when the DataFrame has no numeric columns at all, for example when numbers were read in as strings.
For instance, let’s say we have a new column called ‘City’ containing string values like ‘New York’, ‘Los Angeles’, or ‘Chicago’:
dataset['City'] = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Seattle']
With this column added, seaborn cannot draw ‘City’ on a numeric axis, so it will not appear in the pair plot as it stands.
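Before plotting, it can help to check which columns pandas treats as numeric. The following is a minimal sketch that rebuilds the example DataFrame and uses select_dtypes() to separate the two groups; the names numeric_cols and non_numeric_cols are illustrative only.

```python
import pandas as pd

# Same example data as above, including the string-valued City column
dataset = pd.DataFrame({
    'Price': [4250, 6500, 26950, 1295, 5999],
    'Mileage': [71000, 43100, 10000, 78000, 61600],
    'Age': [8, 6, 3, 17, 8],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Seattle'],
})

# dtypes shows how pandas stores each column
print(dataset.dtypes)

# Columns with a numeric dtype, and everything else
numeric_cols = dataset.select_dtypes(include='number').columns.tolist()
non_numeric_cols = [c for c in dataset.columns if c not in numeric_cols]

print(numeric_cols)
print(non_numeric_cols)
```

Here numeric_cols should contain Price, Mileage, and Age, while City ends up in non_numeric_cols.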
Solution: Coercing Non-Numeric Columns
One way to resolve the issue is by coercing non-numeric columns using pandas’ pd.to_numeric() function. This works well when a column holds numbers stored as text. However, we need to be aware that values which cannot be parsed as numbers are replaced with NaN (Not a Number) when errors='coerce' is passed.
For instance, let’s attempt to coerce our ‘City’ column:
dataset['City'] = pd.to_numeric(dataset['City'], errors='coerce')
# Show the coerced column
print(dataset[['City']])
Output:
   City
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
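As we can see, every value in the ‘City’ column became NaN, because none of the city names can be parsed as a number. The column now has a numeric type, but its original information is gone.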
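To see how much information coercion discards, one option is to count the NaN values it introduces. The sketch below uses two hypothetical Series, price_text and city, which are illustrative and not part of the example dataset: one holds numbers stored as text with a single bad entry, the other holds city names.

```python
import pandas as pd

# Hypothetical inputs: prices stored as text, and plain city names
price_text = pd.Series(['4250', '6500', 'unknown', '1295'])
city = pd.Series(['New York', 'Los Angeles', 'Chicago', 'Houston'])

# errors='coerce' turns anything unparseable into NaN
price_num = pd.to_numeric(price_text, errors='coerce')
city_num = pd.to_numeric(city, errors='coerce')

# Number of values that became NaN because of the conversion
lost_price = int(price_num.isna().sum() - price_text.isna().sum())
lost_city = int(city_num.isna().sum() - city.isna().sum())

print(lost_price, 'of', len(price_text), 'price values lost')
print(lost_city, 'of', len(city), 'city values lost')
```

A small loss, as in price_text, suggests a numeric column with a few bad entries. A total loss, as in city, suggests the column is categorical and should not be coerced at all.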
Coercing Values in Non-Numeric Columns
With the default errors='raise', pd.to_numeric() raises an error if any value cannot be parsed as a number; with errors='coerce', those values become NaN instead. A column that is already numeric passes through unchanged. For instance:
dataset['Price'] = pd.to_numeric(dataset['Price'], errors='coerce')
# Show the full DataFrame after coercion
print(dataset)
Output:
   Price  Mileage  Age  City
0   4250    71000    8   NaN
1   6500    43100    6   NaN
2  26950    10000    3   NaN
3   1295    78000   17   NaN
4   5999    61600    8   NaN
As we can see, the ‘Price’ column was already numeric, so coercion leaves its values unchanged.
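The difference between the two error modes can be checked directly. This is a small sketch with made-up Series, and the variable names unchanged and raised are illustrative: it confirms that an integer column survives coercion as-is, and that the default mode raises a ValueError on text that cannot be parsed.

```python
import pandas as pd

# An already-numeric column is returned with the same values
prices = pd.Series([4250, 6500, 26950])
coerced = pd.to_numeric(prices, errors='coerce')
unchanged = coerced.equals(prices)
print('unchanged:', unchanged)

# The default errors='raise' refuses text that cannot be parsed
try:
    pd.to_numeric(pd.Series(['New York', 'Chicago']))
    raised = False
except ValueError as exc:
    raised = True
    print('ValueError:', exc)
```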
Using Seaborn’s Pair Plot on Coerced Data
Now that every column has a numeric data type, we can create a pair plot. Because ‘City’ now contains only NaN values, it has nothing to contribute to the plot:
import seaborn as sns
import matplotlib.pyplot as plt
# City holds only NaN after coercion, so leave it out of the plot
sns.pairplot(dataset.drop(columns='City'))
plt.show()
Output:
A 3x3 grid of plots displaying the relationships between each column in our dataset.
This example demonstrates how to use seaborn’s pair plot function on a pandas DataFrame that contains both numeric and non-numeric columns: plot the numeric columns, and convert text columns with pd.to_numeric() only when they actually hold numbers.
Last modified on 2023-06-25