Converting Columns into Rows with Pandas
=====================================================
As data analysts and scientists, we often encounter datasets that have a mix of columnar and row-based structures. In this post, we’ll explore how to convert columns into rows using the popular Python library, Pandas.
Understanding the Problem
The problem at hand is to take a dataset with location information by date, where each date corresponds to a different column header. For example:
location | name | Jan-2010 | Feb-2010 | March-2010 |
---|---|---|---|---|
A | test | 12 | 20 | 30 |
B | foo | 18 | 20 | 25 |
And transform it into a dataset with location, name, and date columns, where each date has its corresponding value:
location | name | Date | Value |
---|---|---|---|
A | test | Jan-2010 | 12 |
A | test | Feb-2010 | 20 |
A | test | March-2010 | 30 |
B | foo | Jan-2010 | 18 |
B | foo | Feb-2010 | 20 |
B | foo | March-2010 | 25 |
Solution Overview
To solve this problem, we’ll use the melt
function from Pandas. This function transforms a dataset with a multi-level index (i.e., multiple columns) into a long format, where each row corresponds to a single observation.
Using melt
The melt
function takes three main arguments:
id_vars
: A list of column names that should be kept as is. These are the variables that will remain unchanged.var_name
: The name of the new column that will contain the original variable names.value_name
: The name of the new column that will contain the original values.
Here’s an example:
import pandas as pd
# Create a sample dataset
data = {
'location': ['A', 'B'],
'name': ['test', 'foo'],
'Jan-2010': [12, 18],
'Feb-2010': [20, 20],
'March-2010': [30, 25]
}
df = pd.DataFrame(data)
# Melt the dataset
df_melted = df.melt(id_vars=['location', 'name'], var_name='Date', value_name='Value')
print(df_melted)
Output:
location | name | Date | Value |
---|---|---|---|
A | test | Jan-2010 | 12 |
A | test | Feb-2010 | 20 |
A | test | March-2010 | 30 |
B | foo | Jan-2010 | 18 |
B | foo | Feb-2010 | 20 |
B | foo | March-2010 | 25 |
Handling Old Versions (< 0.20)
If you’re using an older version of Pandas (< 0.20), you can use the pd.melt
function instead, and then sort the resulting dataset.
Here’s an example:
import pandas as pd
# Create a sample dataset
data = {
'location': ['A', 'B'],
'name': ['test', 'foo'],
'Jan-2010': [12, 18],
'Feb-2010': [20, 20],
'March-2010': [30, 25]
}
df = pd.DataFrame(data)
# Melt the dataset using old version
df_melted_old = pd.melt(df, id_vars=['location', 'name'], var_name='Date', value_name='Value')
print(df_melted_old)
Output:
location | name | Date | Value |
---|---|---|---|
A | test | Jan-2010 | 12 |
A | test | Feb-2010 | 20 |
A | test | March-2010 | 30 |
B | foo | Jan-2010 | 18 |
B | foo | Feb-2010 | 20 |
B | foo | March-2010 | 25 |
Note that pd.DataFrame.sort
has been deprecated in favor of pd.DataFrame.sort_values
.
Final Touches
After using the melt
function, you may want to add a .reset_index(drop=True)
to keep the output clean.
Here’s an example:
import pandas as pd
# Create a sample dataset
data = {
'location': ['A', 'B'],
'name': ['test', 'foo'],
'Jan-2010': [12, 18],
'Feb-2010': [20, 20],
'March-2010': [30, 25]
}
df = pd.DataFrame(data)
# Melt the dataset
df_melted = df.melt(id_vars=['location', 'name'], var_name='Date', value_name='Value')
# Add .reset_index(drop=True) for clean output
df_melted = df_melted.reset_index(drop=True)
print(df_melted)
Output:
location | name | Date | Value |
---|---|---|---|
A | test | Jan-2010 | 12 |
A | test | Feb-2010 | 20 |
A | test | March-2010 | 30 |
B | foo | Jan-2010 | 18 |
B | foo | Feb-2010 | 20 |
B | foo | March-2010 | 25 |
By following these steps, you can easily convert columns into rows using the melt
function in Pandas.
Last modified on 2023-10-15