Group by in Pandas to Find Median
Introduction
Pandas is a powerful data analysis library for Python that provides data structures and functions to efficiently handle structured data, including tabular data such as spreadsheets and SQL tables. One of the key features of Pandas is its ability to perform group-by operations, which allow you to perform calculations on subsets of your data.
In this article, we will explore how to use group-by operations in Pandas to find the median of multiple columns in a dataframe. We will also cover how to create a new boolean column to determine if a property is overpriced or not based on its price and other attributes.
Prerequisites
Before you start, make sure you have Python and Pandas installed on your system. You can install Pandas using pip:
pip install pandas numpy
Also, make sure you are familiar with the basics of Python and Pandas data structures.
Sample Data
Let’s create a sample dataframe house
that contains information about properties in different suburbs. The dataframe has columns for suburb, price, rooms, bathrooms, type, car, and age.
import pandas as pd
import numpy as np
# Create the sample dataframe
house = pd.DataFrame({
'suburb': ['BALWYN NORTH', 'ARMADALE', 'ARMADALE', 'PASCOE VALE'],
'price': [1350000.0, 800000.0, 1250000.0, 680000.0],
'rooms': [3, 4, 7, 2],
'bathroom': [1.0, 2.0, 4.0, 1.0],
'type': ['h', 't', 't', 'u'],
'car': ['2.0', '1.0', '4.0', '1.0'],
'age': [59.0, 69.0, 12.0, 14.0]
})
Group by Multiple Columns
One of the key features of Pandas is its ability to perform group-by operations on multiple columns. To find the median of multiple columns, you can use the groupby
method and apply the median
function to each column.
# Group by multiple columns and calculate the median
median_values = house.groupby(['suburb', 'rooms', 'bathroom', 'type', 'car', 'age'])['price'].median()
print(median_values)
This code will group the data by the specified columns and calculate the median of the price
column for each group.
Create a New Boolean Column
To create a new boolean column that determines if a property is overpriced or not, you can use the transform
method to apply the over_price
function to each row in the dataframe.
def over_price(elements):
median = np.median(elements)
return elements > median
house["OverPrice"] = house.groupby(["subrub","rooms","bathroom","type","car","age"])["price"].transform(over_price)
This code defines a function over_price
that takes an array of numbers as input and returns a boolean array indicating whether each number is greater than the median. The groupby
method groups the data by the specified columns, and the transform
method applies the over_price
function to each group.
Explanation
The groupby
method groups the data by one or more columns. In this case, we are grouping by five columns: suburb
, rooms
, bathroom
, type
, and car
. The median
function calculates the median of the price
column for each group.
The transform
method applies a function to each row in the dataframe. In this case, we are applying the over_price
function to each row. The groupby
method groups the data by the specified columns, and the transform
method applies the over_price
function to each group.
Example Use Cases
Here are some example use cases for grouping by multiple columns in Pandas:
- Finding the average price of properties in different suburbs:
average_prices = house.groupby('suburb')['price'].mean()
- Grouping by rooms and calculating the median price of each room type:
median_prices_by_room_type = house.groupby(['rooms', 'type'])['price'].median()
- Finding the maximum age of properties in different suburbs:
max_ages = house.groupby('suburb')['age'].max()
Conclusion
Grouping by multiple columns in Pandas allows you to perform calculations on subsets of your data. By using the groupby
and transform
methods, you can calculate medians and create new boolean columns based on your data.
In this article, we covered how to use group-by operations in Pandas to find the median of multiple columns and create a new boolean column to determine if a property is overpriced or not. We also provided example use cases for grouping by multiple columns in Pandas.
Last modified on 2023-06-18