Creating Different DataFrames for Conditions on Multiple Columns in Excel Data using Python

Introduction

In this article, we will explore how to create different dataframes based on conditions applied to multiple columns in a dataset. We’ll use the popular Python library Pandas to achieve this task.

Overview of Pandas

Pandas is a powerful open-source library for data manipulation and analysis in Python. It provides data structures and functions to efficiently handle structured data, including tabular data such as spreadsheets and SQL tables.

The main data structure used in Pandas is the DataFrame, which is similar to an Excel spreadsheet or a table in a relational database. DataFrames are two-dimensional tables of data with rows and columns.

Filtering Data with Pandas

Pandas provides several functions to filter data based on conditions applied to one or more columns. In this article, we’ll focus on filtering data based on multiple conditions applied to different columns.

Data Preparation

To demonstrate the process of creating different dataframes, we’ll use a sample dataset that contains information about invoices with details such as invoice number, main date, reported date, fee, amount, cost, and name.

import pandas as pd

# Create a sample DataFrame
data = {
    'Number': [223311, 111111, 222222],
    'Main Date': ['1/1/2019', '1/2/2019', '1/3/2019'],
    'Reported Date': ['1/1/2019', '1/2/2019', '1/3/2019'],
    'Fee': [100, 100, 100],
    'Amount': [12, 12, 12],
    'Cost': [20, 20, 20],
    'Name': ['John Doe', 'Jane Smith', 'Bob Johnson']
}

df = pd.DataFrame(data)

print(df)

Output (index column omitted):

Number	Main Date	Reported Date	Fee	Amount	Cost	Name
223311	1/1/2019	1/1/2019	100	12	20	John Doe
111111	1/2/2019	1/2/2019	100	12	20	Jane Smith
222222	1/3/2019	1/3/2019	100	12	20	Bob Johnson
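Note that the date columns in this sample are plain strings, so any comparison between them would be alphabetical rather than chronological. A minimal sketch of converting them first, assuming the strings are in month/day/year order (the source data does not say which order is intended):

```python
import pandas as pd

# Date strings copied from the sample data; month/day/year order is an assumption.
df = pd.DataFrame({
    'Number': [223311, 111111, 222222],
    'Main Date': ['1/1/2019', '1/2/2019', '1/3/2019'],
    'Reported Date': ['1/1/2019', '1/2/2019', '1/3/2019'],
})

# Compared as strings, October sorts before September because '1' < '9'.
string_compare = '10/1/2019' > '9/1/2019'

# Parse both columns with an explicit format so comparisons become chronological.
for col in ['Main Date', 'Reported Date']:
    df[col] = pd.to_datetime(df[col], format='%m/%d/%Y')

# A comparison on the converted column now behaves like a date comparison.
after_jan_1 = df['Main Date'] > pd.Timestamp('2019-01-01')
```

Passing an explicit format avoids ambiguity between day-first and month-first dates; adjust it if your spreadsheet uses a different convention.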

Filtering Data based on Multiple Conditions

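To create different dataframes, we need to filter the original DataFrame based on conditions applied to different columns. One option is a row-wise apply: df[df.apply(lambda row: (row['Main Date'] > row['Reported Date']) and (row['Number'] == 223311), axis=1)]. Note that the date columns in the sample data are strings, so this comparison is alphabetical unless the columns are first converted with pd.to_datetime.

However, boolean indexing is more efficient and idiomatic, because it evaluates each condition on whole columns at once instead of calling a Python function for every row. The simplest case splits the data by invoice number: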

# Create one DataFrame per invoice number (a single-column condition)
df_223311 = df[df['Number'] == 223311]
df_111111 = df[df['Number'] == 111111]
df_222222 = df[df['Number'] == 222222]

print(df_223311)
print(df_111111)
print(df_222222)

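Output (index column omitted; each print call emits its own header and a single row, combined here under one header):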

Number	Main Date	Reported Date	Fee	Amount	Cost	Name
223311	1/1/2019	1/1/2019	100	12	20	John Doe
111111	1/2/2019	1/2/2019	100	12	20	Jane Smith
222222	1/3/2019	1/3/2019	100	12	20	Bob Johnson
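The filters above test a single column. To combine conditions across several columns, each comparison can be wrapped in parentheses and joined with & (and), | (or), or negated with ~. The sketch below reuses the sample data; the specific conditions (same-day reporting, a cutoff date, cost above amount) are illustrative assumptions rather than rules taken from the dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'Number': [223311, 111111, 222222],
    'Main Date': ['1/1/2019', '1/2/2019', '1/3/2019'],
    'Reported Date': ['1/1/2019', '1/2/2019', '1/3/2019'],
    'Fee': [100, 100, 100],
    'Amount': [12, 12, 12],
    'Cost': [20, 20, 20],
    'Name': ['John Doe', 'Jane Smith', 'Bob Johnson'],
})

# Convert the date strings (assumed month/day/year) so they compare as dates.
for col in ['Main Date', 'Reported Date']:
    df[col] = pd.to_datetime(df[col], format='%m/%d/%Y')

cutoff = pd.Timestamp('2019-01-01')  # illustrative cutoff date

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (
    (df['Main Date'] == df['Reported Date'])
    & (df['Main Date'] > cutoff)
    & (df['Cost'] > df['Amount'])
)

df_match = df[mask]   # rows that satisfy every condition
df_rest = df[~mask]   # all remaining rows

# Row-wise apply version of the same test, shown only for comparison.
apply_mask = df.apply(
    lambda row: (row['Main Date'] == row['Reported Date'])
    and (row['Main Date'] > cutoff)
    and (row['Cost'] > row['Amount']),
    axis=1,
)
```

Both masks select the same rows, but the vectorized version avoids a Python-level function call per row. The parentheses around each comparison matter because & binds more tightly than == and >.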

Best Practices

To ensure efficient filtering, follow these best practices:

Use boolean indexing instead of the apply method.
When filtering in several steps, apply the most selective condition first so that later conditions are evaluated on fewer rows.
Avoid repeating df[df['Column'] == value] once for every distinct value on large datasets, since each filter scans the whole column again; groupby can split the data in a single pass.
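When one DataFrame is needed per distinct value of a column, writing a separate filter for each value does not scale well. One possible alternative is to let groupby do the split and collect the pieces in a dictionary. This is a minimal sketch; the fourth row is a hypothetical repeat invoice added so that one group holds more than one row, and the name frames is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Number': [223311, 111111, 222222, 111111],  # last entry is a hypothetical repeat
    'Fee': [100, 100, 100, 50],                  # 50 is likewise made up for illustration
})

# Iterating over a groupby yields (key, sub-DataFrame) pairs,
# so a dict comprehension maps each invoice number to its rows.
frames = {number: group for number, group in df.groupby('Number')}

# Look up the DataFrame for a single invoice number.
df_111111 = frames[111111]
```

A dictionary keyed by the column value also avoids creating one variable per invoice number, which becomes unmanageable once the set of values is not known in advance.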

Conclusion

In this article, we demonstrated how to create different dataframes based on conditions applied to multiple columns in a dataset using Python and Pandas. We discussed best practices for efficient filtering and provided examples to illustrate the process.

By mastering these techniques, you’ll be able to efficiently manipulate and analyze large datasets in Python, making it easier to extract insights and drive business decisions.

Last modified on 2025-03-03