Joining Two Separate Excel-Based DataFrames with Python
Joining two separate Excel-based dataframes that are related by a common column can be achieved using Python and the popular pandas library. In this article, we will explore how to join these dataframes based on a specific condition.
Problem Statement
We have two separate Excel files, loaded as df1 and df2, each containing different types of data. The two are related through the ceremony number: df1 stores it in its ceremony_number column as an ordinal such as '1st', while df2 uses the plain number as its index. Our goal is to create a new dataframe that combines the data from both files based on this relationship.
Step 1: Importing Libraries
To start solving this problem, we need to import the necessary libraries. We will be using pandas for data manipulation and analysis.
import pandas as pd
Step 2: Creating Sample DataFrames
We will create sample dataframes that mimic the structure of df1 and df2.
# Create df1
data1 = {'birthdate': {0: '9/30/1895', 1: '7/23/1884', 2: '3/29/1889',
                       3: '4/10/1868', 4: '4/8/1892'},
         'ceremony_number': {0: '1st', 1: '1st', 2: '2nd', 3: '3rd', 4: '2nd'}}
df1 = pd.DataFrame(data1)
# Create df2
data2 = {'dates': {1: '1929-05-16', 2: '1930-04-03', 3: '1930-11-05'}}
df2 = pd.DataFrame(data2)
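In practice the two frames would be read from Excel files rather than built from inline dictionaries. The following is only a sketch of that loading step under stated assumptions: the file names nominees.xlsx and ceremonies.xlsx are hypothetical placeholders, the ceremony number is assumed to sit in the first column of the second file, and an Excel engine such as openpyxl is assumed to be installed.

```python
import pandas as pd

def load_frames(path1="nominees.xlsx", path2="ceremonies.xlsx"):
    """Load both workbooks; the default paths are hypothetical placeholders."""
    df1 = pd.read_excel(path1)
    # index_col=0 assumes the first column of the second file holds the
    # ceremony number, so it becomes the index used for the lookup later
    df2 = pd.read_excel(path2, index_col=0)
    return df1, df2
```

The function is only defined here, not called, because the files themselves are not part of this article.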
Step 3: Converting Data Types
Both date columns are stored as strings, so before working with them we convert them to datetimes. We will use pd.to_datetime to convert the birthdate column in df1 and the dates column in df2, passing an explicit format for each.
# Convert data types
df1['birthdate'] = pd.to_datetime(df1['birthdate'], format='%m/%d/%Y')
df2['dates'] = pd.to_datetime(df2['dates'], format='%Y-%m-%d')
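Real spreadsheets often contain cells that do not match the expected format, in which case the strict conversion above raises an error. As a hedged illustration (the 'unknown' value is invented for this example and does not come from the original data), passing errors='coerce' turns unparseable entries into NaT so they can be counted and inspected:

```python
import pandas as pd

# 'unknown' is a made-up bad value used only for illustration
raw = pd.Series(['9/30/1895', 'unknown', '4/8/1892'])

# errors='coerce' replaces values that do not match the format with NaT
parsed = pd.to_datetime(raw, format='%m/%d/%Y', errors='coerce')

# Count how many entries failed to parse
n_bad = int(parsed.isna().sum())
print(parsed)
print(n_bad)
```

Whether silently coercing is acceptable depends on the data; counting the failures first is one way to decide.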
Step 4: Extracting Digits from Ceremony Number
To match the ceremony_number column in df1 with the index of df2, we need to extract the leading digits from each ceremony_number value (for example, 1 from '1st') and convert them to integers. We will use str.extract to achieve this.
# Extract digits from ceremony number
num = df1['ceremony_number'].str.extract(r'^(\d+)', expand=False).astype(int)
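The extraction step can be tried on its own to see what it produces. The sketch below uses made-up ordinals, including a missing value, which the sample data does not contain. With a missing value the plain astype(int) call would fail, so this variant goes through pd.to_numeric and the nullable Int64 dtype instead; that is a substitution for illustration, not the approach used in the article.

```python
import pandas as pd

# Illustrative values only; None stands in for an empty spreadsheet cell
ordinals = pd.Series(['1st', '2nd', '23rd', None])

# Capture the run of digits at the start of each string;
# rows without a match (or with missing input) become NaN
digits = ordinals.str.extract(r'^(\d+)', expand=False)

# Convert to numbers, then to the nullable integer dtype so that
# missing entries are kept as <NA> instead of raising an error
numbers = pd.to_numeric(digits).astype('Int64')
print(numbers)
```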
Step 5: Mapping Ceremony Number to Date
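Now that we have the digits extracted, we can map them to the corresponding dates in df2. When the map function is given a Series, it looks up each value in that Series's index, so every ceremony number is replaced by the date stored under the same index label in df2['dates'].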
# Map ceremony number to date
df1['date_oscar'] = num.map(df2['dates'])
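To make the lookup behaviour concrete, here is a small sketch on reduced data. It adds a '4th' ceremony that has no entry in df2 (an invented case, not in the original data) to show that unmatched keys come back as NaT, and it shows the same lookup written as a left merge of the extracted number against the index of df2. The helper column name num is an assumption made for this example.

```python
import pandas as pd

# Reduced sample; '4th' is deliberately absent from df2's index
df1 = pd.DataFrame({'ceremony_number': ['1st', '2nd', '4th']})
df2 = pd.DataFrame(
    {'dates': pd.to_datetime(['1929-05-16', '1930-04-03', '1930-11-05'])},
    index=[1, 2, 3],
)

num = df1['ceremony_number'].str.extract(r'^(\d+)', expand=False).astype(int)

# Lookup with map: keys missing from df2's index become NaT
mapped = num.map(df2['dates'])

# The same lookup as a left join of the extracted number on df2's index
merged = df1.assign(num=num).merge(
    df2, left_on='num', right_index=True, how='left'
)
print(mapped)
print(merged)
```

The map version is shorter when only one column is needed; the merge version brings in every column of df2 at once.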
Step 6: Combining Code into a Single Function
Once we have finished all the steps, we can combine the code into a single function that does not take any arguments. This function will perform the necessary operations and return the combined dataframe.
def join_dataframes():
    # Create df1
    data1 = {'birthdate': {0: '9/30/1895', 1: '7/23/1884', 2: '3/29/1889',
                           3: '4/10/1868', 4: '4/8/1892'},
             'ceremony_number': {0: '1st', 1: '1st', 2: '2nd', 3: '3rd', 4: '2nd'}}
    df1 = pd.DataFrame(data1)
    # Create df2
    data2 = {'dates': {1: '1929-05-16', 2: '1930-04-03', 3: '1930-11-05'}}
    df2 = pd.DataFrame(data2)
    # Convert data types
    df1['birthdate'] = pd.to_datetime(df1['birthdate'], format='%m/%d/%Y')
    df2['dates'] = pd.to_datetime(df2['dates'], format='%Y-%m-%d')
    # Extract digits from ceremony number
    num = df1['ceremony_number'].str.extract(r'^(\d+)', expand=False).astype(int)
    # Map ceremony number to date
    df1['date_oscar'] = num.map(df2['dates'])
    return df1
# Call the function and print the result
result = join_dataframes()
print(result)
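Because the function above builds its own sample data, it cannot be reused on other frames. One possible variation, sketched here with hypothetical names (add_ceremony_dates and its key and out parameters are not from the original), takes the two dataframes as arguments and returns a copy instead of modifying its input:

```python
import pandas as pd

def add_ceremony_dates(df1, df2, key='ceremony_number', out='date_oscar'):
    """Return a copy of df1 with a date column looked up from df2['dates']."""
    # Leading digits of the ordinal, e.g. 3 from '3rd'
    num = df1[key].str.extract(r'^(\d+)', expand=False).astype(int)
    result = df1.copy()  # leave the caller's frame untouched
    result[out] = num.map(df2['dates'])
    return result

# Small usage example on reduced sample data
df1 = pd.DataFrame({'ceremony_number': ['1st', '3rd']})
df2 = pd.DataFrame(
    {'dates': pd.to_datetime(['1929-05-16', '1930-04-03', '1930-11-05'])},
    index=[1, 2, 3],
)
result = add_ceremony_dates(df1, df2)
print(result)
```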
Step 7: Running the Function
Finally, we can call the function in an interactive session (for example, IPython) to see the combined dataframe.
>>> join_dataframes()
   birthdate ceremony_number date_oscar
0 1895-09-30             1st 1929-05-16
1 1884-07-23             1st 1929-05-16
2 1889-03-29             2nd 1930-04-03
3 1868-04-10             3rd 1930-11-05
4 1892-04-08             2nd 1930-04-03
This is the final output of the function, which combines the data from both df1 and df2 by matching the ceremony_number column of df1 against the index of df2.
Last modified on 2023-07-28