Combining Two Datasets and Creating a New Column Based on Specific Conditions in Python

In this article, we will explore how to combine two datasets from different sources (in this case, MySQL DB and Snowflake DB) and create a new column based on specific conditions using Python.

Introduction

We often find ourselves dealing with multiple datasets that need to be merged or combined for analysis, data visualization, or other purposes. In this article, we will focus on combining two datasets from different sources (in this case, MySQL DB and Snowflake DB) and create a new column based on specific conditions using Python.

Background

To understand how to combine two datasets and create a new column based on specific conditions, it’s essential to have some knowledge of Python programming language. Specifically, we will be using the pandas library, which is a powerful data analysis tool in Python.

We’ll also assume that you are familiar with SQL and data manipulation concepts.

Using Merge/Concat Function

The first approach is to use the merge or concat function from pandas to combine the two datasets. However, this method may not be suitable for all scenarios, as it can lead to issues such as data type mismatches or loss of data.

# Assuming that the datasets come as multiline strings:
src_df = """\
emp name
Name1
Name2
Name3
Name4
Name5
Name6
Name7
Name8
Name9
Name10"""

tgt_df = """\
emp name
Name1
NAME2
Name3
Name4
Name5
Name6
Name7

Name9
Name10
Name11"""

To combine these datasets, we can use the concat function:

import pandas as pd

merged_df = pd.concat([src_df_1, tgt_df_1], axis=1)

However, this approach may not be suitable for all scenarios. In the next section, we’ll explore a better approach using the pandas library.

Using Pandas Apply Function

A better approach to combine two datasets and create a new column based on specific conditions is to use the apply function from pandas. This method allows us to define a custom function that takes a row from the dataframe as input and returns a new value for the ‘Record Validation’ column.

def record_validation(row, case_sensitive=True):
    #       can be changed to False 
    global count_matrix_dct
    #           v-- first column of the merged DataFrame
    if row.iloc[0] == '' or pd.isnull(row.iloc[0]): 
        count_matrix_dct['missing data'] += 1
        row.iloc[0] = 'null'
        return 'Missing'
    #             v-- second column of the merged DataFrame
    elif row.iloc[1] == '' or pd.isnull(row.iloc[1]):
        count_matrix_dct['extra'] += 1
        row.iloc[1] = 'null'
        return 'Extra'
    if case_sensitive: 
        # is used to remove leading/trailing spaces  
        ismatch = row.iloc[0].strip() == row.iloc[1].strip()
    else: 
        # assure a not case sensitive  Match                
        ismatch = row.iloc[0].upper().strip() == row.iloc[1].upper().strip()
    if ismatch: 
        count_matrix_dct['match data'] += 1
        return 'Match'
    else:
        count_matrix_dct['mismatch data'] += 1
        return 'Mismatch'
    raise ValueError("None of the conditions gives a return value")

To use this function, we need to apply it to each row of the merged dataframe:

merged_df['Record Validation'] = merged_df.apply(record_validation, axis=1)

This approach allows us to define a custom function that takes into account specific conditions and returns a new value for the ‘Record Validation’ column.

Using Dictionary for Count Matrix

To create a count matrix, we can use a dictionary to store the counts for each category. In this case, we’ll use a dictionary with four categories: ‘match data’, ‘mismatch data’, ‘missing data’, and ’extra’.

count_matrix_dct = {
    'match data'    : 0,
    'mismatch data' : 0,
    'missing data'  : 0,
    'extra'         : 0
}

We can then update this dictionary as we iterate over the merged dataframe:

for index, row in merged_df.iterrows():
    if row['Record Validation'] == 'Match':
        count_matrix_dct['match data'] += 1
    elif row['Record Validation'] == 'Mismatch':
        count_matrix_dct['mismatch data'] += 1
    elif pd.isnull(row['src_emp_name']) or pd.isnull(row['tgt_emp_name']):
        count_matrix_dct['missing data'] += 1
    else:
        count_matrix_dct['extra'] += 1

This approach allows us to create a count matrix that accurately reflects the counts for each category.

Printing Results

Finally, we can print the results using the print_results function:

def print_results():
   print(count_matrix_dct)
   print('====================')
   print("count matrix")
   for k,v in count_matrix_dct.items(): print(f'{k:15} {v:5d}')
   print('====================')
   count_matrix_df = merged_df.groupby(['Record Validation'])['Record Validation'].count()
   print(count_matrix_df)
   print('====================')
   print(merged_df)

This function prints the count matrix, the merged dataframe with the ‘Record Validation’ column, and the original datasets.

Conclusion

In this article, we explored how to combine two datasets from different sources (in this case, MySQL DB and Snowflake DB) and create a new column based on specific conditions using Python. We discussed several approaches, including using the merge or concat function, the apply function, and dictionary-based counting. Each approach has its strengths and weaknesses, and the choice of approach depends on the specific use case and requirements.

By following the steps outlined in this article, you should be able to combine two datasets and create a new column based on specific conditions using Python.

Last modified on 2024-01-09