Fuzzy Matching in Excel Data Using Pandas and Python

Fuzzy Logic for Excel Data - Pandas

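Fuzzy matching is a way to compare values that are similar but not identical, which is common in real-world data full of typos, abbreviations, and inconsistent formatting. In this article, we will explore how to use fuzzy string matching (often loosely called fuzzy logic) to match similar data points between two datasets using pandas in Python.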

Introduction to Fuzzy Logic

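Strictly speaking, fuzzy logic is based on the concept of fuzzy sets, whose elements have membership degrees between 0 and 1 instead of simply belonging or not belonging. Fuzzy string matching borrows the same idea of graded answers: rather than reporting only whether two strings are equal, it reports a similarity score (from 0 to 100 in fuzzywuzzy). This allows for a more nuanced approach to matching similar data points than relying solely on exact matches.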

In the context of matching data between two datasets, fuzzy logic can be used to identify similarities between address and name fields in different datasets. For example, if we have a raw dataset containing customer addresses and a mapping dataset containing city and zip code information, we can use fuzzy logic to match similar addresses and cities between the two datasets.
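To make the idea of a graded similarity score concrete, here is a minimal sketch that uses only Python's standard-library difflib as a stand-in scorer. It is not fuzzywuzzy itself, and the function name similarity and the sample addresses are made up for illustration.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score (stand-in for a fuzzy matching scorer)."""
    # SequenceMatcher.ratio() returns a float between 0.0 and 1.0
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

# Hypothetical addresses: two spellings of the same place and an unrelated one
raw_address = "12 MG Road, Bengaluru"
map_address = "12 M.G. Road, Bangalore"
unrelated = "Plot 7, Sector 18, Noida"

close_score = similarity(raw_address, map_address)
far_score = similarity(raw_address, unrelated)
print(close_score, far_score)  # the first score should be clearly higher
```

An exact-match comparison would treat both pairs as simply "not equal"; the score is what lets us rank one candidate above another.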

Installing Required Libraries

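To implement fuzzy matching in pandas, we will need to install the following libraries:

pandas: A popular Python library for data manipulation and analysis.
numpy: A library for efficient numerical computation in Python. It is installed as a pandas dependency; the code in this article does not import it directly.
fuzzywuzzy: A Python library that provides a simple way to perform fuzzy string matching.
openpyxl: The engine pandas uses to read .xlsx files.

We can install these libraries using pip:

pip install pandas numpy fuzzywuzzy openpyxl

Installing python-Levenshtein as well is optional; fuzzywuzzy runs without it but warns that it is using a slower pure-Python matcher.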

Reading Data from Excel Files

To begin, we need to read our raw dataset and mapping dataset from Excel files. We will use the read_excel function from pandas to load the data into dataframes. Note that read_excel has no index parameter (that option belongs to to_excel, which writes files), so we simply pass the file name.

import pandas as pd

# Read raw data from Excel file
df = pd.read_excel("raw_data.xlsx")

# Read map data from Excel file
mp = pd.read_excel("map_data.xlsx")

Preprocessing Data

Before we can perform fuzzy matching, we need a way to score how similar two strings are. We will wrap the token_sort_ratio function from fuzzywuzzy, which splits both strings into tokens, sorts the tokens alphabetically, and then compares the results, so that differences in word order do not lower the score. It returns an integer from 0 to 100.

from fuzzywuzzy import fuzz

def calculate_similarity(address1, address2):
    return fuzz.token_sort_ratio(address1, address2)
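The effect of sorting tokens before comparing can be sketched with the standard library alone. The code below is only an approximation of the idea behind token_sort_ratio, using difflib as a stand-in scorer; the helper names token_sort, plain_score, and token_sorted_score and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher

def token_sort(text: str) -> str:
    """Lowercase, split on whitespace, sort the tokens, and rejoin them."""
    return " ".join(sorted(text.lower().split()))

def plain_score(a: str, b: str) -> int:
    """Score the strings as written (word order matters)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def token_sorted_score(a: str, b: str) -> int:
    """Score the strings after sorting their tokens (word order ignored)."""
    return round(100 * SequenceMatcher(None, token_sort(a), token_sort(b)).ratio())

a = "Apollo Hospital Chennai"
b = "Chennai Apollo Hospital"
print(plain_score(a, b))         # below 100: same words, different order
print(token_sorted_score(a, b))  # 100: identical once the tokens are sorted
```

This is why a token-sorting scorer tends to suit name and address fields, where the same words often appear in a different order.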

Fuzzy Matching

Now that we have a way to score similarity, we can perform fuzzy matching between the raw dataset and the mapping dataset. We will use the process.extractOne function from fuzzywuzzy to find the most similar match. It returns a (match, score) tuple for the best-scoring choice, or None if no choice reaches score_cutoff.

from fuzzywuzzy import fuzz, process

def fuzzy_match(address1, address2):
    return process.extractOne(address1, choices=[address2], scorer=fuzz.ratio, score_cutoff=70)
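Picking the best candidate from a list with a score cutoff can be sketched as follows. This is a simplified standard-library imitation of that behaviour, not fuzzywuzzy's implementation; best_match and the city list are hypothetical, and difflib stands in for the scorer.

```python
from difflib import SequenceMatcher

def best_match(query, choices, score_cutoff=70):
    """Return (choice, score) for the best-scoring choice, or None if none reaches the cutoff."""
    best = None
    for choice in choices:
        score = round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio())
        # Keep the choice only if it reaches the cutoff and beats the current best
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score)
    return best

cities = ["Bengaluru", "Chennai", "Hyderabad", "Mumbai"]
print(best_match("Bengalooru", cities))  # a (city, score) tuple
print(best_match("Zurich", cities))      # None: nothing reaches the cutoff
```

Passing the whole list of candidates, rather than a single choice, is what lets the cutoff separate a usable match from no match at all.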

Merging Data

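Next, we merge our raw dataset and mapping dataset into a single dataframe. We will use the merge function from pandas to perform an outer join on the 'Hospital Name', 'City', and 'Pincode' fields. Columns that exist in both dataframes but are not join keys receive the suffixes _x (from mp) and _y (from df), which is where Address_x comes from. We then store the similarity of each pair of addresses in a SCORE column.

# Merge the mapping data (left) with the raw data (right)
dfr = mp.merge(df, on=['Hospital Name', 'City', 'Pincode'], how='outer')

# Score each candidate pair (assumes both files have an 'Address' column)
dfr['SCORE'] = dfr.apply(lambda row: calculate_similarity(str(row['Address_x']), str(row['Address_y'])), axis=1)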
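To see what the outer join produces, here is a small sketch with in-memory dataframes standing in for the two Excel files. All of the values, and the assumption that both files carry an Address column and that the mapping file carries ID, are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the mapping file
mp = pd.DataFrame({
    "Hospital Name": ["Apollo", "Apollo", "Fortis"],
    "City": ["Chennai", "Chennai", "Delhi"],
    "Pincode": [600006, 600006, 110088],
    "Address": ["21 Greams Lane", "Greams Road 21", "Shalimar Bagh"],
    "ID": ["H101", "H102", "H201"],
})

# Hypothetical stand-in for the raw file
df = pd.DataFrame({
    "Hospital Name": ["Apollo", "Max"],
    "City": ["Chennai", "Delhi"],
    "Pincode": [600006, 110017],
    "Address": ["21, Greams Lane", "Saket"],
})

# Outer join: keep rows from both sides, matched or not
dfr = mp.merge(df, on=["Hospital Name", "City", "Pincode"], how="outer")

# 'Address' exists on both sides, so pandas adds the _x / _y suffixes
print(dfr[["Hospital Name", "Address_x", "Address_y", "ID"]])
```

The raw Apollo row is paired with both Apollo rows of the mapping data, while the Fortis and Max rows have no partner and carry missing values on one side.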

Eliminating Duplicate Rows

Since one address can be paired with several candidate rows after the merge, we keep only the best pair for each address. We group the data by Address_x and, within each group, keep the row with the highest value in the SCORE column. Because idxmax returns index labels, we select the rows with loc rather than iloc. Rows whose Address_x is missing are dropped, because groupby skips missing keys by default.

# Keep only the highest-scoring row for each Address_x
dfr1 = dfr.loc[dfr.groupby('Address_x')['SCORE'].idxmax()]
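The selection of the best row per group can be tried on a tiny dataframe. In this sketch the addresses and SCORE values are invented for illustration rather than computed by a matcher.

```python
import pandas as pd

# Hypothetical merged data: one address has two candidate pairs
dfr = pd.DataFrame({
    "Address_x": ["21 Greams Lane", "21 Greams Lane", "Shalimar Bagh"],
    "Address_y": ["21, Greams Lane", "Greams Road", "Shalimar Bagh, Delhi"],
    "SCORE": [97, 64, 80],
})

# idxmax returns the index label of the top score within each group
best_idx = dfr.groupby("Address_x")["SCORE"].idxmax()

# Select those rows by label and renumber the result
dfr1 = dfr.loc[best_idx].reset_index(drop=True)
print(dfr1)
```

Only one row per Address_x survives, and for the duplicated address it is the pair with the higher score.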

Handling Missing Values

Finally, we need to handle missing values in our merged dataframe. If the mapping dataset does not contain a match for a particular address or name, the corresponding ID is missing after the merge, and we can replace it with a default value.

# Handle missing values (work on a copy to avoid SettingWithCopyWarning)
dfr1 = dfr1.copy()
dfr1['ID'] = dfr1['ID'].fillna('Unknown')
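As a small illustration of this step, the sketch below fills a missing ID and also records which rows actually had a match. The data and the extra matched column are assumptions added for the example.

```python
import pandas as pd

# Hypothetical result after de-duplication: the second row had no match
dfr1 = pd.DataFrame({
    "Address_y": ["21, Greams Lane", "Saket"],
    "ID": ["H101", None],
})

# Record which rows were matched before the gaps are overwritten
dfr1["matched"] = dfr1["ID"].notna()

# Replace missing IDs with a default label
dfr1["ID"] = dfr1["ID"].fillna("Unknown")
print(dfr1)
```

Keeping a separate flag avoids having to rely on the placeholder string later to tell real IDs from filled ones.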

Sample Data

To test our solution, we can use the sample data provided by the original poster, which is available online and can be downloaded in Excel format.

Code Implementation

Here is a complete code implementation of the solution:

import pandas as pd
from fuzzywuzzy import fuzz, process

# Read raw data from Excel file
df = pd.read_excel("raw_data.xlsx")

# Read map data from Excel file
mp = pd.read_excel("map_data.xlsx")

def calculate_similarity(address1, address2):
    return fuzz.token_sort_ratio(address1, address2)

def fuzzy_match(address1, address2):
    return process.extractOne(address1, choices=[address2], scorer=fuzz.ratio, score_cutoff=70)

# Merge the mapping data (left) with the raw data (right)
dfr = mp.merge(df, on=['Hospital Name', 'City', 'Pincode'], how='outer')

# Score each candidate pair (assumes both files have an 'Address' column)
dfr['SCORE'] = dfr.apply(lambda row: calculate_similarity(str(row['Address_x']), str(row['Address_y'])), axis=1)

# Keep only the highest-scoring row for each Address_x
dfr1 = dfr.loc[dfr.groupby('Address_x')['SCORE'].idxmax()].copy()

# Handle missing values
dfr1['ID'] = dfr1['ID'].fillna('Unknown')

Conclusion

In this article, we have explored how to use fuzzy string matching to match similar data points between two datasets using pandas in Python. We have implemented a solution that merges our raw dataset and mapping dataset into a single dataframe and eliminates duplicate rows based on similarity scores. Finally, we have handled missing values by setting the corresponding value in the merged dataframe to a default value.