Fuzzy Logic for Excel Data - Pandas
Fuzzy logic is a mathematical approach to handling uncertainty and imprecision in data. In this article, we explore how to apply that idea, in the form of fuzzy string matching, to match similar records between two datasets using pandas in Python.
Introduction to Fuzzy Logic
Fuzzy logic is based on the concept of fuzzy sets, which are sets that contain elements with membership degrees between 0 and 1. This allows for a more nuanced approach to matching similar data points, rather than relying solely on exact matches.
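To make the idea of a membership degree concrete, here is a minimal sketch in plain Python. The function name triangular_membership and the "warm" temperature set with breakpoints 15, 22, and 30 are hypothetical, chosen only for illustration.

```python
def triangular_membership(x, low, peak, high):
    """Degree (0.0 to 1.0) to which x belongs to a triangular fuzzy set."""
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        # Rising edge: membership grows linearly from low to peak
        return (x - low) / (peak - low)
    # Falling edge: membership shrinks linearly from peak to high
    return (high - x) / (high - peak)


# Hypothetical fuzzy set "warm": degree 0 at 15 and 30, degree 1 at 22
for temp in (15, 18.5, 22, 26, 30):
    print(temp, triangular_membership(temp, 15, 22, 30))
```

Unlike a classical set, where 18.5 would be either "warm" or "not warm", the fuzzy set assigns it a partial degree of membership. A string similarity score plays the same role when matching records.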
In the context of matching data between two datasets, fuzzy logic can be used to identify similarities between address and name fields in different datasets. For example, if we have a raw dataset containing customer addresses and a mapping dataset containing city and zip code information, we can use fuzzy logic to match similar addresses and cities between the two datasets.
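The difference between exact and fuzzy comparison can be sketched with the standard library alone. This example uses difflib.SequenceMatcher as a stand-in scorer, not fuzzywuzzy itself, and the two address strings are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(s1, s2):
    """Similarity score from 0 to 100, ignoring letter case."""
    return round(100 * SequenceMatcher(None, s1.lower(), s2.lower()).ratio())


raw_address = "12 MG Road, Bengaluru"        # hypothetical raw record
mapped_address = "12 M.G. Road, Bangalore"   # hypothetical mapping record

# An exact comparison treats the two spellings as unrelated
print(raw_address == mapped_address)

# A similarity score shows that they are close, though not identical
print(similarity(raw_address, mapped_address))
```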
Installing Required Libraries
To implement fuzzy logic in pandas, we will need to install the following libraries:
pandas: A popular Python library for data manipulation and analysis.
numpy: A library for efficient numerical computation in Python.
fuzzywuzzy: A Python library that provides a simple way to perform fuzzy string matching.
We can install these libraries using pip:
pip install pandas numpy fuzzywuzzy
Reading Data from Excel Files
To begin, we need to read our raw dataset and mapping dataset from Excel files. We will use the read_excel function from pandas to load the data into dataframes.
import pandas as pd
# Read raw data from the Excel file (read_excel has no "index" argument)
df = pd.read_excel("raw_data.xlsx")
# Read mapping data from the Excel file
mp = pd.read_excel("map_data.xlsx")
Preprocessing Data
Before we can perform fuzzy matching, we need a way to score how similar two address or name strings are. We will use the token_sort_ratio function from fuzzywuzzy, which splits both strings into tokens, sorts the tokens alphabetically, and then compares the results, so differences in word order do not lower the score.
from fuzzywuzzy import fuzz
def calculate_similarity(address1, address2):
    # Integer score from 0 (nothing in common) to 100 (identical after token sorting)
    return fuzz.token_sort_ratio(address1, address2)
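The effect of sorting tokens before comparing can be illustrated without any third-party package. The sketch below is an approximation of the idea built on difflib, not fuzzywuzzy's own implementation, and the names plain_score and token_sort_score are hypothetical.

```python
from difflib import SequenceMatcher


def plain_score(s1, s2):
    """Compare the strings as written (case-insensitive), 0 to 100."""
    return round(100 * SequenceMatcher(None, s1.lower(), s2.lower()).ratio())


def token_sort_score(s1, s2):
    """Lowercase, sort the words alphabetically, then compare, 0 to 100."""
    a = " ".join(sorted(s1.lower().split()))
    b = " ".join(sorted(s2.lower().split()))
    return round(100 * SequenceMatcher(None, a, b).ratio())


first = "City Hospital Chennai"     # made-up example values
second = "Chennai City Hospital"

# Word order lowers the plain score but not the token-sorted score
print(plain_score(first, second))
print(token_sort_score(first, second))
```

This is why a token-sorted scorer suits fields like hospital names and addresses, where the same words often appear in a different order.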
Fuzzy Matching
Now that we have a similarity function, we can perform fuzzy matching between the raw dataset and the mapping dataset. We will use the process.extractOne function from fuzzywuzzy to find the most similar match. It returns the best choice together with its score, or None if no choice reaches score_cutoff.
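from fuzzywuzzy import process
def fuzzy_match(address1, address2):
    # Best match as a (choice, score) tuple, or None if the score is below 70
    return process.extractOne(address1, choices=[address2], scorer=fuzz.ratio, score_cutoff=70)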
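Picking the best candidate above a cutoff is most useful when the query is compared against a whole list of candidates rather than a single one. The following is a minimal standard-library sketch of that selection logic. The helper best_match and the city list are hypothetical, and difflib stands in for the fuzzywuzzy scorer.

```python
from difflib import SequenceMatcher


def best_match(query, choices, score_cutoff=70):
    """Return (choice, score) for the most similar choice, or None below the cutoff."""
    best = None
    for choice in choices:
        score = round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio())
        # Keep the candidate only if it passes the cutoff and beats the current best
        if score >= score_cutoff and (best is None or score > best[1]):
            best = (choice, score)
    return best


cities = ["Chennai", "Mumbai", "Bengaluru"]  # hypothetical mapping values

print(best_match("Chenai", cities))  # a misspelling that is still close to one city
print(best_match("Delhi", cities))   # nothing in the list is close enough
```

With fuzzywuzzy, the same pattern would pass the full list of mapping values as choices, so that each raw value is scored against every candidate.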
Merging Data
Next, we need to merge our raw dataset and mapping dataset into a single dataframe. We will use the merge method from pandas to perform an outer join on the 'Hospital Name', 'City', and 'Pincode' fields. Columns that exist in both dataframes but are not join keys receive the suffixes _x (from the mapping dataframe) and _y (from the raw dataframe).
# Merge raw data and mp data as following
dfr = mp.merge(df, on=['Hospital Name', 'City', 'Pincode'], how='outer')
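To see what the outer join produces, here is a small sketch with made-up rows. The Address column and all values are assumptions for illustration. The indicator=True argument adds a _merge column that shows whether each row came from the mapping data, the raw data, or both.

```python
import pandas as pd

# Hypothetical mapping and raw rows; only the key column names come from the merge call
mp = pd.DataFrame({
    "Hospital Name": ["City Hospital", "Lake Clinic"],
    "City": ["Chennai", "Mumbai"],
    "Pincode": [600001, 400001],
    "Address": ["1 Main Street", "9 Lake Road"],
})
df = pd.DataFrame({
    "Hospital Name": ["City Hospital", "Hill Care"],
    "City": ["Chennai", "Delhi"],
    "Pincode": [600001, 110001],
    "Address": ["1 Main St", "4 Hill Lane"],
})

merged = mp.merge(df, on=["Hospital Name", "City", "Pincode"], how="outer", indicator=True)

# Address appears in both frames, so it is split into Address_x (mp) and Address_y (df)
print(merged[["Hospital Name", "Address_x", "Address_y", "_merge"]])
```

Rows marked left_only or right_only have no exact key match on the other side, which is where missing values in the merged dataframe come from.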
Eliminating Duplicate Rows
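Since our merge operation may produce several candidate rows for the same address, we need to keep only the best one. We will use the groupby method from pandas to group the data by address and then select the row with the highest similarity score in each group. This assumes the merged dataframe has a SCORE column holding the similarity score for each row, for example computed with calculate_similarity on Address_x and Address_y.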
# Keep only the highest-scoring row for each Address_x.
# idxmax returns index labels, so select with .loc; copy() avoids chained-assignment warnings later.
dfr1 = dfr.loc[dfr.groupby('Address_x')['SCORE'].idxmax()].copy()
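The selection of the best row per group can be checked on a small table. The candidate pairs and scores below are made up for illustration.

```python
import pandas as pd

# Hypothetical candidate pairs with precomputed similarity scores
pairs = pd.DataFrame({
    "Address_x": ["1 Main St", "1 Main St", "9 Lake Rd"],
    "Address_y": ["1 Main Street", "1 Maine Ave", "9 Lake Road"],
    "SCORE": [88, 70, 90],
})

# For each Address_x, idxmax gives the index label of the row with the top SCORE
best_index = pairs.groupby("Address_x")["SCORE"].idxmax()

# Label-based lookup with .loc keeps exactly one row per Address_x
best_rows = pairs.loc[best_index]
print(best_rows)
```

Because idxmax returns index labels rather than positions, .loc is the matching accessor. Position-based .iloc gives the same rows only when the index happens to be the default 0, 1, 2, ... range.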
Handling Missing Values
Finally, we need to handle missing values in our merged dataframe. If the mapping dataset does not contain a match for a particular address or name, the corresponding value in the merged dataframe will be missing, and we can replace it with a default value.
# Handle missing values
dfr1['ID'] = dfr1['ID'].fillna('Unknown')
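A short sketch of the fill step on made-up rows follows. The ID values are hypothetical, and the extra matched flag is an optional addition that records which rows had a real ID before the default was applied.

```python
import pandas as pd

# Hypothetical result after matching: the second row found no ID
result = pd.DataFrame({
    "Address_x": ["1 Main St", "4 Hill Lane"],
    "ID": ["H001", None],
})

# Record which rows had a real ID before filling
result["matched"] = result["ID"].notna()

# Replace missing IDs with a default label
result["ID"] = result["ID"].fillna("Unknown")
print(result)
```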
Sample Data
To test our solution, we can use sample data provided by the original poster. The sample data is available online and can be downloaded in Excel format.
Code Implementation
Here is a complete code implementation of the solution:
import pandas as pd
from fuzzywuzzy import fuzz, process

# Read raw data from the Excel file
df = pd.read_excel("raw_data.xlsx")
# Read mapping data from the Excel file
mp = pd.read_excel("map_data.xlsx")

def calculate_similarity(address1, address2):
    return fuzz.token_sort_ratio(address1, address2)

def fuzzy_match(address1, address2):
    return process.extractOne(address1, choices=[address2], scorer=fuzz.ratio, score_cutoff=70)

# Merge the mapping data and the raw data
dfr = mp.merge(df, on=['Hospital Name', 'City', 'Pincode'], how='outer')
# Score each merged row. Assumption: both files have an Address column,
# so the merge produces Address_x and Address_y.
dfr['SCORE'] = dfr.apply(lambda row: calculate_similarity(str(row['Address_x']), str(row['Address_y'])), axis=1)
# Keep only the highest-scoring row for each Address_x
dfr1 = dfr.loc[dfr.groupby('Address_x')['SCORE'].idxmax()].copy()
# Handle missing values
dfr1['ID'] = dfr1['ID'].fillna('Unknown')
Conclusion
In this article, we have explored how to use fuzzy logic to match similar data points between two datasets using pandas in Python. We have implemented a solution that merges our raw dataset and mapping dataset into a single dataframe and eliminates duplicate rows based on similarity scores. Finally, we have handled missing values by setting the corresponding value in the merged dataframe to a default value.
Last modified on 2025-03-09