Fuzzy Merge: A Python Approach for Text Similarity Based Data Alignment

Introduction

In data analysis, merging dataframes from different sources is a common requirement. When the join keys are free-text values such as names or product descriptions, an exact-match merge misses rows whose keys differ only by typos, abbreviations, or formatting. This is where fuzzy matching comes into play.

Fuzzy matching finds strings that are approximately, rather than exactly, equal. In this article, we’ll explore how to use Python’s difflib library and pandas to build a custom class called FuzzyMerge that merges two dataframes based on the text similarity of their key columns.

Understanding Fuzzy Matching

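Fuzzy matching compares strings using a similarity metric such as Levenshtein distance, Jaro-Winkler similarity, or cosine similarity over character n-grams. Levenshtein distance counts the minimum number of single-character edits (insertions, deletions, and substitutions) needed to turn one string into another, while Jaro-Winkler rewards matching characters and a shared prefix. The difflib module used in this article implements neither of these: its SequenceMatcher class uses an algorithm similar to Ratcliff/Obershelp "gestalt pattern matching" and reports a similarity ratio between 0 and 1.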
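
To make the difference between these metrics concrete, the following sketch compares a hand-written Levenshtein distance with the ratio reported by difflib.SequenceMatcher. The helper names levenshtein and similarity_ratio are illustrative only and are not part of FuzzyMerge.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,          # delete char_a
                current[j - 1] + 1,       # insert char_b
                previous[j - 1] + cost,   # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] as computed by difflib (1.0 means identical)."""
    return SequenceMatcher(None, a, b).ratio()


print(levenshtein("apple", "aple"))        # 1 (one deleted character)
print(similarity_ratio("apple", "aple"))   # a ratio close to, but below, 1.0
```

Note that the two metrics point in opposite directions: a smaller Levenshtein distance means more similar strings, whereas a larger difflib ratio does.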

The difflib library provides a function called get_close_matches(word, possibilities, n=3, cutoff=0.6). It takes the string to match (word), a list of candidate strings (possibilities), the maximum number of matches to return (n), and a similarity threshold between 0 and 1 (cutoff). It returns a list of up to n candidates whose similarity score is at least cutoff, sorted from most to least similar.
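
A short illustration of get_close_matches on its own may help here. The candidates list and the query strings are made-up sample values.

```python
import difflib

candidates = ["apple", "banana", "cherry"]

# Ask for at most one match that scores at least 0.6.
close = difflib.get_close_matches("aple", candidates, n=1, cutoff=0.6)
print(close)  # ['apple']

# When no candidate reaches the cutoff, the result is an empty list.
none_found = difflib.get_close_matches("xyz", candidates, n=1, cutoff=0.6)
print(none_found)  # []
```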

Setting Up the Environment

To get started with fuzzy merge using Python’s difflib library and pandas, you need to have the following dependencies installed:

Python 3.x
pandas
difflib (part of the Python Standard Library)

You can install pandas using pip (difflib ships with Python and needs no installation):

pip install pandas

Defining the FuzzyMerge Class

The FuzzyMerge class is designed to work like a traditional pandas merge but also merges on approximate matches. Here’s how you can define it:

The FuzzyMerge Class Definition

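import difflib
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class FuzzyMerge:
    """
    Works like pandas merge except also merges on approximate matches.
    """
    left: pd.DataFrame
    right: pd.DataFrame
    left_on: str
    right_on: str
    how: str = "inner"
    cutoff: float = 0.3

    def main(self) -> pd.DataFrame:
        # Work on a copy so the caller's right dataframe is not modified.
        temp = self.right.copy()
        # Add a column named after left_on holding, for each right_on value,
        # its closest match among the left_on values (None if there is none).
        temp[self.left_on] = [
            self.get_closest_match(x, self.left[self.left_on]) for x in temp[self.right_on]
        ]

        # An exact merge on the new column completes the fuzzy join.
        return self.left.merge(temp, on=self.left_on, how=self.how)

    def get_closest_match(self, left: str, right: pd.Series) -> Optional[str]:
        # get_close_matches returns candidates sorted from best to worst.
        matches = difflib.get_close_matches(left, right, cutoff=self.cutoff)

        return matches[0] if matches else None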

Explanation of the FuzzyMerge Class

The FuzzyMerge class has several key attributes:

left: The left dataframe used for merging.
right: The right dataframe used for merging.
left_on: The column name from the left dataframe to merge on.
right_on: The column name from the right dataframe to merge on.
how: The type of merge to perform. Options include "inner", "outer", and others. Defaults to "inner".
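cutoff: The similarity threshold for fuzzy matching, between 0 and 1. A value closer to 1 demands closer matches and therefore produces fewer of them. The class defaults to 0.3, which is more permissive than the 0.6 default of difflib.get_close_matches.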
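
The effect of cutoff can be sketched with difflib alone. The helper closest and the sample words below are illustrative assumptions, not part of the class.

```python
import difflib

candidates = ["apple", "banana", "cherry"]


def closest(word: str, cutoff: float):
    """Return the best candidate scoring at least `cutoff`, or None."""
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A near miss is accepted at a moderate cutoff but rejected at a strict one.
print(closest("cheri", 0.6))   # 'cherry'
print(closest("cheri", 0.9))   # None

# An unrelated word is rejected at 0.6, but a permissive cutoff of 0.3
# still pairs it with a weakly similar candidate.
print(closest("orange", 0.6))  # None
print(closest("orange", 0.3))
```

A low cutoff such as 0.3 therefore trades missed matches for possible false positives, so it is worth checking the merged rows by eye when lowering it.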

The main method performs the actual merging:

It creates a copy of the right dataframe (temp) to avoid modifying it directly.
For each string in the right_on column of temp, it finds the closest match among the left_on values of the left dataframe using the get_closest_match function and stores it in a new column of temp named after left_on.
Finally, it merges the left dataframe with temp on that new column, using the merge type given by how.
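
The three steps above can also be written out procedurally, which makes the intermediate temp dataframe visible. This is a sketch with made-up sample data; the price and stock columns and the best_match helper are assumptions added for illustration.

```python
import difflib

import pandas as pd

left = pd.DataFrame({"A": ["apple", "banana", "cherry"], "price": [1.0, 0.5, 3.0]})
right = pd.DataFrame({"B": ["aple", "banan", "cheri"], "stock": [10, 20, 30]})


def best_match(word, candidates, cutoff=0.6):
    """Best candidate scoring at least `cutoff`, or None."""
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Step 1: copy the right dataframe so the original stays untouched.
temp = right.copy()

# Step 2: add a column "A" with the closest left-hand key for each "B" value.
candidates = left["A"].tolist()
temp["A"] = [best_match(value, candidates) for value in temp["B"]]

# Step 3: an ordinary exact merge on the shared column "A".
merged = left.merge(temp, on="A", how="inner")
print(merged)
```

After step 2, temp carries both the original spelling (B) and the matched key (A), so the merged result keeps both columns side by side.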

The get_closest_match Function

The get_closest_match function uses the difflib.get_close_matches function to find the closest match:

# Optional comes from the typing module.
def get_closest_match(self, left: str, right: pd.Series) -> Optional[str]:
    matches = difflib.get_close_matches(left, right, cutoff=self.cutoff)

    return matches[0] if matches else None

This function takes two parameters:

left: The string to find a match for.
right: The set of strings to search through.

It returns the best match: get_close_matches sorts its results from most to least similar, so the first element has the highest score. If no candidate reaches the cutoff, it returns None.
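
Because get_close_matches returns only the matched strings, it can be useful when tuning the cutoff to see the score as well. The following is a sketch of one way to do that; closest_match_with_score is a hypothetical helper, not part of the FuzzyMerge class.

```python
import difflib
from typing import List, Optional, Tuple


def closest_match_with_score(
    word: str, candidates: List[str], cutoff: float = 0.6
) -> Tuple[Optional[str], float]:
    """Return (best candidate, similarity ratio), or (None, 0.0) if nothing qualifies."""
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    if not matches:
        return None, 0.0
    best = matches[0]
    # Recompute the ratio for the winner, with the candidate as the first
    # sequence and the query word as the second.
    score = difflib.SequenceMatcher(None, best, word).ratio()
    return best, score


fruits = ["apple", "banana", "cherry"]
print(closest_match_with_score("aple", fruits))
print(closest_match_with_score("xyz", fruits))   # (None, 0.0)
```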

Using the FuzzyMerge Class

To use the FuzzyMerge class, you can follow these steps:

Create two dataframes: left_df and right_df.
Define the columns to merge on.
Pass these dataframes, column names, and other parameters to the FuzzyMerge class constructor.
Call the main method to perform the fuzzy merging.

Here’s an example:

# Create sample dataframes
left_df = pd.DataFrame({
    'A': ['apple', 'banana', 'cherry']
})

right_df = pd.DataFrame({
    'B': ['aple', 'banan', 'cheri']
})

# Create a FuzzyMerge instance with the dataframes and join columns
fuzzy_merge = FuzzyMerge(
    left=left_df,
    right=right_df,
    left_on='A',
    right_on='B'
)

# Perform fuzzy merging
result_df = fuzzy_merge.main()

print(result_df)
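
The how parameter behaves as it does in a regular pandas merge. As a sketch, the snippet below repeats a condensed version of the class so that it runs on its own, adds a hypothetical 'mango' row that has no counterpart on the right, and compares an inner merge with a left merge.

```python
import difflib
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class FuzzyMerge:
    """Condensed copy of the class described in this article."""
    left: pd.DataFrame
    right: pd.DataFrame
    left_on: str
    right_on: str
    how: str = "inner"
    cutoff: float = 0.3

    def main(self) -> pd.DataFrame:
        temp = self.right.copy()
        temp[self.left_on] = [
            self.get_closest_match(x, self.left[self.left_on]) for x in temp[self.right_on]
        ]
        return self.left.merge(temp, on=self.left_on, how=self.how)

    def get_closest_match(self, left: str, right: pd.Series) -> Optional[str]:
        matches = difflib.get_close_matches(left, right, cutoff=self.cutoff)
        return matches[0] if matches else None


left_df = pd.DataFrame({"A": ["apple", "banana", "cherry", "mango"]})
right_df = pd.DataFrame({"B": ["aple", "banan", "cheri"]})

# An inner merge keeps only left rows that some right row was matched to.
inner_df = FuzzyMerge(left_df, right_df, left_on="A", right_on="B").main()

# A left merge keeps every left row; 'mango' gets a missing value in column B.
left_join_df = FuzzyMerge(left_df, right_df, left_on="A", right_on="B", how="left").main()

print(inner_df)
print(left_join_df)
```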

Conclusion

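Fuzzy merge using Python’s difflib library and pandas provides a practical approach for aligning data based on text similarity. By understanding how the FuzzyMerge class works, you can adapt it to your needs, for example by tuning cutoff or by swapping in a different similarity metric. Because every right_on value is compared against every left_on value, the approach is best suited to small and medium-sized dataframes.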

Last modified on 2024-03-18