Using regex to Group Similar Expressions in a Dataset Without Prior Knowledge of Those Groups Using R's stringr and qdap Packages

R stringr Regex Strategy for Grouping Like Expressions Without Prior Knowledge

Introduction

In this article, we will discuss how to group similar expressions in a dataset using the stringr and qdap packages in R. We’ll cover the basics of regular expressions, string manipulation, and data analysis.

The problem at hand is to take a list of 50K+ part numbers with descriptions and determine each part's product type from its description, without knowing the product types in advance. The descriptions may not follow consistent formatting rules, and their words may not appear in a consistent order.

We will use the stringr package for string manipulation, including regular expressions (regex), and the qdap package to find the frequencies of the words in the description column.

Step 1: Understanding Regular Expressions

Regular expressions (regex) are a way to describe patterns in strings using special characters. In R’s stringr package, regex can be used for text manipulation.
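
As a minimal sketch of what such a pattern looks like in practice, the snippet below uses alternation (the | operator) to test whether a string contains one of two keywords. The sample descriptions and the variable names are made up for illustration.

```r
library(stringr)

# Hypothetical sample descriptions, not taken from a real dataset
descriptions <- c("Water Bottle", "Candy Energy", "Bar Protein")

# "Water|Bar" matches any string that contains "Water" or "Bar"
has_keyword <- str_detect(descriptions, "Water|Bar")
has_keyword
```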

The pattern we want to match is the product type, inferred from common words that recur across descriptions. The same word may also appear several times in a row, which calls for a specific approach.

One key point here is consecutive word matching: the pattern should match not just a single occurrence of a word but also one or more repetitions of that word immediately after it.
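
One possible way to express consecutive repetition is a capture group followed by a quantified backreference. This is only a sketch under the assumption that repeated words are separated by whitespace; the input strings are hypothetical.

```r
library(stringr)

# Hypothetical strings, two of which contain a word repeated in a row
x <- c("water water bottle", "bar bar bar protein", "candy energy")

# (\\w+) captures a word; (\\s+\\1)+ requires that same word to follow
# immediately, one or more times
repeated <- str_extract(x, "\\b(\\w+)(\\s+\\1)+\\b")
repeated
```

Strings without a repeated word yield NA, so the result can also serve as a filter for descriptions that contain such runs.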

Another challenge is inconsistent capitalization (e.g., “Water” vs. “water”), so the regex pattern must account for case differences.
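
Two common ways to handle case differences with stringr are sketched below: marking the pattern as case-insensitive, or lowercasing the text before matching. The sample strings and variable names are illustrative only.

```r
library(stringr)

# Hypothetical descriptions with mixed capitalization
descriptions <- c("Water Bottle", "contain WATER", "Candy Energy")

# Option 1: make the pattern itself case-insensitive
by_flag <- str_detect(descriptions, regex("water", ignore_case = TRUE))

# Option 2: normalize the text first, then match a lowercase pattern
by_lower <- str_detect(str_to_lower(descriptions), "water")
```

Both options give the same result here. Lowercasing is convenient when the search terms themselves come from a lowercased word list.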

Step 2: Handling Data Preparation

First, we’ll load the necessary packages (tidyverse for tribble(), plus stringr and qdap) and prepare our data. We create a small example dataset with part numbers and descriptions, as shown below.

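library(tidyverse)  # provides tribble(); also attaches stringr
library(stringr)
library(qdap)

# Every row must supply a value for each of the three columns.
# ProductType starts out empty and is filled in later.
df <- tribble(
  ~PartNo, ~Description, ~ProductType,
  "A000443", "Water Bottle", NA_character_,
  "A000445", "Contain Water", NA_character_,
  "A000448", "WaterBotHold", NA_character_,
  "HRZ55", "Hershey_Bar", NA_character_,
  "RRB55", "Candy Energy", NA_character_,
  "QMU55", "Bar Protein", NA_character_
)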

Step 3: Finding Frequencies and Creating Regex Pattern

To find the most frequent words across all descriptions, we use qdap’s freq_terms() function. It returns a data frame of terms (column WORD) with their frequency counts (column FREQ), sorted from most to least frequent. By default it returns only the 20 most frequent terms.

We’re interested only in the top N (say 2) most frequent words. These will help form our regex pattern to match similar descriptions.

# Find frequencies of different words
freq <- freq_terms(df$Description)

# Select top n words and create regex pattern
word_to_search <- paste0(freq$WORD[1:2], collapse = "|")
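
If qdap is not available, a similar word count can be approximated with stringr and base R. The sketch below is a substitute technique, not what freq_terms() does internally: it assumes that words are delimited by non-alphanumeric characters or by a lowercase-to-uppercase transition, and all variable names are illustrative.

```r
library(stringr)

descriptions <- c("Water Bottle", "Contain Water", "WaterBotHold",
                  "Hershey_Bar", "Candy Energy", "Bar Protein")

# Insert a space at lowercase-to-uppercase transitions,
# e.g. "WaterBotHold" becomes "Water Bot Hold"
split_camel <- str_replace_all(descriptions, "([a-z])([A-Z])", "\\1 \\2")

# Lowercase, then split on anything that is not a letter or digit
tokens <- unlist(str_split(str_to_lower(split_camel), "[^a-z0-9]+"))
tokens <- tokens[tokens != ""]

# Count each token and sort from most to least frequent
word_counts <- sort(table(tokens), decreasing = TRUE)
top_words <- names(word_counts)[1:2]

# Build the same kind of alternation pattern as above
pattern <- paste0(top_words, collapse = "|")
```

Splitting run-together descriptions such as "WaterBotHold" before counting lets their words contribute to the frequencies, which a plain whitespace split would miss.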

Step 4: String Matching with Regex

Using the word_to_search pattern, we search the description column of our data frame (df) after converting the descriptions to lowercase. Lowercasing ensures that the same word is recognized regardless of its capitalization.

# Lowercase a copy of the descriptions so the original text is preserved
description_lower <- str_to_lower(df$Description)

# Extract the first frequent word found in each description as its product type
df$ProductType <- str_extract(description_lower, word_to_search)
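
Two properties of this matching step are worth checking on your own data, and the sketch below illustrates them with made-up strings. First, str_extract() returns the match that occurs earliest in the string, whatever the order of the alternatives in the pattern. Second, without word boundaries a short term can match inside a longer, unrelated word.

```r
library(stringr)

pattern <- "water|bar"

# The first match by position wins, not the first alternative in the pattern
first_match <- str_extract(c("bar for water bottle", "water bar"), pattern)

# str_extract_all() keeps every match, which exposes ambiguous descriptions
all_matches <- str_extract_all("bar for water bottle", pattern)[[1]]

# Without word boundaries, "bar" also matches inside "barrel"
loose  <- str_extract("oak barrel", "water|bar")
strict <- str_extract("oak barrel", "\\b(water|bar)\\b")
```

Note the trade-off: adding \\b would stop the pattern from matching run-together descriptions such as "waterbothold" or "hershey_bar" (the underscore counts as a word character), so the unanchored pattern used above is the more permissive choice.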

Results

Our final step is to view the updated data frame with newly calculated ProductType based on our stringr and qdap strategy.

print(df)
#    PartNo   Description ProductType
# 1 A000443  Water Bottle       water
# 2 A000445 Contain Water       water
# 3 A000448  WaterBotHold       water
# 4   HRZ55   Hershey_Bar         bar
# 5   RRB55  Candy Energy        <NA>    # no match for "water" or "bar"
# 6   QMU55   Bar Protein         bar

As you can see, the strategy identifies “water” and “bar” as product types for the descriptions that include these words. One part (RRB55) was left without a product type; such cases could be handled by including more of the top terms in the regex pattern or by adding fallback logic for unmatched rows.
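
One possible form of such additional logic is sketched below: unmatched rows get an explicit placeholder label, and their descriptions are set aside for a second pass. The "unclassified" label and the variable names are hypothetical choices, not part of the original approach.

```r
library(stringr)

# Hypothetical lowercased descriptions
descriptions <- c("water bottle", "hershey_bar", "candy energy")
product_type <- str_extract(descriptions, "water|bar")

# Label unmatched rows explicitly so they can be reviewed later
product_type_filled <- ifelse(is.na(product_type), "unclassified", product_type)

# Collect the unmatched descriptions for a second pass,
# for example a new frequency count restricted to these rows
unmatched <- descriptions[is.na(product_type)]
```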

Conclusion

This guide has walked you through how to group similar expressions in a dataset without prior knowledge of those groups using R’s stringr and qdap packages. By understanding regular expressions, building patterns from the most frequent words in your data, and leveraging frequency analysis from the qdap package, we can solve real-world problems like this efficiently.


Last modified on 2023-07-11