R StringR RegExp Strategy for Grouping Like Expressions Without Prior Knowledge
Introduction
In this article, we will discuss how to group similar expressions in a dataset using the stringr and qdap packages in R. We’ll cover the basics of regular expressions, string manipulation, and data analysis.
The problem at hand is to take a list of 50K+ part numbers with descriptions and determine their corresponding product types from the description alone, without prior knowledge of the product types. The product descriptions may not follow consistent naming rules or a fixed word order.
We will use the stringr package for string manipulation, including regular expressions (RegExp), and the qdap package to find the frequencies of different words in the description column.
Step 1: Understanding Regular Expressions
Regular expressions (regex) are a way to describe patterns in strings using special characters. In R’s stringr package, regex can be used for text manipulation.
The pattern we want to match is the product type based on common words that appear together in the description. This may involve matching multiple consecutive sequences of similar words, which requires a specific approach.
One key point here is consecutive word matching, where the goal is not just to find a single sequence but one or more identical words right after each other.
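One way to read "consecutive word matching" is as a search for a word that is immediately repeated. The sketch below is an illustration of that reading rather than part of the original strategy; the sample strings and the name `repeated_word` are made up for the example. It uses a backreference (`\\1`) to require that the same word appears again right after itself.

```r
library(stringr)

# Hypothetical sample strings, lowercased for simplicity
x <- c("water water bottle", "water bottle", "bar bar bar protein")

# \\b(\\w+)  captures a whole word
# (\\s+\\1\\b)+  requires that same word to follow one or more times
repeated_word <- "\\b(\\w+)(\\s+\\1\\b)+"

has_repeat <- str_detect(x, repeated_word)   # which strings contain a repeat
repeats    <- str_extract(x, repeated_word)  # the repeated run itself, or NA
```

Strings without an immediate repetition, such as "water bottle", yield `NA` from `str_extract()`.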
Another challenge is dealing with cases where words are capitalized and lowercased differently (e.g., “Water” vs. “water”), so we need to make sure our regex pattern accounts for case differences.
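As a minimal sketch of the two usual ways to handle case differences in stringr: either lowercase the text before matching, or leave the text alone and wrap the pattern in `regex(ignore_case = TRUE)`. The sample vector below is invented for illustration.

```r
library(stringr)

# Hypothetical descriptions with mixed capitalization
descriptions <- c("Water Bottle", "Contain water", "WATERBotHold", "Candy Energy")

# Option 1: normalise the text, then match a lowercase pattern
hits_lower <- str_detect(str_to_lower(descriptions), "water")

# Option 2: keep the text as-is and make the pattern case-insensitive
hits_icase <- str_detect(descriptions, regex("water", ignore_case = TRUE))

# With option 2, the extracted text keeps its original capitalization
found <- str_extract(descriptions, regex("water", ignore_case = TRUE))
```

Both options flag the same rows. The difference is what `str_extract()` returns: option 1 always gives lowercase matches, which is convenient when the match itself becomes the group label.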
Step 2: Handling Data Preparation
First, we’ll load the necessary packages (stringr and qdap, plus tidyverse for tribble()) and prepare our data. We create a small example dataset with part numbers and descriptions, as shown below.
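library(tidyverse)  # provides tribble(); also attaches stringr
library(stringr)    # string manipulation and regex helpers
library(qdap)       # provides freq_terms()

# Example data: every row needs a value for all three columns,
# so ProductType starts as an empty string and is filled in later
df <- tribble(
  ~PartNo,   ~Description,   ~ProductType,
  "A000443", "Water Bottle", "",
  "A000445", "Contain Water", "",
  "A000448", "WaterBotHold", "",
  "HRZ55",   "Hershey_Bar",  "",
  "RRB55",   "Candy Energy", "",
  "QMU55",   "Bar Protein",  ""
)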
Step 3: Finding Frequencies and Creating Regex Pattern
To find the most frequent words across the descriptions, we use the qdap package’s freq_terms() function. This returns a data frame of the most frequent terms found across the text column (the top 20 by default), along with their frequency counts.
We’re interested only in the top N (say 2) most frequent words. These will help form our regex pattern to match similar descriptions.
# Find frequencies of different words
freq <- freq_terms(df$Description)
# Select top n words and create regex pattern
word_to_search <- paste0(freq$WORD[1:2], collapse = "|")
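If qdap is not available, the same idea can be approximated with stringr and base R alone. This is a sketch under assumptions, not a drop-in replacement for `freq_terms()`: it tokenises on runs of letters, which may differ from how qdap splits words, and the names `tokens`, `counts`, `top_words`, and `pattern` are chosen for the example.

```r
library(stringr)

descriptions <- c("Water Bottle", "Contain Water", "WaterBotHold",
                  "Hershey_Bar", "Candy Energy", "Bar Protein")

# Split each lowercased description into runs of letters, so that
# "hershey_bar" yields "hershey" and "bar" (an assumed tokenisation)
tokens <- unlist(str_extract_all(str_to_lower(descriptions), "[a-z]+"))

# Count each token and sort from most to least frequent
counts <- sort(table(tokens), decreasing = TRUE)

# Keep the two most frequent tokens and join them into an alternation
top_words <- names(counts)[1:2]
pattern <- paste0(top_words, collapse = "|")

# Apply the pattern to the lowercased descriptions
matched <- str_extract(str_to_lower(descriptions), pattern)
```

Note that "waterbothold" stays a single token here, so it does not add to the count for "water"; it is still picked up later because the pattern matches substrings.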
Step 4: String Matching with Regex
Using our word_to_search regex pattern, we apply it to the description column of our data frame (df) after converting all descriptions to lowercase. This ensures that the same word is recognized regardless of capitalization.
# Lowercase the descriptions for matching only, leaving the stored column unchanged
desc_lower <- str_to_lower(df$Description)
# Extract the first matching keyword as the product type (NA when nothing matches)
df$ProductType <- str_extract(desc_lower, word_to_search)
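One design choice worth spelling out is that the pattern is left unanchored, so it matches inside longer tokens. The sketch below contrasts that with a whole-word variant using `\\b`; the sample strings and the names `loose` and `strict` are illustrative assumptions.

```r
library(stringr)

# Lowercased sample descriptions
desc_lower <- c("water bottle", "waterbothold", "hershey_bar", "bar protein")

# Unanchored alternation: matches the keyword anywhere, even inside a token
loose <- str_extract(desc_lower, "water|bar")

# Whole-word variant: \\b requires a word boundary on both sides.
# The underscore counts as a word character, so "hershey_bar" has no
# boundary before "bar" and does not match.
strict <- str_extract(desc_lower, "\\b(water|bar)\\b")
```

For run-together descriptions like "WaterBotHold" or "Hershey_Bar", the unanchored form is what lets the keyword be found, at the cost of possible false positives on longer words that merely contain it.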
Results
Our final step is to view the updated data frame with the newly calculated ProductType column, based on our stringr and qdap strategy.
print(df)
#   PartNo   Description    ProductType
# 1 A000443  Water Bottle   water
# 2 A000445  Contain Water  water
# 3 A000448  WaterBotHold   water
# 4 HRZ55    Hershey_Bar    bar
# 5 RRB55    Candy Energy   <NA>         # Didn't match with Water/Bar
# 6 QMU55    Bar Protein    bar
As you can see, the strategy successfully identifies “water” and “bar” as product types based on descriptions that include these words. There is one case where our method didn’t yield a result (RRB55), which could be handled by further tweaking the regex pattern or by implementing additional logic to address such cases.
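As one possible form of that additional logic, unmatched rows can be given an explicit placeholder label and collected for a second pass. This is a hedged sketch: the label "unclassified" and the names `product_type` and `review_queue` are hypothetical and do not come from the original approach.

```r
library(stringr)

# Small sample with one description that matches neither keyword
descriptions <- c("Water Bottle", "Candy Energy", "Bar Protein")
pattern <- "water|bar"

# Same extraction step as before, on lowercased text
product_type <- str_extract(str_to_lower(descriptions), pattern)

# Rows where no keyword was found
unmatched <- is.na(product_type)

# Collect the unmatched descriptions so they can be reviewed or
# fed into another round of frequency analysis
review_queue <- descriptions[unmatched]

# Replace NA with an explicit placeholder label (hypothetical choice)
product_type[unmatched] <- "unclassified"
```

On a 50K+ row dataset, rerunning the frequency step on `review_queue` alone would surface the next most common words among the rows that are still ungrouped.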
Conclusion
This guide has walked you through how to group similar expressions in a dataset without prior knowledge of those groups using R’s stringr and qdap packages. By understanding regular expressions, creating appropriate patterns for your data, and leveraging frequency analysis from the qdap package, we can solve real-world problems like this efficiently.
Last modified on 2023-07-11