Introduction
In this article, we will explore how to count the frequency of two-word combinations across all rows of a column using Python and pandas. The problem belongs to text processing, specifically bigram tokenization, which splits a sentence into pairs of consecutive words.
We’ll walk through the task step by step: preparing the data, cleaning it, setting the data types, and then counting the frequency of each two-word combination.
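To make the term concrete, here is a minimal sketch of bigram tokenization in plain Python. The helper name bigrams is our own choice for illustration (it is not a pandas function), and the sketch assumes words are separated by whitespace.

```python
def bigrams(sentence):
    """Return consecutive two-word combinations, assuming whitespace-separated words."""
    tokens = sentence.split()
    # Pair each token with the token that follows it.
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


print(bigrams("cannot hold back tears"))
# ['cannot hold', 'hold back', 'back tears']
```

A sentence with fewer than two words yields an empty list, since there is nothing to pair.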
Preparing the Data
To start, you need a pandas DataFrame containing your text data. In the example below, each row pairs a sentence with one two-word combination taken from it.
import pandas as pd

data = pd.DataFrame({
    'Sentence': [
        'beautiful day suffered through',
        'beautiful day suffered through',
        'beautiful day suffered through',
        'cannot hold back tears',
        'cannot hold back tears',
        'cannot hold back tears',
        'ash back tears beautiful day',
        'ash back tears beautiful day',
        'ash back tears beautiful day',
        'ash back tears beautiful day'
    ],
    'words': [
        'beautiful day',
        'day suffered',
        'suffered through',
        'cannot hold',
        'hold back',
        'back tears',
        'ash back',
        'back tears',
        'tears beautiful',
        'beautiful day'
    ]
})
Cleaning the Data
Before we can start counting, we need to clean the data by removing single quotes from every column and stripping whitespace from the beginning and end of both Sentence and words.
# Remove single quotes from every column (all columns hold strings here)
data = data.apply(lambda x: x.str.replace("'", ""))
# Trim leading and trailing whitespace
data["Sentence"] = data["Sentence"].str.strip()
data["words"] = data["words"].str.strip()
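The effect of the cleaning step is easier to see on deliberately messy input. The sketch below uses made-up values that are not part of the article's dataset, and passes regex=False so the quote is treated as a literal character.

```python
import pandas as pd

# Hypothetical messy values, made up for illustration only.
raw = pd.Series(["  'beautiful day' ", "back tears  "])

# Drop single quotes (literal match, not a regex), then trim both ends.
cleaned = raw.str.replace("'", "", regex=False).str.strip()

print(cleaned.tolist())
# ['beautiful day', 'back tears']
```

Removing the quotes before stripping matters here: if the order were reversed, any whitespace that sat inside the quotes would survive the cleanup.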
Setting Data Types
We also need to ensure that the Sentence and words columns are of type string.
data = data.astype({"Sentence": str, "words": str})
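One caveat worth sketching: depending on the pandas version, casting a missing value with astype(str) may leave literal text such as 'None' or 'nan' in the column, which would then be searched like any other sentence. A cautious variant, assuming an empty string is an acceptable stand-in for missing text, fills the gaps first. The frame below is hypothetical and only serves the illustration.

```python
import pandas as pd

# Hypothetical frame with a missing sentence, for illustration only.
frame = pd.DataFrame({
    "Sentence": ["hold back", None],
    "words": ["hold back", "back tears"],
})

# Replace missing values with empty strings before casting to str.
frame = frame.fillna("").astype({"Sentence": str, "words": str})

print(frame["Sentence"].tolist())
# ['hold back', '']
```

An empty sentence simply produces a count of zero in the later steps.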
Counting Two-Word Combinations
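To count how many times the two-word combination appears in the sentence on the same row, we’ll create a function called words_in_sent and apply it row by row. Note that str.count matches substrings, so it counts non-overlapping occurrences of the text rather than whole words.
def words_in_sent(row):
    # Count non-overlapping occurrences of the combination in the row's sentence
    return row["Sentence"].count(row["words"])

data["words_occur"] = data.apply(words_in_sent, axis=1)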
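Because str.count works on raw substrings, a combination can match inside longer words. If whole-word matches are what you need, one possible variant uses a regular expression with word boundaries. The helper name count_whole_phrase and the sample sentence are our own, introduced only for this sketch.

```python
import re


def count_whole_phrase(sentence, phrase):
    """Count occurrences of phrase that start and end on a word boundary."""
    # re.escape keeps any special characters in the phrase literal.
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, sentence))


sentence = "aback tears then back tears"
print(sentence.count("back tears"))                # 2 (also matches inside "aback")
print(count_whole_phrase(sentence, "back tears"))  # 1 (whole words only)
```

To use it with the DataFrame, the same function could be called inside data.apply with axis=1, passing row["Sentence"] and row["words"].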
Grouping and Summing
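Finally, we group the rows by the two-word combination in the words column and sum the occurrences within each group. Using transform keeps the original row count, so every row receives the total for its combination.
data["total"] = data.groupby("words")["words_occur"].transform("sum")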
print(data)
The output will give us a summary of how many times each two-word combination appears in our sentences.
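If a compact table with one row per combination is more useful than a per-row total, a plain groupby sum is one option. The sketch below uses a small stand-in frame with the same column names rather than the full dataset.

```python
import pandas as pd

# Small stand-in frame with the same column names as the article's data.
sample = pd.DataFrame({
    "words": ["beautiful day", "back tears", "ash back", "back tears", "beautiful day"],
    "words_occur": [1, 1, 1, 1, 1],
})

# One entry per combination, kept in order of first appearance.
summary = sample.groupby("words", sort=False)["words_occur"].sum()

print(summary.to_dict())
# {'beautiful day': 2, 'back tears': 2, 'ash back': 1}
```

Unlike transform, this collapses the frame to one entry per combination, which is handy for reporting but loses the link back to individual sentences.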
Result
Here is what we get after running the code:
                         Sentence             words  words_occur  total
0  beautiful day suffered through     beautiful day            1      2
1  beautiful day suffered through      day suffered            1      1
2  beautiful day suffered through  suffered through            1      1
3          cannot hold back tears       cannot hold            1      1
4          cannot hold back tears         hold back            1      1
5          cannot hold back tears        back tears            1      2
6    ash back tears beautiful day          ash back            1      1
7    ash back tears beautiful day        back tears            1      2
8    ash back tears beautiful day   tears beautiful            1      1
9    ash back tears beautiful day     beautiful day            1      2
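Here, words_occur is the number of times the combination appears in the sentence on its own row, and total is the sum of words_occur across all rows that share the same combination. For example, beautiful day and back tears each appear on two rows, so their total is 2.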
Conclusion
By following these steps, we have successfully counted the frequency of two-word combinations in all rows of a column using Python. This approach can be applied to any dataset containing text data and multiple word combinations.
Last modified on 2023-11-09