Counting Two-Word Combinations in Text Data with Python

Introduction

In this article, we will explore how to count how often two-word combinations occur across the rows of a text column using Python and pandas. The task is a form of bigram tokenization: splitting each sentence into pairs of consecutive words.

We’ll walk through a step-by-step approach, starting from preparing our data, cleaning it up, and then counting the frequency of two-word combinations.

Preparing the Data

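To start, you need a pandas DataFrame containing your text data. In the example below, each sentence is repeated once for every two-word combination it contains, so each row pairs a Sentence with one entry in words.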

import pandas as pd

data = pd.DataFrame({
    'Sentence': [
        'beautiful day suffered through',
        'beautiful day suffered through',
        'beautiful day suffered through',
        'cannot hold back tears',
        'cannot hold back tears',
        'cannot hold back tears',
        'ash back tears beautiful day',
        'ash back tears beautiful day',
        'ash back tears beautiful day',
        'ash back tears beautiful day'
    ],
    'words': [
        'beautiful day',
        'day suffered',
        'suffered through',
        'cannot hold',
        'hold back',
        'back tears',
        'ash back',
        'back tears',
        'tears beautiful',
        'beautiful day'
    ]
})
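The words column does not have to be typed by hand. As a minimal sketch, assuming the raw input is one sentence per row, the pairs can be generated by zipping each token list with itself shifted by one and then expanding the resulting lists into rows. The helper name make_bigrams and the sentences DataFrame are illustrative and not part of the original code.

```python
import pandas as pd

def make_bigrams(sentence):
    """Return consecutive word pairs, e.g. 'a b c' -> ['a b', 'b c']."""
    tokens = sentence.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Hypothetical raw input: one sentence per row
sentences = pd.DataFrame({
    "Sentence": [
        "beautiful day suffered through",
        "cannot hold back tears",
        "ash back tears beautiful day",
    ]
})

# One list of bigrams per sentence, then one row per bigram
pairs = (
    sentences
    .assign(words=sentences["Sentence"].apply(make_bigrams))
    .explode("words")
    .reset_index(drop=True)
)
print(pairs)
```

For these three sentences the result has ten rows, with the same Sentence and words layout as the hand-built DataFrame above.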

Cleaning the Data

Before we can start counting, we need to clean up our data by removing single quotes and stripping leading and trailing whitespace from both the Sentence and words columns.

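# Remove single quotes from every column (literal match, not a regex)
data = data.apply(lambda x: x.str.replace("'", "", regex=False))
# Strip leading and trailing whitespace
data["Sentence"] = data["Sentence"].str.strip()
data["words"] = data["words"].str.strip()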
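The example data above is already tidy, so the effect of this step is easier to see on messy input. The sketch below uses a made-up one-row DataFrame (raw is a hypothetical name) with stray quotes and padding, of the kind that can appear when text is loaded from a file.

```python
import pandas as pd

# Hypothetical messy input with stray quotes and padding
raw = pd.DataFrame({
    "Sentence": [" 'cannot hold back tears' "],
    "words": ["'hold back' "],
})

# Drop single quotes, then trim leading/trailing whitespace in every column
clean = raw.apply(lambda col: col.str.replace("'", "", regex=False).str.strip())
print(clean.iloc[0].tolist())  # ['cannot hold back tears', 'hold back']
```

In this sample, the quoted pair would not be found inside the uncleaned sentence, which is why cleaning comes before counting.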

Setting Data Types

We also make sure that the Sentence and words columns hold strings, since the counting step below calls str.count on each value.

data = data.astype({"Sentence": str, "words": str})

Counting Two-Word Combinations

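To count how often the two-word combination in words occurs in the Sentence on the same row, we’ll create a function called words_in_sent. Note that str.count counts non-overlapping substring matches, so it does not check word boundaries.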

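def words_in_sent(row):
    # Number of non-overlapping occurrences of the pair in the sentence
    return row["Sentence"].count(row["words"])

# axis=1 passes one row at a time to the function
data["words_occur"] = data.apply(words_in_sent, axis=1)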
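If whole-word matching matters for your data, a regular expression is one option. The sketch below is an alternative to the approach above, not a replacement the original prescribes; count_phrase is a hypothetical helper that only counts a phrase when it is not directly preceded or followed by a word character.

```python
import re

def count_phrase(sentence, phrase):
    """Count occurrences of phrase that start and end on word boundaries."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return len(re.findall(pattern, sentence))

# Plain substring counting finds "ash back" inside "splash backs"
print("splash backs".count("ash back"))                          # 1
# The boundary-aware version does not
print(count_phrase("splash backs", "ash back"))                  # 0
# A genuine match is still counted
print(count_phrase("ash back tears beautiful day", "ash back"))  # 1
```

To use it with the DataFrame, the body of words_in_sent could call count_phrase(row["Sentence"], row["words"]) instead of str.count.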

Grouping and Summing

Finally, we group the rows by the two-word combination in the words column and sum the per-row counts. Using transform("sum") writes each group's total back onto every row of that group.

data["total"] = data["words_occur"].groupby(data["words"]).transform("sum")
print(data)

The output will give us a summary of how many times each two-word combination appears in our sentences.

Result

Here is what we get after running the code:

                         Sentence             words  words_occur  total
0  beautiful day suffered through     beautiful day            1      2
1  beautiful day suffered through      day suffered            1      1
2  beautiful day suffered through  suffered through            1      1
3          cannot hold back tears       cannot hold            1      1
4          cannot hold back tears         hold back            1      1
5          cannot hold back tears        back tears            1      2
6    ash back tears beautiful day          ash back            1      1
7    ash back tears beautiful day        back tears            1      2
8    ash back tears beautiful day   tears beautiful            1      1
9    ash back tears beautiful day     beautiful day            1      2

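In this result, words_occur is 1 in every row because each pair occurs once in its own sentence, while total adds up those counts across all rows that share the same pair. For example, beautiful day and back tears each appear in two different sentences, so their total is 2.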
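Because the totals are repeated on every row, a compact view with one row per pair can be easier to read. The following is a minimal sketch that aggregates instead of transforming; it rebuilds only the words and words_occur columns as a stand-in for the full DataFrame, and the name summary is illustrative.

```python
import pandas as pd

# Stand-in for the article's data: just the two columns needed here
data = pd.DataFrame({
    "words": [
        "beautiful day", "day suffered", "suffered through",
        "cannot hold", "hold back", "back tears",
        "ash back", "back tears", "tears beautiful", "beautiful day",
    ],
    "words_occur": [1] * 10,
})

# Aggregate instead of transform: one row per distinct pair
summary = (
    data.groupby("words", sort=False)["words_occur"]
    .sum()
    .reset_index(name="total")
    .sort_values("total", ascending=False)
)
print(summary)
```

The summary has eight rows, one per distinct pair, with the two pairs that occur twice listed first.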

Conclusion

By following these steps, we have counted the frequency of two-word combinations across the rows of a column using Python and pandas. The same approach applies to any DataFrame that pairs each text with the phrase to look for, keeping in mind that str.count matches substrings rather than whole words.


Last modified on 2023-11-09