Extracting Array Pairs from Pandas DataFrames and Creating a Gensim Corpus

Introduction to Pandas DataFrames and Gensim

=====================================================

In this article, we’ll explore how to extract pairs of column name and value from a Pandas DataFrame and turn them into a Gensim corpus. Along the way, we’ll cover the relevant Pandas data structures and operations, and the input format Gensim expects when creating a corpus.

What are Pandas DataFrames?


A Pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It’s similar to an Excel spreadsheet or a table in a relational database. DataFrames are the core data structure used by Pandas, and they provide efficient data manipulation and analysis capabilities.
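As a small illustration of "columns of potentially different types", the sketch below builds a throwaway DataFrame and inspects its labels and per-column dtypes. The people frame and its contents are hypothetical sample data, not part of the example used in the rest of this article.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different type
people = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "age": [36, 45, 28],
    "height_m": [1.65, 1.70, 1.80],
})

# Row labels, column labels, and overall shape
print(people.index.tolist())
print(people.columns.tolist())
print(people.shape)

# Each column carries its own dtype
print(people.dtypes)
```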

Creating a Sample DataFrame


To illustrate our concepts, let’s create a sample DataFrame:

import pandas as pd

data = {
    'a': [0, 4, 8, 2, 1],
    'b': [1, 5, 9, 1, 1],
    'c': [2, 6, 0, 1, 8],
    'd': [3, 7, 1, 5, 9]
}

df = pd.DataFrame(data)
print(df)

Output:

   a  b  c  d
0  0  1  2  3
1  4  5  6  7
2  8  9  0  1
3  2  1  1  5
4  1  1  8  9

Extracting Column Names and Data from a DataFrame


We want to extract pairs of column name and value for every cell whose value is 1. We’ll use NumPy’s where function to achieve this.

Called with only a condition, np.where returns a tuple of arrays, one per dimension. For our 2-dimensional data that means one array of row positions and one array of column positions for the matching cells. These arrays can be used to index into our original DataFrame. (This is not the same as the DataFrame.where method, which returns a DataFrame of the same shape with the non-matching cells replaced by NaN.)

import numpy as np
import pandas as pd

data = {
    'a': [0, 4, 8, 2, 1],
    'b': [1, 5, 9, 1, 1],
    'c': [2, 6, 0, 1, 8],
    'd': [3, 7, 1, 5, 9]
}

df = pd.DataFrame(data)

# Find the row and column positions of every cell equal to 1
rows, cols = np.where(df.to_numpy() == 1)

print(rows)
print(cols)

Output:

[0 2 3 3 4 4]
[1 3 1 2 0 1]
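Pandas and NumPy both offer something called where, and the two are easy to confuse. The sketch below runs both on the sample data so the difference in what they return is visible; the variable names masked and match_count are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [0, 4, 8, 2, 1],
    'b': [1, 5, 9, 1, 1],
    'c': [2, 6, 0, 1, 8],
    'd': [3, 7, 1, 5, 9],
})

# DataFrame.where keeps the frame's shape and replaces non-matching cells with NaN
masked = df.where(df == 1)

# np.where with a single condition returns the positions of the matching cells
rows, cols = np.where(df.to_numpy() == 1)

# Both views agree on how many cells matched
match_count = int(masked.notna().sum().sum())
print(masked.shape, match_count, len(rows))
```

The masked frame is convenient for display, while the position arrays are the more direct route when the goal is to look up column names for each match.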

Creating an Array from Column Names and Data


We can look up the column name for each column position and the value for each (row, column) position, then use the zip function to combine them into a list of tuples containing column names and data.

import numpy as np
import pandas as pd

data = {
    'a': [0, 4, 8, 2, 1],
    'b': [1, 5, 9, 1, 1],
    'c': [2, 6, 0, 1, 8],
    'd': [3, 7, 1, 5, 9]
}

df = pd.DataFrame(data)

# Find the row and column positions of every cell equal to 1
rows, cols = np.where(df.to_numpy() == 1)

# Pair each matching cell's column name with its value using zip
# (tolist() converts the NumPy integers to plain Python ints)
array_pairs = list(zip(df.columns[cols], df.to_numpy()[rows, cols].tolist()))

print(array_pairs)

Output:

[('b', 1), ('d', 1), ('b', 1), ('c', 1), ('a', 1), ('b', 1)]
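A flat list of pairs loses track of which row each pair came from. If each DataFrame row is meant to become one document later on (an assumption, since the article does not say how pairs map to documents), the pairs can be grouped by row position. This is a minimal sketch; pairs_by_row and grouped are hypothetical names.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [0, 4, 8, 2, 1],
    'b': [1, 5, 9, 1, 1],
    'c': [2, 6, 0, 1, 8],
    'd': [3, 7, 1, 5, 9],
})

values = df.to_numpy()
rows, cols = np.where(values == 1)

# Collect (column name, value) pairs under the row position they came from
pairs_by_row = defaultdict(list)
for r, c in zip(rows.tolist(), cols.tolist()):
    pairs_by_row[r].append((df.columns[c], values[r, c].item()))

grouped = dict(pairs_by_row)
print(grouped)
```

Rows without any matching cell (row 1 in the sample data) simply do not appear in the result.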

Converting the Array to a Corpus in Gensim


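To create a corpus in Gensim, we need to convert our pairs into a list of bags-of-words (BOW), where each document is a list of (token id, count) tuples. We can use the Dictionary class from Gensim’s corpora module to map each column name to an integer id, and its doc2bow method to build the BOW for each document. Dictionary expects each document to be a list of string tokens, so we pass only the column names. Here the six pairs are grouped two per document, and because every column name appears once per document, the count that doc2bow produces is 1, matching the value in each pair.

from gensim.corpora import Dictionary

data = [
    [('b', 1), ('d', 1)],
    [('b', 1), ('c', 1)],
    [('a', 1), ('b', 1)]
]

# Dictionary works on lists of string tokens, so keep only the column names
documents = [[name for name, _ in doc] for doc in data]

# Map each distinct column name to an integer id
dictionary = Dictionary(documents)

# Build one list of (token id, count) tuples per document
corpus = [dictionary.doc2bow(doc) for doc in documents]

print(corpus)

Output:

[[(0, 1), (1, 1)], [(0, 1), (2, 1)], [(0, 1), (3, 1)]]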
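To make the shape of a bag-of-words corpus concrete, the sketch below builds one in plain Python: a list of documents, each a list of (token id, count) tuples. It is not Gensim’s implementation. The names token_ids and to_bow are hypothetical, and the id numbering is an arbitrary choice made for illustration.

```python
from collections import Counter

# Each document is a list of string tokens (here, column names)
documents = [['b', 'd'], ['b', 'c'], ['a', 'b']]

# Assign an integer id to each token the first time it is seen
token_ids = {}
for doc in documents:
    for token in sorted(set(doc)):
        token_ids.setdefault(token, len(token_ids))

def to_bow(doc):
    """Return sorted (token id, count) tuples for one document."""
    counts = Counter(doc)
    return sorted((token_ids[token], n) for token, n in counts.items())

bow_corpus = [to_bow(doc) for doc in documents]
print(token_ids)
print(bow_corpus)
```

A token that appears several times in one document would get a count above 1, which is why the counts in a BOW are not necessarily the same thing as the cell values extracted from the DataFrame.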

Conclusion


In this article, we explored how to extract pairs of column name and value from a Pandas DataFrame and convert them into a corpus in Gensim. We used where to locate the cells equal to 1, created a list of pairs using zip, and converted it into a list of BOWs using Gensim’s Dictionary class and its doc2bow method.

Full Code

import numpy as np
import pandas as pd
from gensim.corpora import Dictionary

# Create a sample DataFrame
data = {
    'a': [0, 4, 8, 2, 1],
    'b': [1, 5, 9, 1, 1],
    'c': [2, 6, 0, 1, 8],
    'd': [3, 7, 1, 5, 9]
}

df = pd.DataFrame(data)

# Find the row and column positions of every cell equal to 1
rows, cols = np.where(df.to_numpy() == 1)

print(rows)
print(cols)

# Pair each matching cell's column name with its value using zip
array_pairs = list(zip(df.columns[cols], df.to_numpy()[rows, cols].tolist()))

print(array_pairs)

# Convert the pairs to a corpus in Gensim
data = [
    [('b', 1), ('d', 1)],
    [('b', 1), ('c', 1)],
    [('a', 1), ('b', 1)]
]

# Dictionary works on lists of string tokens, so keep only the column names
documents = [[name for name, _ in doc] for doc in data]

dictionary = Dictionary(documents)

# Build one list of (token id, count) tuples per document
corpus = [dictionary.doc2bow(doc) for doc in documents]

print(corpus)

Last modified on 2024-11-04