Creating Structured Data Frame from Multiple Arrays and Lists Using Pandas Library

In this article, we will explore how to create a structured data frame using multiple arrays and lists in Python. We’ll use the pandas library to achieve this.

Introduction

When working with large datasets, it’s common to have multiple arrays or lists that need to be combined into a single structure. This can be especially challenging when the inputs differ in data type and shape, for example lists of labels alongside two-dimensional arrays of values.

Sample Data

To illustrate this concept, let’s consider an example where we have three lists and two NumPy arrays:

  • distributors: a list of distributor names
  • products: a list of product names
  • tips: a list of product type labels (e.g., “fruit”, “vegetables”, etc.)
  • actual_prix and prix_prox_year: two 15 × 5 arrays that hold the actual prices and the predicted prices, with one row per product and one column per distributor

import numpy as np

distributors = ['d1', 'd2', 'd3', 'd4', 'd5']
products = ['apple', 'carrot', 'potato', 'avocado', 'pumkie', 'banana',
            'kiwi', 'lettuce', 'tomato', 'pees', 'pear', 'berries', 'strawberries',
            'blueberries', 'boxes']
tips = ['fruit', 'vegetables', 'random']

actual_prix = np.arange(15*5).reshape(15,5)
prix_prox_year = np.random.rand(15,5)

Creating the Data Frame

To create a structured data frame from these arrays and lists, we can use the product function from the itertools module in Python’s standard library. It returns the Cartesian product of the input iterables, that is, every possible (product, type, distributor) tuple, with the rightmost iterable varying fastest.

from itertools import product
import pandas as pd

df = (pd.DataFrame([*product(products, tips, distributors)],
                   columns=['Products', 'Type', 'Distributor'])
        .assign(Actual = np.tile(actual_prix, len(tips)).ravel(),
                Next_year = np.tile(prix_prox_year, len(tips)).ravel()))

Here’s a breakdown of what happens in the code:

  1. We import the necessary names: product from itertools, pandas for data manipulation, and numpy (imported earlier with the sample data) for numerical operations.
  2. We define the input lists and arrays.
  3. We use the product function to generate every combination of product name, type, and distributor name, 15 × 3 × 5 = 225 tuples in total.
  4. We pass these tuples to the pd.DataFrame() constructor and name the columns through its columns argument.
  5. We chain .assign() to add the Actual and Next_year price columns.
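
The order in which product emits its tuples in step 3 determines the row order of the data frame. The following is a minimal sketch on hypothetical miniature lists (small_products, small_tips, and small_distributors are illustrative names, not part of the original data) that makes this order visible:

```python
from itertools import product

# Hypothetical miniature inputs, chosen only to keep the output short.
small_products = ['apple', 'carrot']
small_tips = ['fruit', 'vegetables']
small_distributors = ['d1', 'd2']

# product() behaves like nested for-loops: the rightmost iterable
# (distributors) changes fastest, the leftmost (products) slowest.
combos = list(product(small_products, small_tips, small_distributors))

# ('apple', 'fruit', 'd1') comes first, then ('apple', 'fruit', 'd2').
for combo in combos:
    print(combo)
```

Because the distributor changes fastest, each consecutive block of rows covers all distributors for one product and one type. Any values added afterwards have to follow that same order.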

Assigning Additional Columns

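The .assign() call at the end of the expression completes the data frame with two columns, Actual and Next_year, which hold the actual and predicted prices. Each price array has one row per product and one column per distributor, so np.tile repeats its columns once per entry in tips, and ravel() then flattens the result row by row. The flattened values therefore follow the same order as the rows generated by product. For reference, here is the full expression again: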

df = (pd.DataFrame([*product(products, tips, distributors)],
                   columns=['Products', 'Type', 'Distributor'])
        .assign(Actual = np.tile(actual_prix, len(tips)).ravel(),
                Next_year = np.tile(prix_prox_year, len(tips)).ravel()))
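
To see what np.tile followed by ravel() does to a price matrix, here is a minimal sketch on a hypothetical 2 × 2 array (the names prices and n_tips are illustrative and do not appear in the original code):

```python
import numpy as np

# Hypothetical price matrix: rows are products, columns are distributors.
prices = np.array([[10, 11],
                   [20, 21]])
n_tips = 3  # stands in for len(tips)

# With an integer reps, np.tile repeats a 2-D array along its last axis,
# so the shape goes from (2, 2) to (2, 6).
tiled = np.tile(prices, n_tips)

# ravel() flattens row by row: the first product's prices appear
# once per tip before the second product's prices start.
flat = tiled.ravel()

print(tiled.shape)    # (2, 6)
print(flat.tolist())  # [10, 11, 10, 11, 10, 11, 20, 21, 20, 21, 20, 21]
```

The flattened order is product, then type, then distributor, which is the same nesting that product(products, tips, distributors) uses for the rows.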

Printing the Data Frame

Finally, we can print the resulting data frame to verify its contents.

print(df)

The output will be a structured data frame with all five columns: Products, Type, Distributor, Actual, and Next_year.
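
Printing shows only the first and last rows, so a few programmatic checks can help as well. The sketch below rebuilds the data frame from the sample data and then verifies its shape and the alignment of the price columns. The checks themselves (the reshape into a cube and the spot check on kiwi) are an illustrative addition, not part of the original walkthrough:

```python
from itertools import product

import numpy as np
import pandas as pd

distributors = ['d1', 'd2', 'd3', 'd4', 'd5']
products = ['apple', 'carrot', 'potato', 'avocado', 'pumkie', 'banana',
            'kiwi', 'lettuce', 'tomato', 'pees', 'pear', 'berries',
            'strawberries', 'blueberries', 'boxes']
tips = ['fruit', 'vegetables', 'random']

actual_prix = np.arange(15 * 5).reshape(15, 5)
prix_prox_year = np.random.rand(15, 5)

df = (pd.DataFrame(list(product(products, tips, distributors)),
                   columns=['Products', 'Type', 'Distributor'])
        .assign(Actual=np.tile(actual_prix, len(tips)).ravel(),
                Next_year=np.tile(prix_prox_year, len(tips)).ravel()))

# One row per (product, type, distributor) combination: 15 * 3 * 5 = 225.
assert df.shape == (225, 5)

# Reshape each price column into (product, type, distributor) and confirm
# that every type slice reproduces the original 15 x 5 matrix.
dims = (len(products), len(tips), len(distributors))
for column, source in [('Actual', actual_prix), ('Next_year', prix_prox_year)]:
    cube = df[column].to_numpy().reshape(dims)
    for t in range(len(tips)):
        assert np.array_equal(cube[:, t, :], source)

# Spot check a single row against the source matrix.
row = df[(df['Products'] == 'kiwi')
         & (df['Type'] == 'random')
         & (df['Distributor'] == 'd3')]
kiwi_d3 = row['Actual'].item()
assert kiwi_d3 == actual_prix[products.index('kiwi'), distributors.index('d3')]
```

If any of these assertions fails, the most likely cause is a price array whose shape does not match (len(products), len(distributors)).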

Example Output

Here’s an example of what the first rows of the output might look like. The Next_year values come from np.random.rand, so they will differ on every run:

Products  Type   Distributor  Actual  Next_year
apple     fruit  d1           0       0.391903
apple     fruit  d2           1       0.378865
apple     fruit  d3           2       0.056134
apple     fruit  d4           3       0.623146
apple     fruit  d5           4       0.879184

… (and so on for all combinations of products, tips, and distributors)

Conclusion

In this article, we demonstrated how to create a structured data frame from multiple arrays and lists in Python. itertools.product generated every combination of product, type, and distributor, pandas turned those combinations into rows, and np.tile with ravel() aligned the two price arrays with them. You can follow the same steps whenever label lists and value arrays need to be combined into a single data frame.


Last modified on 2024-05-02