Creating a Nested List of DataFrames using For Loop and pd.read_excel

Introduction

In this article, we will explore how to create a nested list of DataFrames from multiple Excel files located in different folders. We will use the pandas library for data manipulation and the os library for file system operations.

Background

When a dataset is split across many files, it is often necessary to load all of them before any analysis can start. This can be done with nested loops: the outer loop iterates over the folders, the inner loop iterates over the files in each folder, and the resulting DataFrames are collected into lists.

Keeping the DataFrames grouped, rather than mixing them into one flat list, takes a little care. In this article, we will explore an approach that uses for loops and pd.read_excel to create a nested list of DataFrames.

Step 1: Defining the Folder Paths

The first step in creating a nested list of DataFrames is to define the folder paths where the Excel files are located. We can use the os library to list the files in each folder.

# Importing necessary libraries
import pandas as pd
import os

# Defining the folder paths
path_2016 = 'D:/2016'
path_2017 = 'D:/2017'
path_2018 = 'D:/2018'

# Creating a list of folder paths
path = [path_2016, path_2017, path_2018]
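
Before reading anything, it can help to check which files each folder actually contains. The following is a minimal sketch, assuming the workbooks use the .xlsx or .xls extension. The helper name list_excel_files is hypothetical, and a temporary folder stands in for a real one such as D:/2016 so that the example runs anywhere.

```python
import os
import tempfile

def list_excel_files(folder):
    """Return the full paths of the Excel workbooks in folder, sorted by name."""
    found = []
    for filename in sorted(os.listdir(folder)):
        # Excel creates lock files such as "~$report.xlsx" while a workbook is open
        if filename.startswith("~$"):
            continue
        if filename.lower().endswith((".xlsx", ".xls")):
            found.append(os.path.join(folder, filename))
    return found

# Demonstration with a temporary folder standing in for a real one
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ["b.xlsx", "a.xls", "notes.txt", "~$b.xlsx"]:
        open(os.path.join(demo_folder, name), "w").close()
    demo_names = [os.path.basename(p) for p in list_excel_files(demo_folder)]
    print(demo_names)  # ['a.xls', 'b.xlsx']
```

The text file and the lock file are left out, so only real workbooks would be passed on to pd.read_excel.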

Step 2: Initializing the Empty Lists

Next, we need to initialize two empty lists: df_list and df_list_concat. df_list collects the DataFrames as the files are read, while df_list_concat holds the nested result, a list whose elements are themselves lists of DataFrames.

# Initializing the empty lists
df_list = []
df_list_concat = []

Step 3: Iterating Over Each File and Creating a DataFrame

We will use nested for loops to iterate over each file in each folder. For each file, we will create a new DataFrame using pd.read_excel.

# Iterating over each folder, then over each file in that folder
for folder in path:
    # Starting a fresh inner list for this folder
    df_list = []

    for filename in os.listdir(folder):
        # Creating a new DataFrame from the Excel file
        df = pd.read_excel(os.path.join(folder, filename), skiprows=7)

        # Appending the DataFrame to this folder's list
        df_list.append(df)

    # Appending this folder's list as one element of the nested list
    df_list_concat.append(df_list)

Note that df_list is created anew for each folder and appended to df_list_concat once that folder's files have been read, so each element of df_list_concat is the list of DataFrames for one folder. Appending a list stores a reference to it rather than a copy, so if a single df_list were reused across folders, every element of df_list_concat would point to the same, ever-growing list.
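
The difference between appending a list and appending a copy of it can be seen with plain Python lists. This is a small sketch in which integers stand in for DataFrames; the names inner, by_reference, and by_copy are used only for illustration.

```python
# Plain integers stand in for DataFrames; the list behaviour is the same
inner = []
by_reference = []
by_copy = []

for value in [1, 2, 3]:
    inner.append(value)
    by_reference.append(inner)   # stores a reference to the same list object
    by_copy.append(inner[:])     # stores a shallow copy taken at this moment

print(by_reference)  # [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
print(by_copy)       # [[1], [1, 2], [1, 2, 3]]
```

The slice inner[:] makes a shallow copy: the outer list is new, but any DataFrames inside it would still be the same objects.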

Step 4: Handling Errors and Optimizations

There are several potential issues that can arise when working with nested loops:

Unreadable files: os.listdir returns every entry in a folder, including non-Excel files and subfolders, and a workbook can be corrupt or locked by another program. We should filter by file extension and add error handling to skip the files that cannot be read.
Memory usage: Every DataFrame stays in memory for as long as it is held in a list, so large datasets can exceed the available memory. If each file can be processed on its own, a generator that yields one DataFrame at a time avoids holding them all at once.
Performance overhead: Most of the running time is spent parsing the workbooks in pd.read_excel, not in the loops themselves. Passing usecols or nrows to read only the columns and rows that are needed can reduce it.
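
The error handling described above could look roughly like the following. This is a sketch under stated assumptions, not a definitive implementation: load_folder and read_workbook are hypothetical helper names, the reader is passed in as a parameter so that the logic can be tried without real workbooks, and the broad except clause is a deliberate simplification.

```python
import os

def read_workbook(file_path):
    """Default reader: parse one workbook with pandas, skipping the first 7 rows."""
    import pandas as pd  # imported lazily so the sketch can be tried without pandas
    return pd.read_excel(file_path, skiprows=7)

def load_folder(folder, reader=read_workbook):
    """Return (frames, failures) for one folder.

    frames holds whatever reader returned for each readable workbook;
    failures holds (path, error message) pairs for everything that was skipped.
    """
    frames, failures = [], []
    try:
        filenames = sorted(os.listdir(folder))
    except OSError as exc:  # folder missing or not accessible
        return frames, [(folder, str(exc))]

    for filename in filenames:
        if not filename.lower().endswith((".xlsx", ".xls")):
            continue  # ignore non-Excel files
        file_path = os.path.join(folder, filename)
        try:
            frames.append(reader(file_path))
        except Exception as exc:  # broad on purpose: parser errors vary by engine
            failures.append((file_path, str(exc)))
    return frames, failures
```

Calling load_folder once per folder and appending each frames list to df_list_concat would give the same nested structure as before, with the failures available for logging.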

Conclusion

In conclusion, we have explored an approach for creating a nested list of DataFrames using for loops and pd.read_excel. While this solution works, it has limitations: it stops at the first file that cannot be read, and it keeps every DataFrame in memory. Before relying on it for large or untidy folders, add error handling and consider whether all DataFrames need to be loaded at once.

Code Optimization: Using Generators Instead of Lists

# Importing necessary libraries
import pandas as pd
import os

# Defining the folder paths
path_2016 = 'D:/2016'
path_2017 = 'D:/2017'
path_2018 = 'D:/2018'

# Creating a list of folder paths
path = [path_2016, path_2017, path_2018]

# Using a generator expression to read one folder's Excel files lazily
for folder in path:
    df_generator = (pd.read_excel(os.path.join(folder, filename), skiprows=7)
                    for filename in os.listdir(folder))

    # Each DataFrame is read only when the loop asks for it,
    # and can be released once the loop moves on to the next one
    for df in df_generator:
        print(folder, df.shape)

# Building the nested list (one inner list per folder) keeps every DataFrame in memory
df_list_concat = [[pd.read_excel(os.path.join(folder, filename), skiprows=7)
                   for filename in os.listdir(folder)]
                  for folder in path]

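In this optimized code, we use generator expressions to read Excel files. A generator expression produces each DataFrame only when it is requested, so memory usage stays low as long as each DataFrame is processed and then discarded. Collecting the results back into a list holds every DataFrame in memory again, and generators do not make pd.read_excel itself any faster.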
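
The lazy behaviour of a generator expression can be checked without any Excel files. In this small sketch, fake_read is a hypothetical stand-in for pd.read_excel that records each call, and the file names are invented for the demonstration.

```python
def fake_read(name, log):
    """Stand-in for pd.read_excel: records the call and returns a small list."""
    log.append(name)
    return [name] * 3

filenames = ["jan.xlsx", "feb.xlsx", "mar.xlsx"]
calls = []

# Nothing is read yet: the generator expression only describes the work
frames = (fake_read(name, calls) for name in filenames)
calls_before = len(calls)

# Each next() call (or loop step) reads exactly one file
first = next(frames)
calls_after_one = len(calls)

# Consuming the rest reads the remaining files one at a time
remaining = sum(1 for _ in frames)

print(calls_before, calls_after_one, remaining)  # 0 1 2
```

A generator can be consumed only once, so a second loop over frames would produce nothing; a nested list is still the better fit when the DataFrames need to be revisited.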

By following these guidelines and optimizing your code for better performance, you can create efficient nested lists of DataFrames using pandas and Python.

Last modified on 2023-09-15