Processing Multiple CSV Files in Python Using Multi-Threading

Introduction

In this article, we will explore how to process multiple CSV files in Python, first sequentially and then in parallel. We will cover the basics of working with CSV files, merging them, and calculating totals for specific columns.

Background

Python is an excellent language for data analysis and processing due to its simplicity and extensive libraries. The pandas library is particularly useful for handling CSV files. It provides efficient data structures and operations for data manipulation.

The glob module is used to find all the CSV files in a specified directory. Once found, we can loop through each file, read it into a pandas DataFrame, and concatenate the DataFrames using pd.concat.

Setting Up the Environment

To follow along with this tutorial, you will need:

  • Python installed on your system (preferably the latest version)
  • The pandas library installed (pip install pandas); glob and os are part of Python's standard library
  • A directory containing the CSV files to be processed

Merging Multiple CSV Files

We’ll begin by importing the necessary libraries.

import glob
import os
import pandas as pd

Next, we need to specify the path to our CSV files. We can use glob.glob() to find all CSV files in a specified directory.

mycsvdir = 'C:\\your csv location\\your csv location'

Please replace 'C:\\your csv location\\your csv location' with your actual CSV file path.

We will then select all the CSV files using glob.glob(), which returns a list of paths to our CSV files.

csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))

The '*.csv' part tells glob.glob() to look for files with the .csv extension.
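If you prefer the standard library's pathlib module, the same search can be written with Path.glob. A minimal sketch (the temporary directory and file names here are only for demonstration):

```python
import tempfile
from pathlib import Path

# Create a throwaway directory with two CSV files and one non-CSV file.
tmpdir = tempfile.mkdtemp()
for name in ('a.csv', 'b.csv', 'notes.txt'):
    Path(tmpdir, name).write_text('col\n1\n')

# Path.glob('*.csv') is the pathlib equivalent of
# glob.glob(os.path.join(mycsvdir, '*.csv')); sorting makes the order stable.
csvfiles = sorted(Path(tmpdir).glob('*.csv'))
print([p.name for p in csvfiles])  # ['a.csv', 'b.csv']
```

Note that glob.glob() returns paths in arbitrary order, so sorting is a good habit whenever the processing order matters.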

Now that we have a list of all CSV files in our directory, we can loop through each file and read it into a pandas DataFrame using pd.read_csv(). We will store these DataFrames in a list called dataframes.

dataframes = []
for csvfile in csvfiles:
    df = pd.read_csv(csvfile)
    dataframes.append(df)
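In practice, a directory sometimes contains empty or malformed files. A slightly more defensive version of the loop catches pandas' EmptyDataError and skips such files instead of crashing (the demo directory and file names below are illustrative):

```python
import glob
import os
import tempfile

import pandas as pd

# Build a small demo directory: one valid CSV and one zero-byte file.
mycsvdir = tempfile.mkdtemp()
with open(os.path.join(mycsvdir, 'good.csv'), 'w') as f:
    f.write('items,per_unit_amount,number_of_units\npen,2,10\n')
open(os.path.join(mycsvdir, 'empty.csv'), 'w').close()

dataframes = []
for csvfile in sorted(glob.glob(os.path.join(mycsvdir, '*.csv'))):
    try:
        dataframes.append(pd.read_csv(csvfile))
    except pd.errors.EmptyDataError:
        # A zero-byte file has no header row; skip it rather than fail.
        print('skipping empty file:', csvfile)

print(len(dataframes))  # 1
```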

Concatenating DataFrames

Once we have all the DataFrames, we can concatenate them together using pd.concat(). The ignore_index=True argument tells pandas to discard the original row labels (each file's index starts at 0, so they would otherwise collide) and build a fresh 0-based index for the combined DataFrame.

result = pd.concat(dataframes, ignore_index=True)
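A tiny self-contained example makes the effect of ignore_index visible (the column names below are illustrative):

```python
import pandas as pd

# Two frames whose indices both start at 0, as they would after read_csv.
df1 = pd.DataFrame({'items': ['pen', 'ink'], 'number_of_units': [10, 5]})
df2 = pd.DataFrame({'items': ['pad'], 'number_of_units': [3]})

# Without ignore_index the combined index would be 0, 1, 0 (duplicate labels);
# with ignore_index=True pandas builds a fresh 0..n-1 range.
result = pd.concat([df1, df2], ignore_index=True)
print(list(result.index))  # [0, 1, 2]
```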

Saving the Merged Data

We will now save our merged data to a new CSV file called all.csv.

result.to_csv('all.csv', index=False)

The index=False argument tells pandas not to include the index in the output CSV file.

Calculating Totals for Specific Columns

To calculate totals for specific columns, we can use pd.pivot_table(). We need to specify the columns to group by, the column to aggregate, and the aggregation function: aggfunc='sum'. This last argument matters, because pivot_table's default aggregation is the mean, which would silently average the values instead of totalling them. For this example, let's say we want to sum the total number of units for each item.

dff = pd.read_csv('all.csv')

table = pd.pivot_table(dff, index=['items', 'per_unit_amount'],
                       values='number of units', aggfunc='sum')
print(table)

The index parameter specifies the columns to group by, values names the column to aggregate, and aggfunc='sum' makes pandas total the values rather than average them.
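A small inline example shows the grouping and summing in action (the data is made up to match the column names used above):

```python
import pandas as pd

dff = pd.DataFrame({
    'items': ['pen', 'pen', 'ink'],
    'per_unit_amount': [2, 2, 5],
    'number of units': [10, 15, 4],
})

# The two 'pen' rows share the same group key, so their units are summed.
# With the default aggfunc ('mean') the pen row would show 12.5 instead.
table = pd.pivot_table(dff, index=['items', 'per_unit_amount'],
                       values='number of units', aggfunc='sum')
print(table.loc[('pen', 2), 'number of units'])  # 25
```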

Performance Improvement with Parallel Processing

While the code above merges all CSV files and calculates totals for each item, it processes the files one at a time. To improve performance, we can process them in parallel. One caveat: because of Python's Global Interpreter Lock (GIL), threads cannot run pandas' CPU-bound parsing work in parallel, so the example below uses a pool of worker processes (multiprocessing.Pool) rather than threads.

Each worker process reads one CSV file into a DataFrame and calculates its per-item totals; the per-file results are then combined in the parent process. On a multi-core machine this can be considerably faster than reading the files sequentially.

Example with multiprocessing.Pool

Here’s how you could modify our code to take advantage of multiple cores:

import glob
import os
import pandas as pd
from multiprocessing import Pool

def process_csv(csvfile):
    df = pd.read_csv(csvfile)
    # aggfunc='sum' totals the units; the default ('mean') would average them
    table = pd.pivot_table(df, index=['items', 'per_unit_amount'],
                           values='number of units', aggfunc='sum')
    return table

if __name__ == '__main__':
    mycsvdir = 'C:\\your csv location\\your csv location'

    csvfiles = glob.glob(os.path.join(mycsvdir, '*.csv'))

    with Pool() as pool:
        tables = pool.map(process_csv, csvfiles)

    # combine the per-file tables, then aggregate again so an item that
    # appears in more than one file ends up with a single summed row
    result = pd.concat(tables).groupby(level=['items', 'per_unit_amount']).sum()
    result.to_csv('all.csv')

The Pool class from the multiprocessing module creates a pool of worker processes that can execute tasks concurrently. The map() method applies our process_csv() function to each CSV file in the list and returns a list of results in the same order as the input.

Please note that parallel processing is not always faster than sequential processing. If you have a large number of small CSV files, the overhead of spawning worker processes and pickling the resulting DataFrames back to the parent can outweigh the gains. For workloads that are genuinely I/O-bound, a thread pool can be a better fit than a process pool, since threads share memory and file reads spend much of their time waiting on the disk.
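For the I/O-bound case, here is a minimal sketch using concurrent.futures.ThreadPoolExecutor instead of a process pool (the demo directory, file names, and worker count are only for illustration):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Build a demo directory with two small CSV files.
mycsvdir = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(mycsvdir, 'part%d.csv' % i), 'w') as f:
        f.write('items,number_of_units\npen,%d\n' % (i + 1))

csvfiles = sorted(
    os.path.join(mycsvdir, name)
    for name in os.listdir(mycsvdir) if name.endswith('.csv')
)

# Threads share memory, so no pickling of DataFrames is needed, and
# time spent waiting on the disk can overlap across threads.
with ThreadPoolExecutor(max_workers=4) as executor:
    dataframes = list(executor.map(pd.read_csv, csvfiles))

result = pd.concat(dataframes, ignore_index=True)
print(result['number_of_units'].sum())  # 3
```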


Last modified on 2024-02-23