How to Read Multiple Excel Files and Create a New Spreadsheet with Combined Data Using Python

Reading Multiple Excel Files and Creating a New Spreadsheet with Combined Data

As an individual or organization, having multiple Excel files scattered across different folders can be a challenge. Imagine needing to access specific data from each file, such as names and emails, and then combine them into a single spreadsheet for easier analysis or reporting. In this article, we will explore how to read multiple Excel files, extract specific data, and save it to a new Excel file using Python.

Overview of the Problem

We have multiple Excel files in a folder with different names, and we want to copy the same cell (e.g., B3) from all files and create a column in a new Excel file. We also need to import names and emails from these files and save them to a single new Excel file.

Solution Overview

To solve this problem, we will use Python’s glob module to iterate through the folder and find all Excel files. We’ll then use the pandas library to read each file, extract the desired data (in this case, specific cells), combine the data into a single series, and finally save it to a new Excel file.

Step 1: Setting Up Our Python Environment

To start working on this project, you will need to have Python installed on your computer. You can download the latest version from the official Python website if you haven’t already. Additionally, make sure you have the pandas and glob libraries installed. The pandas library is a powerful data analysis tool that allows us to read and manipulate Excel files.

# Install required libraries
pip install pandas glob

Step 2: Importing Libraries and Setting Up Variables

We will import the necessary libraries, set up some variables for our folder path and file name, and then define a function to handle the iteration through all Excel files.

import pandas as pd
import glob
from pathlib import Path

# Define the folder path and file name
folder_path = "/path/to/your/folder"
file_name = "new_file.xlsx"

# Set up variables for cell extraction (e.g., row, column)
row_number = 3
column_number = 5

# Initialize a list to store extracted data
cells = []

Step 3: Iterating Through Excel Files and Extracting Data

Now that we have our libraries imported, folder path defined, file name set up, and variables for cell extraction ready, it’s time to start iterating through all Excel files in the specified folder.

# Loop through all Excel files in the folder
for f in glob.glob(f"{folder_path}/*.xlsx"):
    # Read the current Excel file into a DataFrame
    data = pd.read_excel(f, 'Sheet1')
    
    # Extract specific cell and append it to our list
    cells.append(data.iloc[row_number-1, column_number-1])

Step 4: Combining Data and Saving to New Excel File

With all data extracted from the individual files, we can now create a new DataFrame that combines these values into one series. We’ll use pd.Series to convert our list of values into a pandas Series.

# Combine data into a single Series
series_data = pd.Series(cells)

# Save this combined data to a new Excel file
series_data.to_excel(file_name, index=False)

Step 5: Explanation and Additional Considerations

What if I Want to Copy Multiple Cells from Each File?

To extract multiple cells, simply replace the iloc[row_number-1, column_number-1] line with your desired cell extraction syntax.

cells.append(data.iloc[3,6])

You could also add more complex logic by checking one cell’s value and then deciding which other cell to look at next based on that value. This would require additional code for conditional statements (e.g., if, elif).

Using Other File Formats

The above example is specifically designed to work with Excel files in the .xlsx format using the pandas library. If you need to handle other file formats, consider using libraries like openpyxl or xlrd for reading and writing Excel files.

For instance, if working with older Excel files that use the .xls format:

data = pd.read_excel(f, 'Sheet1', engine='xlrd')

And don’t forget to install any additional required libraries (openpyxl, xlrd) by running:

pip install openpyxl xlrd

Handling Folder and File Path Issues

If the folder path contains spaces or is not accessible, make sure to enclose it in quotes when defining folder_path. Also, check that you have permission to access this directory.

folder_path = r"C:\Users\YourUsername\Documents\folder_with_excel_files"

In case of permission errors or file not found issues during iteration through files, consider adding error handling mechanisms within your loop (try-except blocks).

Conclusion

This guide demonstrated how to read multiple Excel files in a specified folder, extract specific data from each file (e.g., cell B3), combine the extracted values into a new DataFrame, and save this combined data to a single new Excel file. Python’s pandas library is used for its ability to handle complex data analysis tasks efficiently.

While our example covered basic operations, there are numerous possibilities for expanding upon this process (e.g., handling multiple sheets within a spreadsheet, extracting more specific information from each file). This should provide you with the foundation needed to tackle various Excel-related tasks involving Python programming.


Last modified on 2024-07-18