Reading Multiple Excel Files and Creating a New Spreadsheet with Combined Data
As an individual or organization, having multiple Excel files scattered across different folders can be a challenge. Imagine needing to access specific data from each file, such as names and emails, and then combine them into a single spreadsheet for easier analysis or reporting. In this article, we will explore how to read multiple Excel files, extract specific data, and save it to a new Excel file using Python.
Overview of the Problem
We have multiple Excel files in a folder with different names, and we want to copy the same cell (e.g., B3) from all files and create a column in a new Excel file. We also need to import names and emails from these files and save them to a single new Excel file.
Solution Overview
To solve this problem, we will use Python’s glob
module to iterate through the folder and find all Excel files. We’ll then use the pandas
library to read each file, extract the desired data (in this case, specific cells), combine the data into a single series, and finally save it to a new Excel file.
Step 1: Setting Up Our Python Environment
To start working on this project, you will need to have Python installed on your computer. You can download the latest version from the official Python website if you haven’t already. Additionally, make sure you have the pandas
and glob
libraries installed. The pandas
library is a powerful data analysis tool that allows us to read and manipulate Excel files.
# Install required libraries
pip install pandas glob
Step 2: Importing Libraries and Setting Up Variables
We will import the necessary libraries, set up some variables for our folder path and file name, and then define a function to handle the iteration through all Excel files.
import pandas as pd
import glob
from pathlib import Path
# Define the folder path and file name
folder_path = "/path/to/your/folder"
file_name = "new_file.xlsx"
# Set up variables for cell extraction (e.g., row, column)
row_number = 3
column_number = 5
# Initialize a list to store extracted data
cells = []
Step 3: Iterating Through Excel Files and Extracting Data
Now that we have our libraries imported, folder path defined, file name set up, and variables for cell extraction ready, it’s time to start iterating through all Excel files in the specified folder.
# Loop through all Excel files in the folder
for f in glob.glob(f"{folder_path}/*.xlsx"):
# Read the current Excel file into a DataFrame
data = pd.read_excel(f, 'Sheet1')
# Extract specific cell and append it to our list
cells.append(data.iloc[row_number-1, column_number-1])
Step 4: Combining Data and Saving to New Excel File
With all data extracted from the individual files, we can now create a new DataFrame that combines these values into one series. We’ll use pd.Series
to convert our list of values into a pandas Series.
# Combine data into a single Series
series_data = pd.Series(cells)
# Save this combined data to a new Excel file
series_data.to_excel(file_name, index=False)
Step 5: Explanation and Additional Considerations
What if I Want to Copy Multiple Cells from Each File?
To extract multiple cells, simply replace the iloc[row_number-1, column_number-1]
line with your desired cell extraction syntax.
cells.append(data.iloc[3,6])
You could also add more complex logic by checking one cell’s value and then deciding which other cell to look at next based on that value. This would require additional code for conditional statements (e.g., if
, elif
).
Using Other File Formats
The above example is specifically designed to work with Excel files in the .xlsx
format using the pandas
library. If you need to handle other file formats, consider using libraries like openpyxl
or xlrd
for reading and writing Excel files.
For instance, if working with older Excel files that use the .xls
format:
data = pd.read_excel(f, 'Sheet1', engine='xlrd')
And don’t forget to install any additional required libraries (openpyxl
, xlrd
) by running:
pip install openpyxl xlrd
Handling Folder and File Path Issues
If the folder path contains spaces or is not accessible, make sure to enclose it in quotes when defining folder_path
. Also, check that you have permission to access this directory.
folder_path = r"C:\Users\YourUsername\Documents\folder_with_excel_files"
In case of permission errors or file not found issues during iteration through files, consider adding error handling mechanisms within your loop (try-except
blocks).
Conclusion
This guide demonstrated how to read multiple Excel files in a specified folder, extract specific data from each file (e.g., cell B3), combine the extracted values into a new DataFrame, and save this combined data to a single new Excel file. Python’s pandas
library is used for its ability to handle complex data analysis tasks efficiently.
While our example covered basic operations, there are numerous possibilities for expanding upon this process (e.g., handling multiple sheets within a spreadsheet, extracting more specific information from each file). This should provide you with the foundation needed to tackle various Excel-related tasks involving Python programming.
Last modified on 2024-07-18