How to Scrape a Website That Contains Multiple Tables and Convert Them into a Workable DataFrame Using Beautiful Soup and Pandas

Web Scraping and Data Analysis with Beautiful Soup and Pandas

In this article, we will explore how to scrape a website that contains multiple tables and convert them into a workable DataFrame using Python’s Beautiful Soup library for web scraping and the Pandas library for data manipulation.

Understanding Web Scraping

Web scraping is the process of automatically extracting data from websites. It involves programmatically navigating a website, locating the desired data, and extracting it. Beautiful Soup is a powerful library that parses HTML and XML documents, making websites much easier to scrape.
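As a quick illustration of what Beautiful Soup does (the HTML string below is made up for the example, not taken from any real site), it turns markup into a searchable tree:

```python
from bs4 import BeautifulSoup

# A tiny inline HTML document standing in for a real web page
html = "<html><body><p class='greeting'>hola</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# find() locates the first matching tag; get_text() extracts its text
print(soup.find("p").get_text())  # hola
```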

Understanding Pandas DataFrames

A Pandas DataFrame is a two-dimensional table of data with rows and columns. It provides a convenient way to store and manipulate structured data in Python. DataFrames are similar to Excel spreadsheets or SQL tables.
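For example (the vocabulary below is illustrative, not from the site), a DataFrame can be built from a dictionary mapping column names to lists of values:

```python
import pandas as pd

# Two named columns, three rows of sample vocabulary
df = pd.DataFrame({"Spanish": ["hola", "gracias", "adiós"],
                   "English": ["hello", "thank you", "goodbye"]})

print(df.shape)  # (3, 2): three rows, two columns
```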

Scrape Three Tables from the Website

The website we will be scraping contains three tables, each containing 250 Spanish words and their translations into English. The URL of the website is https://www.happyhourspanish.com/learning-efficiently-start-with-the-250-most-common-spanish-words/.

## Step 1: Import Required Libraries

To scrape the website, we need to import Beautiful Soup for parsing, `pandas` for data manipulation, and the `requests` library for sending HTTP requests.
```python
import pandas as pd
from bs4 import BeautifulSoup
import requests
```

## Step 2: Send an HTTP Request to the Website

We use the `requests` library to send an HTTP GET request to the website.
```python
url = "https://www.happyhourspanish.com/learning-efficiently-start-with-the-250-most-common-spanish-words/"
response = requests.get(url)
response.raise_for_status()  # stop early if the request failed
```

## Step 3: Parse the HTML Document with Beautiful Soup

Beautiful Soup builds a parse tree from the page's source code, which lets us extract data in a hierarchical, readable way.
```python
soup = BeautifulSoup(response.content, 'html.parser')
```

## Step 4: Find All Tables on the Website

We use the `find_all` method to find all tables on the website.
```python
tables = soup.find_all('table')
```

## Step 5: Convert Each Table into a DataFrame

We use the `read_html` function, which fetches and parses the page itself and returns a list of DataFrames, one per table.
```python
dfs = pd.read_html(url, header=0)
```

Let’s print the DataFrames to see what we have.

## Step 6: Print the DataFrames

We can print the DataFrames using the `print` function.
```python
for df in dfs:
    print(df)
```

However, when we try to save `dfs` to a CSV file, we get an error: `pd.read_html` returns a list of DataFrames, and a list has no `to_csv` method.
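A minimal sketch of why this happens (the two DataFrames below are hypothetical stand-ins for what `pd.read_html` returns from the live page):

```python
import pandas as pd

# pd.read_html returns a plain Python list, one DataFrame per <table>
dfs = [pd.DataFrame({"Spanish": ["hola"], "English": ["hello"]}),
       pd.DataFrame({"Spanish": ["gracias"], "English": ["thank you"]})]

print(type(dfs))               # <class 'list'>
print(hasattr(dfs, "to_csv"))  # False: a list has no to_csv method
```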

## Step 7: Save the DataFrames to a CSV File

To fix this, we use the `pd.concat` function to concatenate all the DataFrames into a single DataFrame.
```python
combined_df = pd.concat(dfs)
```

## Step 8: Save the Combined DataFrame to a CSV File

We use the `to_csv` function to save the combined DataFrame to a CSV file.
```python
combined_df.to_csv("Spanish_Key2.csv", index=False)
```

Alternative Approach: Using Beautiful Soup and Pandas Together

However, indexing the list with `dfs[0]` returns only the first table as a workable DataFrame. To handle all three tables ourselves, we can use Beautiful Soup and Pandas together.
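To make the contrast concrete (again with hypothetical stand-in tables rather than the live page): indexing the list returns a single table, while `pd.concat` stacks them all:

```python
import pandas as pd

# Stand-ins for the list that pd.read_html would return
dfs = [pd.DataFrame({"Spanish": ["hola"], "English": ["hello"]}),
       pd.DataFrame({"Spanish": ["gracias"], "English": ["thank you"]})]

first = dfs[0]                                # only the first table
combined = pd.concat(dfs, ignore_index=True)  # every table, stacked

print(len(first), len(combined))  # 1 2
```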

## Step 9: Use Beautiful Soup and Pandas Together

We can build a DataFrame from the tables ourselves with Beautiful Soup, iterating over every row (`<tr>`) of every table and splitting each row into its cells:
```python
# Collect every row of every table, one list of cell texts per row
rows = [[cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
        for table in tables for tr in table.find_all('tr')]
spanish = pd.DataFrame(rows)
```

## Step 10: Save the DataFrame to a CSV File

We use the `to_csv` function to save the DataFrame to a CSV file.
```python
spanish.to_csv("Spanish_Key.csv", index=False)
```

Conclusion

In this article, we have learned how to scrape a website that contains multiple tables and convert them into a workable DataFrame using Python’s Beautiful Soup library for web scraping and the Pandas library for data manipulation. We also explored alternative approaches to solve the same problem.

Example Use Cases

  • Web scraping
  • Data analysis
  • Machine learning
  • Automation

Code Snippets

Here is the complete code snippet that we have used in this article:

```python
import pandas as pd
from bs4 import BeautifulSoup
import requests

url = "https://www.happyhourspanish.com/learning-efficiently-start-with-the-250-most-common-spanish-words/"
response = requests.get(url)

# Parse the page and locate every table
soup = BeautifulSoup(response.content, 'html.parser')
tables = soup.find_all('table')

# read_html parses the page itself and returns a list of DataFrames
dfs = pd.read_html(url, header=0)
for df in dfs:
    print(df)

# Concatenate the list into a single DataFrame and save it
combined_df = pd.concat(dfs)
combined_df.to_csv("Spanish_Key2.csv", index=False)

# Alternative: build a DataFrame row by row with Beautiful Soup
rows = [[cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
        for table in tables for tr in table.find_all('tr')]
spanish = pd.DataFrame(rows)
spanish.to_csv("Spanish_Key.csv", index=False)
```

Note: The url variable should be replaced with the URL of the website that you want to scrape.


Last modified on 2024-12-29