How to Create a 3 Column Data Frame Using Pandas for Data Scraping and Analysis

Creating a 3 Column Data Frame in Pandas

In this article, we will explore how to create a data frame with three columns using the pandas library in Python. We will also discuss how to scrape data from a website and fit it into our desired data structure.

Introduction to Pandas

Pandas is a powerful open-source library used for data manipulation and analysis in Python. It provides data structures such as Series (a one-dimensional labeled array) and DataFrame (a two-dimensional labeled data structure with columns of potentially different types).
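
As a minimal illustration of the two structures, the sketch below builds a Series and a DataFrame from literal values. The variable names and the numbers are made up for the example and do not come from the scraped data.

```python
import pandas as pd

# A Series: one-dimensional, labeled values (illustrative data)
goals = pd.Series([3, 5, 2], index=["a", "b", "c"], name="goals")

# A DataFrame: two-dimensional, and each column may have its own dtype
frame = pd.DataFrame({"name": ["x", "y", "z"], "goals": [3, 5, 2]})

print(goals.ndim, frame.ndim)  # number of dimensions of each structure
print(frame.dtypes)            # one dtype per column
```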

The DataFrame is the primary data structure in pandas and is widely used in data science, scientific computing, and data analysis.

Creating a 3 Column Data Frame

To create a 3 column data frame, we can use the DataFrame constructor from pandas. We will define our columns using a list of strings and then create an empty DataFrame with those columns.

import pandas as pd

cols = ["player-name", "League", "team_name"]
df = pd.DataFrame(columns=cols)  # empty frame: three columns, no rows yet
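
An empty frame is only a starting point. One way to add rows to it, sketched here with two rows copied from the output shown later in this article, is label-based assignment with loc. This is an illustration rather than the approach the article ends up using.

```python
import pandas as pd

cols = ["player-name", "League", "team_name"]
df = pd.DataFrame(columns=cols)

# Assigning to a row label that does not exist yet appends a new row
df.loc[len(df)] = ["A. Iwobi", "Premier League", "Fulham"]
df.loc[len(df)] = ["T. Awoniyi", "Premier League", "Nottingham Forest"]

print(df)
```

For more than a handful of rows, collecting them in a list first and passing that list to the DataFrame constructor is usually the faster option.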

Scraping Data from a Website

In this example, we are scraping data from a soccer website. We will use the requests library to send a GET request to the URL and then parse the HTML content using BeautifulSoup.

import requests
from bs4 import BeautifulSoup

url = 'https://ng.soccerway.com/players/players_abroad/nigeria/'
req = requests.get(url, headers={'User-Agent':'Mozilla/5.0'})
page = req.content
soup = BeautifulSoup(page, 'html.parser')
table_data = soup.find_all('td')
player_data_list=[data.text.strip() for data in table_data]
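
The result is a flat list of cell texts. If every table row contributed exactly three cells, the list could be sliced into rows of three and handed to the DataFrame constructor, as in the sketch below. That layout is an assumption made for illustration: the sample list stands in for player_data_list, and the real page is not guaranteed to follow it.

```python
import pandas as pd

cols = ["player-name", "League", "team_name"]

# Stand-in for player_data_list: a flat list of cell texts (sample values)
flat = [
    "A. Iwobi", "Premier League", "Fulham",
    "T. Awoniyi", "Premier League", "Nottingham Forest",
]

# Slice the flat list into consecutive chunks of len(cols) cells
n = len(cols)
rows = [flat[i:i + n] for i in range(0, len(flat), n)]

df = pd.DataFrame(rows, columns=cols)
print(df)
```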

Post-Processing the Scraped Data

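After scraping the data, we need to post-process it to fit our desired data structure. The flat list of cell texts built above loses the table's row structure, so instead of reshaping it by hand we let pandas parse the page. The read_html function returns a list of DataFrames, one per HTML table it finds (it needs an HTML parser such as lxml to be installed). We take the first table and reshape it with a short chain of DataFrame methods.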

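from io import StringIO

# Fetch the page again; wrapping the HTML in StringIO avoids passing a literal string to read_html
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
tmp = pd.read_html(StringIO(html))[0]
df = (
    tmp.T.reset_index().T  # push the incorrect 'England' header down into the first data row
        .assign(country=lambda x: x.pop(3).str.split(".").str[0].ffill())  # country from column 3
        .iloc[1:]  # drop the demoted header row now that its country has been carried down
        .loc[tmp.iloc[:, -1].isna()]  # keep rows whose last column in the original table is empty
        .set_axis(cols + ["country"], axis=1)
)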
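
The transpose, reset_index, transpose step is the least obvious part of the chain. The sketch below reproduces it on a toy table (the cell values are made up) to show how column labels that really hold data are turned into an ordinary first row.

```python
import pandas as pd

# Toy table whose header row actually holds data (illustrative values)
tmp = pd.DataFrame([["b1", "b2"], ["c1", "c2"]], columns=["a1", "a2"])

# Transpose, move the old column labels into a regular column,
# then transpose back: the labels become the first data row.
shifted = tmp.T.reset_index().T

print(shifted)
```

After this step the columns are labeled with integers, which is why the chain above can refer to the fourth column as 3.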

Output and Explanation

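The final data frame df is printed to the console. It has four columns: the three defined earlier (“player-name”, “League”, and “team_name”) plus a derived country column. The country values come from the fourth column of the parsed table (column 3): the lambda function splits each value at the dot (.), keeps the first part, and forward-fills it so that the rows below each country marker inherit that country. The final filter keeps only the rows whose last column in the original table is empty. Because the kept rows retain their original row labels, the index runs up to 1082 even though only 975 rows remain.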
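
The split and forward-fill step can be tried in isolation. The sketch below uses a made-up column in which a country marker appears only on some rows; the exact text of the markers on the real page is an assumption.

```python
import pandas as pd

# Toy column: a marker appears only on section rows (made-up layout)
raw = pd.Series(["England.Premier League", None, None, "Yemen.Yemeni League", None])

# Keep the text before the first dot, then carry it down to the rows below
country = raw.str.split(".").str[0].ffill()

print(country.tolist())
```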

print(df)

Output:

      player-name          League          team_name  country
0        A. Iwobi  Premier League             Fulham  England
1      T. Awoniyi  Premier League  Nottingham Forest  England
2         O. Aina  Premier League  Nottingham Forest  England
3       F. Onyeka  Premier League          Brentford  England
4       C. Bassey  Premier League             Fulham  England
...           ...             ...                ...      ...
1078   S. Danjuma   Yemeni League      Al Ahli San'a    Yemen
1079  M. Alhassan   Yemeni League    Yarmuk al Rawda    Yemen
1080     A. Nweze   Yemeni League    Yarmuk al Rawda    Yemen
1081  A. Olalekan   Yemeni League      Al Sha'ab Ibb    Yemen
1082     A. Adisa   Yemeni League          Al Urooba    Yemen

[975 rows x 4 columns]
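
With the data in this shape, simple analysis becomes a one-liner. The sketch below rebuilds a three-row sample from the output above (the real frame has 975 rows) and counts players per country and per team; the variable names per_country and per_team are introduced only for this example.

```python
import pandas as pd

# A few rows copied from the output above
df = pd.DataFrame(
    {
        "player-name": ["A. Iwobi", "T. Awoniyi", "S. Danjuma"],
        "League": ["Premier League", "Premier League", "Yemeni League"],
        "team_name": ["Fulham", "Nottingham Forest", "Al Ahli San'a"],
        "country": ["England", "England", "Yemen"],
    }
)

# Number of players per country, largest first
per_country = df["country"].value_counts()

# Number of players per team within each league
per_team = df.groupby(["League", "team_name"]).size()

print(per_country)
print(per_team)
```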

Conclusion

In this article, we have explored how to create a data frame with three columns using pandas and how to scrape data from a website. We have also seen how to post-process a scraped table with read_html and a short chain of DataFrame methods: demoting a misplaced header, deriving a country column, and filtering out the rows that do not belong in the result.

We hope that this article has provided you with a better understanding of working with pandas and scraping data from websites in Python. If you have any questions or need further clarification on any of the concepts discussed in this article, please don’t hesitate to ask.


Last modified on 2023-08-11