Creating a 3 Column Data Frame in Pandas
In this article, we will explore how to create a data frame with three columns using the pandas library in Python. We will also discuss how to scrape data from a website and fit it into our desired data structure.
Introduction to Pandas
Pandas is a powerful open-source library used for data manipulation and analysis in Python. It provides data structures such as Series (a one-dimensional labeled array) and DataFrame (a two-dimensional labeled data structure with columns of potentially different types).
The DataFrame is the primary data structure used by pandas, and it is widely used in many fields such as data science, scientific computing, and data analysis.
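To make the two structures concrete, here is a minimal sketch. The labels and values are made up purely for illustration and are not taken from the scraped site.

```python
import pandas as pd

# A Series: one-dimensional, every value carries a label.
ages = pd.Series([25, 31, 28], index=["a", "b", "c"], name="age")

# A DataFrame: two-dimensional, each column may have its own type.
frame = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy"],      # text
        "age": [25, 31, 28],               # integers
        "height_m": [1.70, 1.82, 1.75],    # floats
    }
)

print(ages["b"])      # look up a value by its label
print(frame.dtypes)   # one dtype per column
print(frame.shape)    # (rows, columns)
```

Each column of a DataFrame is itself a Series, so `frame["age"]` behaves much like `ages` above.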
Creating a 3 Column Data Frame
To create a three-column data frame, we can use the DataFrame constructor from pandas. We define the column names as a list of strings and then create an empty DataFrame with those columns.
import pandas as pd
cols = ["player-name", "League", "team_name"]
df = pd.DataFrame(columns=cols)  # no rows yet, only the three column labels
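An empty frame is only a starting point. As a small sketch, the same constructor can also take the rows directly; the two rows below are copied from the output shown later in this article, and the variable names `rows` and `filled` are ours.

```python
import pandas as pd

cols = ["player-name", "League", "team_name"]

# The empty frame has the three labels but zero rows.
empty = pd.DataFrame(columns=cols)
print(empty.shape)

# Passing a list of rows together with the labels fills the frame in one step.
rows = [
    ["A. Iwobi", "Premier League", "Fulham"],
    ["T. Awoniyi", "Premier League", "Nottingham Forest"],
]
filled = pd.DataFrame(rows, columns=cols)
print(filled)
```

Building the frame from a complete list of rows is usually simpler than creating it empty and adding rows one at a time.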
Scraping Data from a Website
In this example, we scrape data from a soccer website. We use the requests library to send a GET request to the URL and then parse the HTML content with BeautifulSoup.
import requests
from bs4 import BeautifulSoup
url = 'https://ng.soccerway.com/players/players_abroad/nigeria/'
# Send a browser-like User-Agent header with the request
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
page = req.content
soup = BeautifulSoup(page, 'html.parser')
# Collect every table cell on the page and keep its stripped text
table_data = soup.find_all('td')
player_data_list = [data.text.strip() for data in table_data]
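Note that player_data_list is a flat list of cell texts, not a table. As a rough sketch of how such a list could be folded into three columns, the snippet below uses a short stand-in list (values copied from the output shown later) and assumes every table row contributes exactly three cells. That assumption may not hold for the real page, which also contains heading cells.

```python
import pandas as pd

cols = ["player-name", "League", "team_name"]

# Stand-in for player_data_list; assumed to hold exactly three cells per row.
cells = [
    "A. Iwobi", "Premier League", "Fulham",
    "T. Awoniyi", "Premier League", "Nottingham Forest",
]

n = len(cols)
usable = len(cells) - len(cells) % n          # ignore a trailing incomplete row
rows = [cells[i:i + n] for i in range(0, usable, n)]

df = pd.DataFrame(rows, columns=cols)
print(df)
```

When rows do not all have the same number of cells, this simple chunking misaligns the columns, which is one reason to let pandas parse the table structure instead.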
Post-Processing the Scraped Data
After scraping the data, we need to post-process it to fit our desired data structure. Rather than working with the flat list of cells, we use the read_html function from pandas to parse the first table on the page into a DataFrame. We then move the misread header row back into the data, derive a country column, keep only the player rows, and apply our column names.
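from io import BytesIO
# Wrap the response bytes in a file-like object; recent pandas versions no longer accept literal HTML
tmp = pd.read_html(BytesIO(requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).content))[0]
df = (
    tmp.T.reset_index().T  # to slip down the incorrect 'England' header
    .assign(country=lambda x: x.pop(3).str.split(".").str[0].ffill())  # country label, forward-filled
    .iloc[1:].loc[tmp.iloc[:, -1].isna()]  # drop the former header row, keep rows whose last cell is empty
    .set_axis(cols + ["country"], axis=1)
)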
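The chain above is dense, so here is the same sequence of steps applied to a tiny hand-built table. The layout of this stand-in `tmp` is our assumption about what read_html returns for the page: a country heading that was misread as the column header (pandas renames repeated labels to England.1, England.2, and so on), player rows whose fourth cell is empty, and later country headings repeated across all four cells.

```python
import numpy as np
import pandas as pd

cols = ["player-name", "League", "team_name"]

# Hypothetical stand-in for the parsed table (assumed layout, see above).
tmp = pd.DataFrame(
    [
        ["A. Iwobi", "Premier League", "Fulham", np.nan],
        ["Yemen", "Yemen", "Yemen", "Yemen"],
        ["S. Danjuma", "Yemeni League", "Al Ahli San'a", np.nan],
    ],
    columns=["England", "England.1", "England.2", "England.3"],
)

df = (
    tmp.T.reset_index().T                  # header labels become the first data row
    .assign(country=lambda x: x.pop(3).str.split(".").str[0].ffill())  # "England.3" -> "England"
    .iloc[1:]                              # drop the row that held the old header
    .loc[tmp.iloc[:, -1].isna()]           # keep rows whose last original cell was empty
    .set_axis(cols + ["country"], axis=1)  # apply the final column names
)
print(df)
```

The surviving rows keep their original index labels (0 and 2 here), because the country heading row in between is filtered out rather than renumbered.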
Output and Explanation
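Printing the final data frame df shows the scraped data in four columns: “player-name”, “League”, “team_name”, and the derived “country”. The country column is built by a lambda function that removes the fourth scraped column, splits each of its values at the dot (.), keeps the first part, and forward-fills the result so that every player row inherits the country heading above it. The filtered rows keep their original index labels, which is why the labels run up to 1082 although the frame has only 975 rows.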
print(df)
Output:
      player-name          League          team_name  country
0        A. Iwobi  Premier League             Fulham  England
1      T. Awoniyi  Premier League  Nottingham Forest  England
2         O. Aina  Premier League  Nottingham Forest  England
3       F. Onyeka  Premier League          Brentford  England
4       C. Bassey  Premier League             Fulham  England
...           ...             ...                ...      ...
1078   S. Danjuma   Yemeni League      Al Ahli San'a    Yemen
1079  M. Alhassan   Yemeni League    Yarmuk al Rawda    Yemen
1080     A. Nweze   Yemeni League    Yarmuk al Rawda    Yemen
1081  A. Olalekan   Yemeni League      Al Sha'ab Ibb    Yemen
1082     A. Adisa   Yemeni League          Al Urooba    Yemen
[975 rows x 4 columns]
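The index labels in this output have gaps because rows were filtered out. If consecutive labels are preferred, one option is reset_index with drop=True. The sketch below uses two rows copied from the output above; the names `filtered` and `renumbered` are ours.

```python
import pandas as pd

# Two rows from the output above, keeping their original (gappy) index labels.
filtered = pd.DataFrame(
    {
        "player-name": ["A. Iwobi", "S. Danjuma"],
        "League": ["Premier League", "Yemeni League"],
        "team_name": ["Fulham", "Al Ahli San'a"],
        "country": ["England", "Yemen"],
    },
    index=[0, 1078],
)

# drop=True discards the old labels instead of keeping them as a new column.
renumbered = filtered.reset_index(drop=True)
print(renumbered)
```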
Conclusion
In this article, we created a three-column data frame with pandas and filled it with data scraped from a website. We also post-processed the scraped table by moving a misread header row back into the data, deriving a country column, and filtering out the rows that do not describe players.
We hope that this article has provided you with a better understanding of working with pandas and scraping data from websites in Python. If you have any questions or need further clarification on any of the concepts discussed in this article, please don’t hesitate to ask.
Last modified on 2023-08-11