Webscraping Data Table from Sports Website using rvest
Introduction
Webscraping is the process of extracting data from websites. In this blog post, we will focus on how to webscrape a specific table from a sports website using R and its associated libraries, specifically rvest.
Background
The National Rugby League (NRL) website provides up-to-date information about the NRL's rugby league competitions. The ladder page contains the competition table for each round, which can be useful for data analysis or other purposes.
The NRL ladder page is built with HTML and CSS, and its table content is generated dynamically. The structure of this content can still be extracted using web scraping techniques.
Step 1: Prepare R Environment
To begin webscraping, we need to install and load the necessary R libraries. First, let's install rvest and the other required packages:
# Install rvest and other packages
install.packages("rvest")
install.packages("jsonlite")
Now, we can load these packages in our R environment:
## Load required packages
library(rvest)
library(jsonlite)
Step 2: Inspect the Website Structure
We need to inspect the website structure using Inspect Element or Chrome DevTools. Open the NRL ladder page in a web browser and open the developer tools.
By inspecting the HTML code of the webpage, we can find out where the table data is stored. Let’s look at the HTML structure for each row in the competition table:
## Inspect website structure using Chrome DevTools
Inspect Element:
- HTML: <div class="row">
- Class: row
- Attributes: {"data-attribute": "q-data"}
In this case, the data is stored inside a div element with class row and an attribute named data-attribute.
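As a minimal sketch of how such a div can be selected with rvest, the example below parses a small inline HTML snippet instead of the live page. The markup is hypothetical and only mirrors the structure described above (class row, attribute data-attribute); the real NRL page may differ.

```r
library(rvest)

# Hypothetical markup modelled on the structure described above;
# the real NRL page may differ.
page <- minimal_html('
  <div class="row" data-attribute="q-data">Row one</div>
  <div class="row" data-attribute="q-data">Row two</div>
  <div class="other">Not a row</div>
')

# CSS selector: div elements that have the class "row"
rows <- html_elements(page, "div.row")

# Read one attribute and the visible text from each matched node
row_attrs <- html_attr(rows, "data-attribute")
row_text <- html_text2(rows)

print(row_attrs)
print(row_text)
```

Selecting with a CSS selector such as "div.row" keeps the filtering inside rvest, so no extra data frame functions are needed at this stage.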
Step 3: Extract Data from Website
We can now write R code to extract the table data:
## Data URL for the ladder (competition, round and season are query parameters)
url <- "https://www.nrl.com/ladder//data?competition=111&round=27&season=2023"
## Read HTML content of website using rvest
page <- read_html(url)
## Find all div elements with class row
div_nodes <- page %>%
  html_elements("div.row")
## Extract data attribute from each div element
data_attributes <- div_nodes %>%
  html_attr("data-attribute")
## Convert each data attribute from JSON to an R object
json_data <- lapply(data_attributes, jsonlite::fromJSON)
However, this code still doesn't produce the desired result. The URL returns JSON rather than an HTML page, so rvest finds no div elements to extract from. To get around this issue, we can request the JSON directly with httr2.
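To illustrate why the div-based approach comes back empty, here is a small offline sketch. It assumes the data URL returns a JSON body; the JSON string is made up, and minimal_html() stands in for reading the URL, since it simply wraps the text in an HTML skeleton.

```r
library(rvest)

# Made-up JSON body standing in for what the data URL returns
json_body <- '{"positions": [{"teamNickname": "Example"}]}'

# minimal_html() wraps the text in an HTML skeleton; this is a stand-in
# for parsing a non-HTML response as if it were a web page
page <- minimal_html(json_body)

# There are no div elements in this document, so nothing is matched
div_rows <- html_elements(page, "div.row")
n_rows <- length(div_rows)
print(n_rows)
```

An empty node set like this is a hint that the data should be requested and parsed as JSON rather than scraped from HTML.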
Step 4: Use httr2 for Web Scraping
Let’s start by installing and loading necessary packages:
## Install and load required packages
install.packages(c("httr2", "purrr"))
library(httr2)
library(purrr)  # provides pluck()
We can now read the JSON data directly from the NRL data URL using httr2:
## Data URL for the ladder
url <- "https://www.nrl.com/ladder//data?competition=111&round=27&season=2023"
## Send HTTP GET request to website
response <- request(url) %>%
  req_perform()
## Parse the response body as JSON and keep the "positions" element
json_response <- response %>%
  resp_body_string() %>%
  fromJSON(simplifyVector = TRUE) %>%
  purrr::pluck("positions")
This code sends an HTTP GET request to the NRL data URL and parses the response body as JSON. It then uses the pluck function to extract the positions element, which holds the competition table data.
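The parsing half of this step can be tried offline. The sketch below runs fromJSON and pluck on a made-up JSON string: only the top-level positions key comes from this post, and the field names inside each entry are hypothetical.

```r
library(jsonlite)
library(purrr)

# Made-up payload: only the top-level "positions" key comes from the post;
# the field names inside each entry are hypothetical.
json_text <- '{
  "positions": [
    {"teamNickname": "Team A", "stats": {"played": 24, "points": 42}},
    {"teamNickname": "Team B", "stats": {"played": 24, "points": 40}}
  ]
}'

# Parse the JSON text; arrays of objects are simplified to data frames
parsed <- fromJSON(json_text, simplifyVector = TRUE)

# pluck() extracts the element named "positions" from the parsed list
positions <- pluck(parsed, "positions")

print(class(positions))
print(names(positions))
```

With simplifyVector = TRUE, an array of similar objects typically comes back as a data frame, and nested objects become nested columns, which is why a flattening step follows.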
Step 5: Convert JSON Data to tibble
We can now convert the extracted JSON data into a tibble:
## Convert JSON data to tibble
library(tibble)  # provides as_tibble()
library(tidyr)   # provides unnest()
tibble_data <- json_response %>%
  as_tibble() %>%
  unnest(everything())
This step converts the extracted data into a tibble using the as_tibble function. It then unnests any nested columns of the tibble using the unnest function.
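To see what unnest does, here is a small self-contained sketch on made-up data. The team and results fields are hypothetical and chosen so that the nested column is a list of data frames; the shape of the real NRL response may differ.

```r
library(jsonlite)
library(tibble)
library(tidyr)

# Made-up data: each team carries a small array of per-round results
teams <- fromJSON('[
  {"team": "Team A", "results": [{"round": 1, "points": 2}, {"round": 2, "points": 0}]},
  {"team": "Team B", "results": [{"round": 1, "points": 0}, {"round": 2, "points": 2}]}
]')

# "results" is a list column holding one data frame per team
nested <- as_tibble(teams)

# unnest() expands the list column into one row per team and round
flat <- unnest(nested, results)

print(flat)
```

Naming the column to unnest, as done here, makes the intent explicit; everything() is a shortcut that applies the same step to all columns.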
Step 6: Display Competition Table Data
Finally, we can display the competition table data:
## Display competition table data
tibble_data %>%
print()
This code displays the extracted and converted competition table data in a neat format.
Conclusion
In this blog post, we learned how to webscrape the competition table from the NRL ladder page using R. We went through each step of the process, from inspecting the website structure to displaying the extracted data in a tibble format. We first tried rvest, then used httr2 to request the data, with the jsonlite library used to convert the JSON data into R objects.
The code from Steps 4 to 6 can be saved as an R script file (e.g., webscrape_nrl_ladder.R) in your working directory and run using source("webscrape_nrl_ladder.R").
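If the script is meant to be reused for other rounds or seasons, one option is to build the request from parameters. The helper below is a hypothetical sketch, not part of the original code: the function name nrl_ladder_request is made up, and it only constructs the request without sending it.

```r
library(httr2)

# Hypothetical helper (not from the original post): builds the request
# for a given competition, round and season without sending it.
nrl_ladder_request <- function(competition = 111, round = 27, season = 2023) {
  base <- request("https://www.nrl.com/ladder//data")
  req_url_query(base, competition = competition, round = round, season = season)
}

req <- nrl_ladder_request(round = 26)

# Inspect the URL that would be requested
print(req$url)
```

Passing the result to req_perform() would send the request; keeping construction and sending separate makes the URL easy to check before any network call.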
Last modified on 2023-09-12