Webscraping Data Table from Sports Website using rvest

Introduction

Webscraping is the process of extracting data from websites. In this blog post, we will focus on how to webscrape a specific table from a sports website using R, starting with the rvest package and turning to httr2 where rvest alone is not enough.

Background

The National Rugby League (NRL) website provides up-to-date information about various rugby league competitions around the world. The ladder page of their website contains the competition table for each round, which can be useful for data analysis or other purposes.

The NRL ladder page is built using HTML and CSS, and its table content is generated dynamically rather than written into the page as a static table. Even so, the data behind it can still be extracted with web scraping techniques.

Step 1: Prepare R Environment

To begin webscraping, we need to install and load the necessary R packages. First, let’s install rvest and the other packages used in this post:

# Install rvest and other packages
install.packages(c("rvest", "jsonlite", "purrr", "tibble", "tidyr"))

Now, we can load these packages in our R environment:

## Load required packages
library(rvest)
library(jsonlite)
library(purrr)   # provides pluck()
library(tibble)  # provides as_tibble()
library(tidyr)   # provides unnest()

Step 2: Inspect the Website Structure

We need to inspect the website structure using Inspect Element or Chrome DevTools. Open the NRL ladder page in a web browser and open the developer tools.

By inspecting the HTML code of the webpage, we can find out where the table data is stored. Let’s look at the HTML structure for each row in the competition table:

## Inspect website structure using Chrome DevTools
Inspect Element:
  - HTML: <div class="row">
    - Class: row
    - Attributes: {"data-attribute": "q-data"}

In this case, the data is stored inside a div element with class row and an attribute named data-attribute.
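To make the idea concrete, here is a minimal sketch of pulling a JSON-valued attribute out of a div with rvest. It runs against a small hypothetical HTML snippet rather than the live page, and the attribute value is invented sample data, so treat it as an illustration of the technique only.

```r
library(rvest)
library(jsonlite)

# Hypothetical HTML standing in for the real page; the attribute value is
# invented sample data, not taken from the NRL site.
html <- minimal_html('
  <div class="row" data-attribute=\'{"team": "Example FC", "points": 10}\'></div>
  <div class="other">ignored</div>
')

# Select only the div elements that carry the class "row"
row_nodes <- html_elements(html, "div.row")

# Read the raw attribute value (a JSON string) from each matching element
raw_values <- html_attr(row_nodes, "data-attribute")

# Parse the first JSON string into an R list
parsed <- fromJSON(raw_values[[1]])
parsed$team
```

The CSS selector "div.row" does the filtering by class, so no extra filtering step is needed after selecting the elements.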

Step 3: Extract Data from Website

We can now write R code to extract the table data:

## Store the URL
url <- "https://www.nrl.com/ladder//data?competition=111&round=27&season=2023"

## Read the response as HTML using rvest
page <- read_html(url)

## Find all div elements with class row
div_nodes <- page %>% 
  html_elements("div.row")

## Extract the data attribute from each div element
data_attributes <- div_nodes %>% 
  html_attr("data-attribute")

## Try to parse each attribute value as JSON
json_data <- lapply(data_attributes, jsonlite::fromJSON)

However, this code still doesn’t produce the desired result: the URL returns JSON rather than an HTML page, so there are no div elements for rvest to select. To get around this issue, we can request the JSON directly with httr2.

Step 4: Use httr2 for Web Scraping

Let’s start by installing and loading necessary packages:

## Install and load required packages
install.packages("httr2")
library(httr2)

We can now read the JSON data directly from the same URL using httr2:

## Store the URL
url <- "https://www.nrl.com/ladder//data?competition=111&round=27&season=2023"

## Build the request and send an HTTP GET with httr2
response <- request(url) %>% 
  req_perform()

## Read the body as text, parse the JSON, and keep the "positions" element
json_response <- response %>% 
  resp_body_string() %>% 
  jsonlite::fromJSON(simplifyVector = TRUE) %>% 
  purrr::pluck("positions")

This code sends an HTTP GET request to the URL and reads the response body as text. The fromJSON function then parses the JSON into R objects, and the pluck function extracts the positions element, which holds the competition table data.
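The parsing step can be tried without any network access. The sketch below uses an invented JSON payload (the field names and values are assumptions made for illustration, and the real response has more fields and rows) to show what fromJSON and pluck each contribute.

```r
library(jsonlite)
library(purrr)

# Invented sample payload for illustration only
sample_json <- '{
  "positions": [
    {"teamNickname": "Team A", "points": 42},
    {"teamNickname": "Team B", "points": 40}
  ],
  "other": "ignored"
}'

# fromJSON() parses the text; with simplifyVector = TRUE an array of
# objects becomes a data frame
parsed <- fromJSON(sample_json, simplifyVector = TRUE)

# pluck() extracts a single named element from the parsed list
positions <- pluck(parsed, "positions")

class(positions)
nrow(positions)
```

Note that pluck only selects an element. The conversion from JSON to R objects has already happened in fromJSON.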

Step 5: Convert JSON Data to tibble

We can now convert the extracted JSON data into a tibble:

## Convert JSON data to tibble
tibble_data <- json_response %>% 
  as_tibble() %>% 
  unnest(everything())

This step converts the parsed JSON data into a tibble using the as_tibble function. It then unnests any nested columns using the unnest function, so that nested fields become ordinary columns.
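To see what unnesting does, here is a small sketch on invented data. It assumes each row has a nested object (called stats here purely for illustration), which fromJSON returns as a data frame stored inside a column.

```r
library(jsonlite)
library(tibble)
library(tidyr)

# Invented sample rows; "stats" is a nested object, so fromJSON() returns
# it as a data frame column inside the outer data frame
sample_json <- '[
  {"teamNickname": "Team A", "stats": {"played": 24, "points": 42}},
  {"teamNickname": "Team B", "stats": {"played": 24, "points": 40}}
]'

nested <- fromJSON(sample_json)
names(nested)   # the nested fields are hidden inside "stats"

# Convert to a tibble, then spread the nested fields into top-level columns
flat <- nested %>%
  as_tibble() %>%
  unnest(everything())

names(flat)
```

After unnesting, the fields that were inside stats appear alongside teamNickname as regular columns, which is the shape most analysis functions expect.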

Step 6: Display Competition Table Data

Finally, we can display the competition table data:

## Display competition table data
tibble_data %>% 
  print()

This code displays the extracted and converted competition table data in a neat format.

Conclusion

In this blog post, we learned how to webscrape the competition table from the NRL ladder page using R. We went through each step of the process, from inspecting the website structure to displaying the extracted data as a tibble. The rvest and httr2 packages were used for web scraping, and the jsonlite package was used to convert the JSON data into R objects.

The code from Steps 4 to 6, together with the library calls from Step 1, can be saved as an R script file (e.g., webscrape_nrl_ladder.R) in your working directory and run using source("webscrape_nrl_ladder.R").
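If you want to reuse the script for other rounds or seasons, one option is to wrap the steps in functions. The sketch below is only one way to do it: the helper names build_ladder_request and fetch_ladder are hypothetical, and it assumes the endpoint accepts the same three query parameters that appear in the URL used in this post.

```r
library(httr2)
library(purrr)
library(tibble)
library(tidyr)

# Hypothetical helper: builds (but does not send) the request for one round.
# The base URL and parameter names are taken from the URL used in this post.
build_ladder_request <- function(competition, round, season) {
  request("https://www.nrl.com/ladder//data") %>%
    req_url_query(competition = competition, round = round, season = season)
}

# Hypothetical helper: performs the request and returns the ladder as a tibble
fetch_ladder <- function(competition, round, season) {
  build_ladder_request(competition, round, season) %>%
    req_perform() %>%
    resp_body_json(simplifyVector = TRUE) %>%
    pluck("positions") %>%
    as_tibble() %>%
    unnest(everything())
}

# Usage (requires network access):
# ladder <- fetch_ladder(competition = 111, round = 27, season = 2023)
# print(ladder)
```

Separating request building from request sending makes it possible to inspect the generated URL before anything is sent to the server.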

Last modified on 2023-09-12