Scraping Collapsible Table Data in R Using RStudio's Webdriver and RSelenium Packages

Scraping Collapsible Table in R: A Step-by-Step Guide

Introduction

In this article, we will explore how to scrape data from a collapsible table using R and the RSelenium package. We’ll also cover some alternative approaches that can simplify the process.

The original post provided a solution for scraping the main table, but the poster was struggling to extract the sub-table data for each company. In this article, we will discuss how to approach this problem systematically and provide an example of how to scrape the entire dataset using RSelenium.

Alternative Approach: Extracting Data from JavaScript Source

Unless RSelenium is a strict requirement here, extracting the data from the JavaScript source seems like a more straightforward approach in this case.

If you check the network tab of your browser while loading the page and fiddling with the tables, you can identify the actual data source through Search (Ctrl + F in Chrome for Windows). The only meaningful match is a minified JS script; all the details for all companies are embedded right there.

Let’s start by extracting that script URL from the page source; we can then fetch the script and execute it to obtain the data. This approach requires some technical knowledge of JavaScript and how it interacts with web pages.

Extracting Script URL

To extract the script URL, you can use the rvest package in R (jsoup is a Java library, not an R package).

library(rvest)

url <- "https://example.com"

# Parse the HTML document
doc <- read_html(url)

# Find all <script> tags on the page
script_tags <- doc %>%
  html_elements("script")

# Extract the src attribute from each script tag
urls <- html_attr(script_tags, "src")

urls

This code will extract all script URLs from the HTML document. You can then use these URLs to fetch and execute the scripts.
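
In practice, many of these entries will be NA, because inline scripts have no src attribute, so you will want to narrow the list down to the one minified bundle that contains the data. Here is a minimal sketch, assuming the target file name matches a hypothetical pattern like "app.min":

# Drop inline scripts and keep the minified bundle (hypothetical pattern)
urls <- urls[!is.na(urls)]
target <- urls[grepl("app\\.min", urls)]
target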

Fetching and Executing Scripts

Once you have extracted the script URLs, you can use curl or another HTTP client in R to fetch them, and the V8 package to execute them in an embedded JavaScript engine.

library(curl)
library(V8)

urls <- c("https://example.com/script1.js", "https://example.com/script2.js")

# Create an embedded JavaScript context
ctx <- v8()

for (url in urls) {
  # Fetch the script source
  script <- rawToChar(curl_fetch_memory(url)$content)

  # Execute the script in the embedded engine
  ctx$eval(script)
}

This code fetches each script and executes it inside an embedded JavaScript engine; any data the scripts define now lives in the ctx context.
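
After the scripts have run, the data they define can be pulled back into R through the same context. Here is a minimal sketch, assuming the fetched script assigns its payload to a hypothetical global variable named companyData:

library(V8)

ctx <- v8()

# Stand-in for the fetched script; the real one is evaluated the same way
ctx$eval("var companyData = [{name: 'Acme', revenue: 100}];")

# Retrieve the JavaScript array as an R data frame (converted via jsonlite)
df <- ctx$get("companyData")
df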

Alternative Approach: Using the webdriver Package

There is an alternative approach using the webdriver package in R, a headless PhantomJS client. This package allows you to interact with web pages programmatically without running a full Selenium server.

library(webdriver)

# Start a headless PhantomJS process and open a session
# (requires PhantomJS; see install_phantomjs())
pjs <- run_phantomjs()
ses <- Session$new(port = pjs$port)

# Navigate to the page
ses$go("https://example.com")

# Type a query into the search bar (hypothetical selector)
ses$findElement(css = "input[type='search']")$sendKeys("query")

# Click on a link (hypothetical selector)
ses$findElement(css = "a.details")$click()

# Wait for some time to allow the page to load
Sys.sleep(1)

# Get all script tags on the page
script_tags <- ses$findElements(css = "script")

# Extract the src attribute from each script tag
urls <- sapply(script_tags, function(x) x$getAttribute("src"))

urls

This code interacts with the web page programmatically through a headless browser, allowing you to extract data without having to run a full RSelenium setup.
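
Because the headless browser runs a real JavaScript engine, you can often skip the script-tag parsing entirely and evaluate JavaScript in the page itself. Here is a minimal sketch, assuming the page exposes its data in a hypothetical global variable named companyData:

# Evaluate JS in the page and return the result to R (hypothetical variable)
data <- ses$executeScript("return window.companyData;")
str(data)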

Conclusion

Scraping data from a collapsible table in R can be challenging. In this article, we discussed two alternative approaches: extracting the data from the JavaScript source and using the webdriver package. We also touched on some practical considerations for scraping web pages programmatically.

Whether you choose to extract data from the JavaScript source or use the webdriver package, the most important thing is to approach the problem systematically and methodically.

Example Code

Here is an example code that combines all of the steps discussed in this article:

library(RSelenium)

# Launch a Selenium server and browser client
rD <- rsDriver(browser = "firefox")
remDr <- rD$client

# Navigate to the webpage
remDr$navigate("https://example.com")

# Get all script tags on the page
script_tags <- remDr$findElements(using = "css selector", "script")

# Extract the src attribute from each script tag
urls <- sapply(script_tags, function(x) x$getElementAttribute("src")[[1]])

urls

# Close the browser and stop the server
remDr$close()
rD$server$stop()

This code launches a browser instance, navigates to the webpage, finds all script tags on the page, and extracts the URLs from them. Finally, it closes the browser and stops the Selenium server.
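
If you do need to expand the collapsible rows themselves rather than read the embedded script, the same RSelenium client can click each toggle before the page is parsed. Here is a minimal sketch, assuming a hypothetical CSS class .expand-toggle on the row toggles:

library(rvest)

# Click every row toggle to reveal the sub-tables (hypothetical selector)
toggles <- remDr$findElements(using = "css selector", ".expand-toggle")
for (t in toggles) {
  t$clickElement()
  Sys.sleep(0.5)  # give the sub-table time to render
}

# Parse the fully expanded page and pull out every table
page <- read_html(remDr$getPageSource()[[1]])
tables <- html_table(page)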

Note: This is just example code; you may need to modify it based on your specific requirements.


Last modified on 2024-09-03