Scraping Collapsible Table in R: A Step-by-Step Guide
Introduction
In this article, we will explore how to scrape data from a collapsible table using R and the RSelenium package. We'll also cover some alternative approaches that can simplify the process.
The original post provided a solution for scraping the main table, but the poster was struggling to extract the sub-table data for each company. Below, we discuss how to approach this problem systematically and provide an example of scraping the entire dataset with RSelenium.
Alternative Approach: Extracting Data from JavaScript Source
Unless RSelenium is a strict requirement here, extracting the data from the JavaScript source seems like the more straightforward approach in this case.
If you check the Network tab of your browser while loading the page and toggling the tables, you can identify the actual data source through Search (Ctrl+F in Chrome for Windows). The only meaningful match is a minified JS script; all the details for all companies are embedded right there.
Let's start by extracting that script URL from the page source; we can then fetch the script and pull the data out of it. This approach requires some working knowledge of JavaScript and how it interacts with web pages.
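To make this concrete, here is a minimal sketch of what such an embedded payload often looks like and how it can be pulled out of the script text once fetched. The variable name `companies` and the payload shape are hypothetical; a real minified script will need its own pattern.

```r
library(jsonlite)

# Hypothetical stand-in for the fetched minified script text
script <- 'var companies=[{"name":"Acme","revenue":100}];render(companies);'

# Capture the JSON array assigned to the (assumed) variable name
json <- sub('.*companies=(\\[.*?\\]);.*', '\\1', script, perl = TRUE)

# Parse the JSON into a data frame
companies <- fromJSON(json)
companies$name
```

In practice you would print a chunk of the real script first and adapt the regular expression to the actual variable name you find there.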
Extracting Script URL
To extract the script URL, you can use the rvest package in R. (The jsoup library sometimes suggested for this is Java, not R; rvest fills the same role here.)
library(rvest)

url <- "https://example.com"

# Get the HTML document
doc <- read_html(url)

# Find all script tags on the page
script_tags <- doc %>%
  html_elements("script")

# Extract the src attribute from each script tag
urls <- html_attr(script_tags, "src")
urls
This code extracts every script URL from the HTML document. You can then fetch the scripts and dig the data out of them.
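The list usually contains many irrelevant entries (analytics, frameworks, inline scripts), so it helps to filter for the candidate bundle and resolve relative paths. The `.min.js` filter below is a hypothetical heuristic; match against whatever filename you saw in the Network tab.

```r
library(xml2)

# Example values as html_attr() might return them (NA for inline scripts)
urls <- c(NA, "/static/app.min.js", "https://cdn.example.com/jquery.js")

# Keep only non-missing src values that look like a minified app bundle
candidates <- urls[!is.na(urls) & grepl("\\.min\\.js$", urls)]

# Resolve relative paths against the page URL
full_urls <- url_absolute(candidates, "https://example.com")
full_urls
```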
Fetching and Executing Scripts
Once you have the script URLs, you can use the curl package (or another HTTP client) to fetch the scripts, and the V8 package to execute them in an embedded JavaScript engine.
library(curl)
library(V8)

urls <- c("https://example.com/script1.js", "https://example.com/script2.js")

# Create an embedded JavaScript context
ctx <- v8()

for (url in urls) {
  # Fetch the script source as text
  script <- rawToChar(curl_fetch_memory(url)$content)
  # Execute the script in the V8 context
  ctx$eval(script)
}
This code fetches each script and executes it; any variables the scripts define can afterwards be read back into R from the context.
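To show the round trip without a network call, the sketch below evaluates an inline script as a stand-in for the fetched source, then reads the resulting value back into R with ctx$get(), which converts JavaScript values to R structures via JSON. The variable name `companies` is again a hypothetical placeholder.

```r
library(V8)

ctx <- v8()

# Stand-in for a fetched minified script that defines the data
ctx$eval('var companies = [{"name": "Acme"}, {"name": "Beta"}];')

# Read the JavaScript value back into R (an array of objects
# comes back as a data frame)
companies <- ctx$get("companies")
companies$name
```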
Alternative Approach: Using the chromote Package
Another option is driving a headless browser with the chromote package in R. (There is no webbrowser package in R; webbrowser is a Python standard-library module, and it can only open a URL, not interact with a page.) chromote controls a headless Chrome session over the DevTools protocol, which lets you interact with web pages programmatically.
library(chromote)

# Start a headless Chrome session
b <- ChromoteSession$new()

# Navigate to the page and wait for it to finish loading
b$Page$navigate("https://example.com")
b$Page$loadEventFired()

# Collect the src attribute of every script tag via JavaScript
urls <- b$Runtime$evaluate(
  "JSON.stringify([...document.querySelectorAll('script')].map(s => s.src))"
)$result$value

urls

# Close the session
b$close()
This code interacts with the web page programmatically, allowing you to extract data without having to use RSelenium.
Conclusion
Scraping data from a collapsible table in R can be challenging. In this article, we discussed two alternative approaches: extracting data from the JavaScript source and driving a headless browser with chromote.
Whether you extract the data from the JavaScript source or drive a browser, the most important thing is to approach the problem systematically and methodically.
Example Code
Here is example code that ties the RSelenium steps together:
library(RSelenium)

# Launch a new browser instance (rsDriver starts a Selenium server
# and returns a client connected to it)
driver <- rsDriver(browser = "firefox")
remDr <- driver$client

# Navigate to the webpage
remDr$navigate("https://example.com")

# Find all script tags on the page
script_tags <- remDr$findElements(using = "css selector", "script")

# Extract the src attribute from each script tag
urls <- sapply(script_tags, function(x) x$getElementAttribute("src"))
urls

# Close the browser and stop the Selenium server
remDr$close()
driver$server$stop()
This code launches a browser instance, navigates to the webpage, finds all script tags, and extracts their src attributes; finally, it closes the browser and stops the server.
Note: this is just example code; you may need to modify it based on your specific requirements.
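Back to the original problem of the collapsible sub-tables: once you have the page source (remDr$getPageSource()[[1]] in RSelenium), the hidden sub-tables are usually still present in the DOM as nested table elements and can be parsed with rvest. Below is a minimal sketch on an inline snippet; the class names are hypothetical stand-ins for whatever the real page uses.

```r
library(rvest)

# Inline stand-in for the fetched page source; "sub" is a
# hypothetical class name for the collapsible detail tables
html <- '<table id="main">
  <tr><td>Acme</td></tr>
  <tr><td>
    <table class="sub">
      <tr><td>Division</td><td>Revenue</td></tr>
      <tr><td>Widgets</td><td>100</td></tr>
    </table>
  </td></tr>
</table>'

doc <- read_html(html)

# Parse each nested sub-table into a data frame,
# using its first row as the header
sub_tables <- doc %>%
  html_elements("table.sub") %>%
  html_table(header = TRUE)

sub_tables[[1]]
```

Looping this over one sub-table per company row gives the per-company data the original poster was after.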
Last modified on 2024-09-03