Extracting Table-Like Data from HTML in R
When working with web scraping, one of the biggest challenges is navigating and extracting data from dynamically generated content. In this article, we’ll explore how to scrape a table-like index from HTML in R.
Introduction
Web scraping involves extracting data from websites when it is not provided in an easily accessible format. One common approach is to use specialized packages such as `rvest` and `xml2` to parse HTML and XML documents. However, when dealing with dynamically generated content, things can get tricky.
In this article, we’ll focus on extracting data from the table shown at [this website](https://www.icpsr.umich.edu/web/NAHDAP/search/variables?start=0&sort=STUDYID asc,DATASETID asc,STARTPOS asc&SERIESFULL_FACET_Q=606|Population Assessment of Tobacco and Health (PATH) Study Series&DATASETTITLE_FACET=Wave 4: Youth / Parent Questionnaire Data&EXTERNAL_FLAG=1&ARCHIVE=NAHDAP&rows=1000#). The table contains variable IDs, question text, variable type, and origin dataset from ICPSR’s PATH Survey data.
Understanding the Table Structure
The first step in scraping this table is to understand its structure. Upon inspection, we can see that it is not a traditional HTML table built from `<table>`, `<tr>`, and `<td>` elements. Instead, it is a script-generated table that is populated dynamically by JavaScript during page rendering, from data embedded in the page source.
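We can verify this before reaching for heavier tools. The snippet below is a minimal sketch using an illustrative stand-in for the real page (not its actual markup): the rows live in a JavaScript object inside a `<script>` block, so a conventional table lookup finds nothing.

```r
library(rvest)

# Illustrative stand-in for the page: the data sits in a JavaScript
# object inside a <script> block, not in an HTML <table>
page <- minimal_html('
  <div id="results"></div>
  <script>
    var cfg = { searchResults : {"response":{"docs":[]}}, searchConfig : {} };
  </script>')

# A conventional table lookup comes back empty
length(html_elements(page, "table"))  # 0
```

Seeing zero `<table>` nodes on the real page is the signal to go after the embedded script data instead.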
Using Regular Expressions
One approach to extracting this data is to use regular expressions (regex). We can use the `stringr` package to search for patterns in the raw page source and extract the relevant information.
```r
library(rvest)
library(stringr)
library(magrittr)
library(jsonlite)

# The query string contains spaces and pipes, so percent-encode it first
url <- URLencode('https://www.icpsr.umich.edu/web/NAHDAP/search/variables?start=0&sort=STUDYID asc,DATASETID asc,STARTPOS asc&SERIESFULL_FACET_Q=606|Population Assessment of Tobacco and Health (PATH) Study Series&DATASETTITLE_FACET=Wave 4: Youth / Parent Questionnaire Data&EXTERNAL_FLAG=1&ARCHIVE=NAHDAP&rows=1000#')

s <- read_html(url) %>%
  html_text()

# Capture the JavaScript object assigned to searchResults in the page source
r <- stringr::str_match(s, 'searchResults : (\\{.*\\}), searchConfig')
data <- jsonlite::parse_json(r[1, 2])
docs <- data$response$docs
```
In this code snippet, we first read the HTML document using `read_html()` and flatten it to text with `html_text()`. We then use `stringr::str_match()` to search the page source for the JavaScript object that contains the table data, capturing it as a JSON string.
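Because this technique depends on the site's exact markup, it is worth failing loudly when the pattern stops matching. A small sketch on toy text in the same shape as the real page:

```r
library(stringr)

# Toy page text in the same shape as the real page's embedded script
s <- 'var cfg = { searchResults : {"response":{"docs":[]}}, searchConfig : {} };'

r <- str_match(s, 'searchResults : (\\{.*\\}), searchConfig')

# Fail loudly if the site changes its markup and the pattern stops matching
if (is.na(r[1, 2])) stop("searchResults object not found in page source")

r[1, 2]  # '{"response":{"docs":[]}}'
```

`str_match()` returns a matrix whose first column is the full match and whose second column is the capture group, which is why the JSON string lives at `r[1, 2]`.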
Handling JSON Data
Once we’ve extracted the JavaScript object, we need to handle the resulting JSON data. In this case, `data$response$docs` is a list in which each element represents a single row of the table. Because `jsonlite::parse_json()` has already converted the JSON string into R’s native data types, we can manipulate and analyze the data using standard R tools and functions.
```r
# Get the first document from the list (this is just an example - adjust
# the index depending on your specific requirements)
doc <- docs[[1]]

# docs has already been parsed, so each element is a named list;
# inspect its structure rather than re-parsing it
str(doc)
```
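On a small illustrative payload, the structure looks like this (the field names `VARNAME` and `QTEXT` below are placeholders for demonstration, not the real schema):

```r
library(jsonlite)

# Illustrative JSON in the same shape as the extracted searchResults object;
# VARNAME and QTEXT are placeholder field names, not the real schema
json <- '{"response":{"docs":[{"VARNAME":"R01_AC1002","QTEXT":"Example question"}]}}'

data <- parse_json(json)
docs <- data$response$docs

length(docs)       # 1
docs[[1]]$VARNAME  # "R01_AC1002"
```

With `parse_json()`'s defaults, nothing is simplified to vectors, so each document stays a named list that can be indexed with `$`.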
Creating a Data Frame
One of the most common goals of web scraping is to produce a data frame that can be analyzed directly in R. In this case, we’ll use the `dplyr` package to transform our parsed data into a tidy data frame.
```r
# Load the dplyr library
library(dplyr)

# Collapse the docs list into a data frame, one row per variable.
# The field names STUDYID and QUESTIONTEXT are assumptions - run
# names(docs[[1]]) to see what the index actually returns.
doc_df <- tibble(
  variable_id   = sapply(docs, function(d) d$STUDYID),
  question_text = sapply(docs, function(d) d$QUESTIONTEXT)
) %>%
  mutate(id = row_number())

# Print the resulting data frame
print(doc_df)
```
In this code snippet, we load the `dplyr` library and collapse the parsed list into a data frame with one row per variable. We then use the `mutate` function to add additional columns, such as a unique row ID.
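Putting the pieces together, the full inventory can be sketched end to end on illustrative data and written out as a CSV. The field names here are assumptions; substitute whatever `names(docs[[1]])` shows on the real response.

```r
library(dplyr)
library(jsonlite)

# Illustrative docs list; field names are placeholders for the real schema
docs <- parse_json('[
  {"VARNAME":"R01_AC1002","QTEXT":"Question one","VARTYPE":"numeric","DATASET":"Wave 4 Youth"},
  {"VARNAME":"R01_AC1003","QTEXT":"Question two","VARTYPE":"character","DATASET":"Wave 4 Youth"}
]')

# One row per variable: ID, question text, type, and origin dataset
inventory <- tibble(
  variable_id    = sapply(docs, `[[`, "VARNAME"),
  question_text  = sapply(docs, `[[`, "QTEXT"),
  variable_type  = sapply(docs, `[[`, "VARTYPE"),
  origin_dataset = sapply(docs, `[[`, "DATASET")
)

# Write the inventory matrix to a spreadsheet-friendly CSV
write.csv(inventory, "path_variable_inventory.csv", row.names = FALSE)
```

The resulting CSV is the spreadsheet inventory matrix of variable IDs and question text described in the introduction.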
Conclusion
Web scraping can be a powerful tool for extracting data from websites that are not provided in an easily accessible format. In this article, we’ve explored how to scrape a table-like index from HTML in R using regular expressions, JSON parsing, and data frame creation. By following these steps, you should be able to extract the variable IDs, question text, variable type, and origin dataset from ICPSR’s PATH Survey data and create a spreadsheet inventory matrix of variable IDs and their corresponding question text.
Further Reading
For more information on web scraping with R, we recommend checking out the following resources:
- The `rvest` package documentation: https://rvest.tidyverse.org/
- The `xml2` package documentation: https://xml2.r-lib.org/
- The `jsonlite` package documentation: https://github.com/jeroen/jsonlite
We also recommend checking out the following tutorials and guides:
- The rvest “Web scraping 101” vignette: https://rvest.tidyverse.org/articles/rvest.html
- “R for Data Science” by Hadley Wickham and Garrett Grolemund (2016): https://r4ds.had.co.nz/
Last modified on 2023-12-04