Extracting Table-Like Data from HTML in R
When working with web scraping, one of the biggest challenges is navigating and extracting data from dynamically generated content. In this article, we’ll explore how to scrape a table-like index from HTML in R.
Introduction
Web scraping involves extracting data from websites when it is not provided in an easily accessible format. One common approach is to use specialized packages such as `rvest` and `xml2` to parse HTML and XML documents. However, when dealing with dynamically generated content, things can get tricky.
In this article, we’ll focus on extracting data from the table shown at [this website](https://www.icpsr.umich.edu/web/NAHDAP/search/variables?start=0&sort=STUDYID asc,DATASETID asc,STARTPOS asc&SERIESFULL_FACET_Q=606|Population Assessment of Tobacco and Health (PATH) Study Series&DATASETTITLE_FACET=Wave 4: Youth / Parent Questionnaire Data&EXTERNAL_FLAG=1&ARCHIVE=NAHDAP&rows=1000#). The table contains variable IDs, question text, variable type, and origin dataset from ICPSR’s PATH Survey data.
Understanding the Table Structure
The first step in scraping this table is to understand its structure. Upon inspection, we can see that it is not a traditional HTML table built from `<table>`, `<tr>`, and `<td>` elements. Instead, it is a script-generated table that is populated dynamically by JavaScript during page rendering, from data embedded in the page source.
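We can verify this before reaching for heavier tools. The snippet below is a minimal sketch using an illustrative stand-in for the real page (not its actual markup): the rows live in a JavaScript object inside a `<script>` block, so a conventional table lookup finds nothing.

```r
library(rvest)

# Illustrative stand-in for the page: the data sits in a JavaScript
# object inside a <script> block, not in an HTML <table>
page <- minimal_html('
  <div id="results"></div>
  <script>
    var cfg = { searchResults : {"response":{"docs":[]}}, searchConfig : {} };
  </script>')

# A conventional table lookup comes back empty
length(html_elements(page, "table"))  # 0
```

Seeing zero `<table>` nodes on the real page is the signal to go after the embedded script data instead.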
Using Regular Expressions
One approach to extracting this data is to use regular expressions (regex). We can use the `stringr` package to search for patterns in the raw page source and extract the relevant information.
```r
library(rvest)
library(stringr)
library(magrittr)
library(jsonlite)

# The query string contains spaces and pipes, so percent-encode it first
url <- URLencode('https://www.icpsr.umich.edu/web/NAHDAP/search/variables?start=0&sort=STUDYID asc,DATASETID asc,STARTPOS asc&SERIESFULL_FACET_Q=606|Population Assessment of Tobacco and Health (PATH) Study Series&DATASETTITLE_FACET=Wave 4: Youth / Parent Questionnaire Data&EXTERNAL_FLAG=1&ARCHIVE=NAHDAP&rows=1000#')

s <- read_html(url) %>%
  html_text()

# Capture the JavaScript object assigned to searchResults in the page source
r <- stringr::str_match(s, 'searchResults : (\\{.*\\}), searchConfig')
data <- jsonlite::parse_json(r[1, 2])
docs <- data$response$docs
```
In this code snippet, we first read the HTML document using `read_html()` and flatten it to text with `html_text()`. We then use `stringr::str_match()` to search the page source for the JavaScript object that contains the table data, capturing it as a JSON string.
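Because this technique depends on the site's exact markup, it is worth failing loudly when the pattern stops matching. A small sketch on toy text in the same shape as the real page:

```r
library(stringr)

# Toy page text in the same shape as the real page's embedded script
s <- 'var cfg = { searchResults : {"response":{"docs":[]}}, searchConfig : {} };'

r <- str_match(s, 'searchResults : (\\{.*\\}), searchConfig')

# Fail loudly if the site changes its markup and the pattern stops matching
if (is.na(r[1, 2])) stop("searchResults object not found in page source")

r[1, 2]  # '{"response":{"docs":[]}}'
```

`str_match()` returns a matrix whose first column is the full match and whose second column is the capture group, which is why the JSON string lives at `r[1, 2]`.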
Handling JSON Data
Once we’ve extracted the JavaScript object, we need to handle the resulting JSON data. In this case, `data$response$docs` is a list in which each element represents a single row of the table. Because `jsonlite::parse_json()` has already converted the JSON string into R’s native data types, we can manipulate and analyze the data using standard R tools and functions.
```r
# Get the first document from the list (this is just an example - adjust
# the index depending on your specific requirements)
doc <- docs[[1]]

# docs has already been parsed, so each element is a named list;
# inspect its structure rather than re-parsing it
str(doc)
```
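On a small illustrative payload, the structure looks like this (the field names `VARNAME` and `QTEXT` below are placeholders for demonstration, not the real schema):

```r
library(jsonlite)

# Illustrative JSON in the same shape as the extracted searchResults object;
# VARNAME and QTEXT are placeholder field names, not the real schema
json <- '{"response":{"docs":[{"VARNAME":"R01_AC1002","QTEXT":"Example question"}]}}'

data <- parse_json(json)
docs <- data$response$docs

length(docs)       # 1
docs[[1]]$VARNAME  # "R01_AC1002"
```

With `parse_json()`'s defaults, nothing is simplified to vectors, so each document stays a named list that can be indexed with `$`.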
Creating a Data Frame
One of the most common goals of web scraping is to produce a data frame that can be analyzed directly in R. In this case, we’ll use the `dplyr` package to transform our parsed data into a tidy data frame.
```r
# Load the dplyr library
library(dplyr)

# Collapse the docs list into a data frame, one row per variable.
# The field names STUDYID and QUESTIONTEXT are assumptions - run
# names(docs[[1]]) to see what the index actually returns.
doc_df <- tibble(
  variable_id   = sapply(docs, function(d) d$STUDYID),
  question_text = sapply(docs, function(d) d$QUESTIONTEXT)
) %>%
  mutate(id = row_number())

# Print the resulting data frame
print(doc_df)
```
In this code snippet, we load the `dplyr` library and collapse the parsed list into a data frame with one row per variable. We then use the `mutate` function to add additional columns, such as a unique row ID.
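Putting the pieces together, the full inventory can be sketched end to end on illustrative data and written out as a CSV. The field names here are assumptions; substitute whatever `names(docs[[1]])` shows on the real response.

```r
library(dplyr)
library(jsonlite)

# Illustrative docs list; field names are placeholders for the real schema
docs <- parse_json('[
  {"VARNAME":"R01_AC1002","QTEXT":"Question one","VARTYPE":"numeric","DATASET":"Wave 4 Youth"},
  {"VARNAME":"R01_AC1003","QTEXT":"Question two","VARTYPE":"character","DATASET":"Wave 4 Youth"}
]')

# One row per variable: ID, question text, type, and origin dataset
inventory <- tibble(
  variable_id    = sapply(docs, `[[`, "VARNAME"),
  question_text  = sapply(docs, `[[`, "QTEXT"),
  variable_type  = sapply(docs, `[[`, "VARTYPE"),
  origin_dataset = sapply(docs, `[[`, "DATASET")
)

# Write the inventory matrix to a spreadsheet-friendly CSV
write.csv(inventory, "path_variable_inventory.csv", row.names = FALSE)
```

The resulting CSV is the spreadsheet inventory matrix of variable IDs and question text described in the introduction.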
Conclusion
Web scraping can be a powerful tool for extracting data from websites that are not provided in an easily accessible format. In this article, we’ve explored how to scrape a table-like index from HTML in R using regular expressions, JSON parsing, and data frame creation. By following these steps, you should be able to extract the variable IDs, question text, variable type, and origin dataset from ICPSR’s PATH Survey data and create a spreadsheet inventory matrix of variable IDs and their corresponding question text.
Further Reading
For more information on web scraping with R, we recommend checking out the following resources:
- The `rvest` package documentation: https://rvest.tidyverse.org/
- The `xml2` package documentation: https://xml2.r-lib.org/
- The `jsonlite` package documentation: https://github.com/jeroen/jsonlite
We also recommend checking out the following tutorials and guides:
- The rvest “Web scraping 101” vignette: https://rvest.tidyverse.org/articles/rvest.html
- “R for Data Science” by Hadley Wickham and Garrett Grolemund (2016): https://r4ds.had.co.nz/
Last modified on 2023-12-04