Extracting Text from HTML after a Specific String in R

=====================================================

In this article, we will explore how to extract the text that follows a specific string in HTML files. This problem often comes up when working with large collections of unstructured documents, such as the 20k HTML files mentioned in the Stack Overflow question.

We will use the rvest package for web scraping and the stringr package for regular expressions to solve this problem.

Introduction

The rvest package is a popular choice for web scraping in R due to its ease of use and powerful features. However, its selectors work on the element structure of a document, so they are less helpful when the text of interest is not tied to one consistent tag.

The stringr package, on the other hand, provides a consistent set of functions for manipulating strings with regular expressions. Once the HTML has been reduced to plain text, it can be used to pull out the text that follows a specific string.

Problem Statement

The problem is as follows:

We have a large number of HTML files containing unstructured data.
The data is stored in p tags, which can be nested within various HTML elements such as headings (h1 to h6), div, or td.
There are multiple paragraphs after each heading called “Principal Risks”.
Our goal is to extract all text relating to “Principal Risks” from these files.

Solution

To solve this problem, we will use the following steps:

Load the necessary packages.
Create a sample HTML file with the specified structure.
Use rvest to load the HTML file and extract all text.
Use stringr and regular expressions to extract the desired text from the extracted text.

Loading Necessary Packages

We will start by loading the necessary packages in R:

# Load required libraries
library(rvest)
library(stringr)

Creating a Sample HTML File

Next, we will create a sample HTML file with the specified structure. This will help us test our solution without having to navigate through large files.

# Create example HTML whose text contains "Principal Risks"
html <- "<html><body><p>Lorem ipsum dolor sit amet, Principal Risks consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</p>
<p>Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam,
eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo.</p></body></html>"

# Write it to disk so that read_html("example.html") has a file to load
writeLines(html, "example.html")

Extracting Text using rvest

We will use rvest to load the HTML file and extract all text.

# Load the example HTML file
doc <- read_html("example.html")

# Extract all text from the HTML file
all_text <- html_text(doc)
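
To make the behaviour of html_text concrete, here is a minimal sketch. It assumes rvest 1.0 or later (for html_elements), and the inline HTML string is invented for illustration. Called on the whole document, html_text returns a single string with the text of every node joined together; called on a set of nodes, it returns one string per node.

```r
library(rvest)

# Hypothetical document: a heading followed by paragraphs in different containers
doc_demo <- read_html(
  "<html><body><h2>Principal Risks</h2><p>First risk.</p><div><p>Second risk.</p></div></body></html>"
)

# One string for the whole document
whole_text <- html_text(doc_demo)

# One string per <p> element, wherever it sits in the tree
paragraphs <- html_text(html_elements(doc_demo, "p"))
print(paragraphs)
```

The whole-document string is the kind of input that the regular expressions in this article operate on.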

However, we need to find a way to extract the desired text that follows the “Principal Risks” string. We will use regular expressions for this purpose.

Using Regular Expressions with Stringr

Regular expressions are used to match patterns in strings. In our case, we want to extract the text that follows the “Principal Risks” string, up to the last period on the same line.

# Extract the desired text using regular expressions and stringr
desired_text <- str_extract(all_text, "(?<=Principal Risks).*\\.")

In this code snippet:

(?<=Principal Risks) is a positive lookbehind. It requires the match to be immediately preceded by “Principal Risks” without including that string in the result.
.* matches any character except a line break, zero or more times. It is greedy, so it runs to the last period on the line rather than the first; use .*? to stop at the first period instead.
\. is a literal dot character. The backslash (\) escapes it because a dot has a special meaning in regular expressions, and inside an R string the backslash itself must be doubled, which is why the pattern contains \\.
To let . match line breaks as well, wrap the pattern in regex() with dotall = TRUE.
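
The following sketch, using a made-up two-line string, shows how the lookbehind, the greedy and lazy quantifiers, and the dotall option change what is extracted. regex() and its dotall argument are part of stringr; everything else here is illustrative.

```r
library(stringr)

# Invented sample: two sentences after the marker, then a second line
x <- "Intro. Principal Risks are many. They vary.\nNext line here."

# Greedy: runs to the last period on the same line
greedy <- str_extract(x, "(?<=Principal Risks).*\\.")

# Lazy: stops at the first period
lazy <- str_extract(x, "(?<=Principal Risks).*?\\.")

# dotall = TRUE lets . match line breaks, so the match runs to the last period overall
across_lines <- str_extract(x, regex("(?<=Principal Risks).*\\.", dotall = TRUE))

print(c(greedy, lazy, across_lines))
```

Which variant is appropriate depends on whether the text of interest is a single sentence, a single line, or everything after the marker.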

Handling Multiple Matches

We might want to extract every piece of text that follows a “Principal Risks” string, not just the first match. To do this, we can use str_extract_all with the same pattern, which returns a list holding one character vector of matches per input string:

# Extract all desired texts using str_extract_all
desired_texts <- str_extract_all(all_text, "(?<=Principal Risks).*\\.")
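
As a small illustration of the return value, the sketch below runs the same pattern over two invented strings. str_extract_all returns a list with one element per input, and an input without a match yields an empty character vector.

```r
library(stringr)

# Invented inputs: the first contains the marker twice, the second not at all
pages <- c(
  "Principal Risks include fraud.\nOther text. Principal Risks also include theft.",
  "No heading here."
)

matches <- str_extract_all(pages, "(?<=Principal Risks).*\\.")

# Number of matches found in each input
print(lengths(matches))

# Flatten to a single character vector if the per-input grouping is not needed
all_matches <- unlist(matches)
print(all_matches)
```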

However, each match above begins with whatever whitespace separates the “Principal Risks” string from the text that follows it, and nothing is matched when that text starts on the next line. To handle this, we can consume the whitespace in the pattern and capture only the text after it:

# Extract all desired texts using str_match_all, skipping whitespace after the heading
desired_texts <- str_match_all(all_text, "\\bPrincipal Risks\\s*(.*\\.)")[[1]][, 2]

In this code snippet:

\\b matches a word boundary. It ensures that “Principal” is matched as a whole word and not as the tail of a longer one.
\\s* matches zero or more whitespace characters (spaces, tabs, or line breaks) between “Principal Risks” and the text, so they are left out of the result.
(.*\\.) is a capturing group that holds the text up to the last period on the line. str_match_all returns one matrix per input string, with the full match in the first column and the captured group in the second, which is why we select [, 2].
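
One way to see the effect of a word boundary and of skipping whitespace is the sketch below. The sample strings and the capture-group pattern are assumptions made for illustration; str_match returns the full match in the first column and the captured group in the second.

```r
library(stringr)

x <- c(
  "Principal Risks\n  The group faces credit risk.",  # text starts on the next line
  "SubPrincipal Risks text."                          # not a whole-word match
)

# \\b requires a word boundary before "Principal"; \\s* skips spaces and line breaks
m <- str_match(x, "\\bPrincipal Risks\\s*(.*\\.)")

# Column 2 holds the captured text without the leading whitespace
captured <- m[, 2]
print(captured)
```

An input that does not match produces NA in every column, which makes it easy to filter out afterwards with is.na.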

Handling Large HTML Files

Handling large HTML files can be challenging because the extracted text can be very long. rvest parses a whole document at once, so chunking has to happen on the extracted text rather than on the document object. Because the pattern used below never matches across a line break, we can split the text into lines and process them in chunks without cutting a match in half:

# Define a function to extract text in chunks of lines
extract_text_in_chunks <- function(text, chunk_size = 1000) {
  # Split the text into lines so that no line is cut in half
  lines <- str_split(text, "\n")[[1]]

  # Initialize an empty character vector to store the extracted texts
  desired_texts <- character(0)

  # Walk through the lines in chunks of a certain size (e.g., 1000 lines)
  for (i in seq(from = 1, to = length(lines), by = chunk_size)) {
    # Select the lines of the current chunk; the parentheses matter because : binds tighter than +
    chunk <- lines[i:min(i + chunk_size - 1, length(lines))]

    # Extract the text after "Principal Risks" on each line and trim surrounding whitespace
    chunk_desired_texts <- str_trim(str_extract(chunk, "(?<=Principal Risks).*\\."))

    # Append only the lines that produced a match
    desired_texts <- c(desired_texts, chunk_desired_texts[!is.na(chunk_desired_texts)])
  }

  # Return the character vector of extracted texts
  return(desired_texts)
}

# Extract text from a large HTML file using extract_text_in_chunks
desired_texts <- extract_text_in_chunks(html_text(doc))
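
With a collection of thousands of files, a common way to keep memory use low is to process one file at a time, so that only one parsed document is held in memory. The sketch below is one possible approach, not the method from the original question; extract_from_file and the temporary demo files are hypothetical names and data created for the example.

```r
library(rvest)
library(stringr)

# Hypothetical helper: returns the text following "Principal Risks" in one file
extract_from_file <- function(path) {
  text <- html_text(read_html(path))
  matches <- str_extract_all(text, "(?<=Principal Risks).*\\.")[[1]]
  str_trim(matches)
}

# Illustrative setup: two small files in a temporary directory
demo_dir <- tempfile("html_demo")
dir.create(demo_dir)
writeLines(
  "<html><body><h2>Principal Risks</h2> <p>Credit risk is high.</p></body></html>",
  file.path(demo_dir, "a.html")
)
writeLines(
  "<html><body><p>Nothing relevant.</p></body></html>",
  file.path(demo_dir, "b.html")
)

# Process each file independently and name the results after the files
files <- list.files(demo_dir, pattern = "\\.html$", full.names = TRUE)
results <- lapply(files, extract_from_file)
names(results) <- basename(files)
print(results)
```

Files without the marker contribute an empty character vector, so they can be dropped afterwards with results[lengths(results) > 0].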

Conclusion

In this article, we learned how to extract the text that follows a specific string in HTML files. We used rvest to read the HTML and stringr to apply regular expressions to the extracted text.

The main steps in the solution are:

Load the necessary packages.
Create a sample HTML file with the specified structure.
Use rvest to load the HTML file and extract all text.
Use stringr and regular expressions to extract the desired text from the extracted text.

We also discussed how to handle multiple matches and how to process the text of large HTML files in chunks.

Last modified on 2023-06-04