Step 1: Understand the Problem
We need to extract all the files from the dataset published at https://www.coes.org.pe/Portal/PostOperacion/Reportes/IEOD/2023/. The files are listed in tables, and we have to navigate through multiple levels of pages (year, month, day) to reach them.
Step 2: Identify the Web Scraper Tool
We will use the rvest package for web scraping. It provides functions for reading a page, including JavaScript-rendered pages via rvest::read_html_live(), and for extracting elements from it.
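As a minimal sketch of the calls used throughout this post (the URL and selector here are placeholders, not the real COES markup), the workflow is: open a live session, select elements with a CSS selector, and read their attributes or tables.
library(rvest)

# Hypothetical listing page, just to illustrate the calls used below
page <- read_html_live("https://example.com/listing")   # renders JavaScript, returns a live session

# Select elements with a CSS selector and read an attribute
ids <- page %>% html_elements("[id]") %>% html_attr("id")

# html_element() plus html_table() turn an HTML table into a data frame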
Step 3: Extract Monthly Links
First, we need to extract the links for each month of 2023. We can do this by finding the elements whose id attribute is “Post Operación/Reportes/IEOD/2023/” followed by the month number, and collecting those ids.
library(rvest)

url <- "https://www.coes.org.pe/Portal/PostOperacion/Reportes/IEOD/2023/"
ses <- read_html_live(url)  # read_html_live() renders the JavaScript-driven listing

# Collect every id on the page and keep the ones matching the month folders (1 to 12)
all_ids <- ses %>% html_elements("[id]") %>% html_attr("id")
months <- all_ids[all_ids %in% paste0("Post Operación/Reportes/IEOD/2023/", 1:12)]

monthly_urls <- paste0(url, "?path=", months)
Step 4: Extract Daily Links for Each Month
Next, we need to extract the links for each day within each month. The approach is the same: read each month page and keep the ids that correspond to the day folders.
daily_urls <- lapply(seq_along(monthly_urls), function(i) {
  ses <- rvest::read_html_live(monthly_urls[[i]])

  # Assumes the day folders extend the month id with "/<day>" (1 to 31)
  all_ids <- ses %>% html_elements("[id]") %>% html_attr("id")
  days <- all_ids[all_ids %in% paste0(months[[i]], "/", 1:31)]

  paste0(url, "?path=", days)
})
Step 5: Extract Table Links for Each Day
Then, for each day page, we need the links to the files themselves. We can do this by reading the table with id “tbDocumentLibrary”, which lists the available files, and building a download URL for each row.
table_urls <- lapply(daily_urls, function(day_url_list) {
  lapply(day_url_list, function(day_url) {
    ses <- rvest::read_html_live(day_url)

    # The file listing is the table with id "tbDocumentLibrary";
    # its "Nombre" column holds the file names
    files <- ses %>% html_element("#tbDocumentLibrary") %>% html_table()

    paste0("https://www.coes.org.pe/portal/browser/download?url=", day_url, "&", files$Nombre)
  })
})
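Once the download URLs are assembled, the files can be fetched, for example with download.file(). This is only a sketch: it assumes the download endpoint accepts URLs built as above, the destination folder name is arbitrary, and the URLs are percent-encoded first because they contain spaces and accented characters.
# A sketch: download every file into a local folder (folder name is arbitrary)
dir.create("ieod_2023", showWarnings = FALSE)

for (u in unlist(table_urls)) {
  dest <- file.path("ieod_2023", sub(".*&", "", u))  # keep only the file-name part
  download.file(utils::URLencode(u), destfile = dest, mode = "wb")
}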
Step 6: Extract File Names and Save Them
Finally, we need to extract the file names from the download URLs. Given how the URLs were built above, the file name is the part after the final separator (the last “&”).
file_names <- lapply(table_urls, function(month_list) {
  lapply(month_list, function(day_urls) {
    # Keep only the part after the final "&", i.e. the file name
    sub(".*&", "", day_urls)
  })
})
# print(file_names)
Please note that this simple string manipulation only works because of how the URLs were built in the previous step; URLs that do not follow that pattern will produce wrong file names, so check the output before relying on it.
Step 7: Save the File Names
We can now save these file names in a list or a data frame. How to store them depends on the final use case.
# create a named list, one element per month page
file_names <- setNames(file_names, paste0("File names for ", monthly_urls))
# print the result
print(file_names)
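If a data frame is more convenient, the nested list can be flattened; a minimal sketch (the column names here are arbitrary):
# One row per file, with the month page it came from
files_df <- data.frame(
  month_page = rep(monthly_urls, times = lengths(lapply(file_names, unlist))),
  file_name  = unname(unlist(file_names))
)
head(files_df)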
Please note that this is just an example of code we can use. There may be other ways to solve this problem based on your specific requirements.
Step 8: Consider Error Handling
We need to consider error handling in our web scraping script. For example, what if there are no files for a particular day, or a request times out? We should wrap each request in tryCatch() so that a single failure does not crash the entire program; a sketch follows below.
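A minimal sketch of such a wrapper, assuming we are happy to record NULL for pages that fail and carry on:
read_safely <- function(u) {
  tryCatch(
    rvest::read_html_live(u),
    error = function(e) {
      message("Failed to read ", u, ": ", conditionMessage(e))
      NULL  # return NULL so the caller can skip this page
    }
  )
}

# Example: read all day pages, then drop the ones that could not be read
pages <- lapply(unlist(daily_urls), read_safely)
pages <- Filter(Negate(is.null), pages)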
Step 9: Review the Code
Finally, we need to review our code for efficiency and performance. Each call to rvest::read_html_live() drives a headless browser and is therefore relatively slow, so is there any way to reduce the number of calls, for example by caching pages that have already been fetched, or by collecting the month and day ids in a single pass? A simple caching sketch is shown below.
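One simple option is to fetch each URL at most once and keep the result in a cache; a minimal sketch using an in-memory environment as a lookup table:
page_cache <- new.env()

read_cached <- function(u) {
  if (!exists(u, envir = page_cache, inherits = FALSE)) {
    assign(u, rvest::read_html_live(u), envir = page_cache)
  }
  get(u, envir = page_cache, inherits = FALSE)
}

# Repeated calls with the same URL now reuse the stored page
ses <- read_cached("https://www.coes.org.pe/Portal/PostOperacion/Reportes/IEOD/2023/")
Keeping many live browser sessions in memory can itself be costly, so in practice it may be lighter to cache only the extracted ids or table contents rather than the sessions themselves.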