How to Download CSV Files from Folders and Subfolders Using R's curl Package

Introduction to Downloading CSV Files from Folders and Subfolders via URL in R

As a data analyst, you often need access to large datasets to make informed decisions. In this blog post, we will explore how to download all CSV files from folders and subfolders using the curl package in R.

Background on the Problem Statement

The problem statement presents a scenario where we need to retrieve CSV files containing weather data for various stations from a specific URL. The files are organized hierarchically, with each station having its own file inside each month folder. We want to pull the data for one station (e.g., ABRT.csv) from all of the subfolders within the parent directory.

Using the dir() Function

The R code snippet below uses the dir() function to list the files in a directory, in this case the “climate_data/temperature/” folder. The recursive = TRUE argument searches subdirectories recursively, while full.names = TRUE returns the full path of each file.

dir("climate_data/temperature/", recursive = TRUE, full.names = TRUE, pattern = "\\ABRT.csv$")

However, dir() only searches the local file system, so this approach assumes the files are already on disk and that we know the directory structure and file names in advance. We need a way to download all of the CSV files from the remote server without manually specifying every directory or file name.

Using the expand.grid() Function

The given R code snippet uses the expand.grid() function to create a data frame that contains all possible combinations of years and months. It then constructs URLs for each combination by concatenating the base URL with the year and month values.

# All combinations of year (2012-2021) and zero-padded month ("01"-"12")
eg <- expand.grid(2012:2021, c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12"))

# Remote URL of ABRT.csv for each year/month folder, e.g. .../2012/201201/ABRT.csv
eg$url <- paste("http://tiservice.hii.or.th/opendata/data_catalog/daily_rain/", eg[, 1], "/",
                paste(eg[, 1], eg[, 2], sep = ""), "/", "ABRT.csv", sep = "")

# Local destination file name, e.g. 2012_01_ABRT.csv
eg$dest <- paste(eg[, 1], eg[, 2], "ABRT.csv", sep = "_")

This approach covers every combination of year and month, but it requires us to know the base URL and the folder naming convention in advance.
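
As a side note, the same table can be built a little more compactly with sprintf(), which handles the zero-padding of the month; this is just an equivalent sketch, not part of the original snippet:

# Equivalent construction using sprintf() to zero-pad the month
eg <- expand.grid(year = 2012:2021, month = sprintf("%02d", 1:12),
                  stringsAsFactors = FALSE)
eg$url <- sprintf("http://tiservice.hii.or.th/opendata/data_catalog/daily_rain/%s/%s%s/ABRT.csv",
                  eg$year, eg$year, eg$month)
eg$dest <- paste(eg$year, eg$month, "ABRT.csv", sep = "_")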

Using the curl::curl_download() Function

The next snippet uses the curl::curl_download() function to download each CSV file to the working directory. However, this still relies on knowing the full URL of every file in advance, which is not ideal.

for (i in 1:nrow(eg)) {
  curl::curl_download(eg$url[i], eg$dest[i])
}
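
Some year/month combinations may not exist on the server, and curl_download() stops with an error when a URL is missing. A defensive variant (not part of the original answer) wraps the call in tryCatch() so the loop keeps going:

# Skip year/month combinations that are missing on the server instead of stopping
for (i in seq_len(nrow(eg))) {
  tryCatch(
    curl::curl_download(eg$url[i], eg$dest[i]),
    error = function(e) message("Skipping ", eg$url[i], ": ", conditionMessage(e))
  )
}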

Alternative Solution Using xml2 and read.csv() Functions

To overcome the limitations of the previous approaches, we can use the xml2 package to parse the HTML structure of the directory listing and extract the CSV file URLs. We then download each file with curl::curl_download() and load it with read.csv().

First, install the required packages:

install.packages(c("xml2", "curl"))

Then, load the packages:

library(xml2)
library(curl)

Next, use the following code to extract the CSV file URLs, download them with curl, and read them with read.csv():

# Load necessary libraries
library(xml2)
library(curl)

# Set base URL
base_url <- "http://tiservice.hii.or.th/opendata/data_catalog/daily_rain/"

# Use xml2 to parse the HTML directory listing and extract all link targets
doc   <- read_html(base_url)
hrefs <- xml_attr(xml_find_all(doc, ".//a"), "href")

# Keep only links that point to CSV files
csv_hrefs <- hrefs[grepl("\\.csv$", hrefs)]

# Construct the full URL and a local file name for each CSV link
csv_urls <- lapply(csv_hrefs, function(href) {
  list(url = paste0(base_url, href), dest = basename(href))
})

# Download each CSV file and load it with read.csv()
for (result in csv_urls) {
  # Download the CSV file to a temporary directory
  temp_file <- file.path(tempdir(), result$dest)
  curl::curl_download(result$url, temp_file)

  # Load the CSV file and print its first rows
  data <- read.csv(temp_file)
  print(head(data))
}

This solution uses the xml2 package to extract the CSV file URLs from the HTML structure of the directory listing. It then downloads each file with curl::curl_download(), reads it with read.csv(), and prints its first rows. Note that, as written, it only picks up CSV links on the index page itself; reaching files inside the year and month subfolders requires following those folder links as well.
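
Because the rain data actually sits inside year/month subfolders rather than on the index page, the same parsing step has to be repeated for each folder listing. The function below is a rough sketch of that idea; it assumes every subfolder link in the listing ends with a slash and that each listing is plain HTML, which may not match the site's actual layout:

# Walk the folder listings recursively and collect every CSV URL (a sketch;
# assumes folder links end with "/" and parent-directory links start with "..")
collect_csv_urls <- function(url) {
  hrefs   <- xml_attr(xml_find_all(read_html(url), ".//a"), "href")
  hrefs   <- hrefs[!is.na(hrefs)]
  csvs    <- hrefs[grepl("\\.csv$", hrefs)]
  folders <- hrefs[grepl("/$", hrefs) & !grepl("^\\.\\.", hrefs)]
  urls    <- paste0(url, csvs)
  for (folder in folders) {
    urls <- c(urls, collect_csv_urls(paste0(url, folder)))
  }
  urls
}

# All CSV URLs under the base folder, including the year/month subfolders
all_csv_urls <- collect_csv_urls(base_url)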

Conclusion

In this blog post, we explored how to download all CSV files from folders and subfolders using the curl package in R. We discussed several approaches: listing local files with dir(), constructing URLs with expand.grid(), downloading them with curl::curl_download(), and parsing the directory listing with xml2 before reading each file with read.csv().

We hope that this post has provided you with a better understanding of how to tackle similar problems in the future. Remember to always explore different approaches and consider factors such as efficiency, scalability, and maintainability when solving complex data analysis problems.


Last modified on 2024-07-30