Introduction to Downloading CSV Files from Folders and Subfolders with URL in R
As a data analyst, you often need access to large datasets to make informed decisions. In this blog post, we will explore how to download all CSV files from folders and subfolders at a URL using the curl package in R.
Background on the Problem Statement
The problem statement describes a scenario where we need to retrieve CSV files containing weather data for various stations from a specific URL. The files are organized hierarchically: each month folder contains one file per station. We want to pull the data for a specific station (e.g., ABRT.csv) from all the subfolders within the parent directory.
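For concreteness, the remote layout implied by the URLs used later in this post looks roughly like the sketch below. The exact tree is an assumption; only the base URL and the year/month/station pattern come from the code.
# Assumed layout of the remote directory tree (illustrative, not verified):
# daily_rain/
#   2012/
#     201201/
#       ABRT.csv   <- one file per station
#       ...
#     201202/
#       ...
#   2013/
#     ...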
Using the dir() Function
The provided R code snippet uses the dir() function to list the files in a directory, here the "climate_data/temperature/" folder. The recursive = TRUE argument searches subdirectories recursively, while full.names = TRUE returns the full path of each file.
dir("climate_data/temperature/", recursive = TRUE, full.names = TRUE, pattern = "\\ABRT.csv$")
However, dir() only lists files on the local filesystem; it cannot enumerate files behind a URL. It also requires us to know the directory structure and the file names in advance. We need a more efficient method to download all CSV files without manually specifying the directories or file names.
Using the expand.grid() Function
The given R code snippet uses the expand.grid() function to create a data frame that contains all possible combinations of years and months. It then constructs a URL for each combination by concatenating the base URL with the year and month values.
# All year/month combinations for 2012-2021
eg <- expand.grid(2012:2021, sprintf("%02d", 1:12))
# Remote URL pattern: .../daily_rain/<year>/<yearmonth>/ABRT.csv
eg$url <- paste0("http://tiservice.hii.or.th/opendata/data_catalog/daily_rain/",
                 eg[, 1], "/", eg[, 1], eg[, 2], "/ABRT.csv")
# Local destination file name, e.g. "2012_01_ABRT.csv"
eg$dest <- paste(eg[, 1], eg[, 2], "ABRT.csv", sep = "_")
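A quick sanity check on the constructed URLs. Note that expand.grid() varies the first column (the year) fastest, so the first rows step through the years for month 01:
head(eg$url, 2)
# [1] "http://tiservice.hii.or.th/opendata/data_catalog/daily_rain/2012/201201/ABRT.csv"
# [2] "http://tiservice.hii.or.th/opendata/data_catalog/daily_rain/2013/201301/ABRT.csv"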
This approach ensures that we cover all possible combinations of years and months, but it requires us to manually specify the base URL.
Using the curl::curl_download() Function
The given R code snippet uses the curl::curl_download() function to download each CSV file to a local directory. However, this approach still requires us to know the full URL of each file in advance, which is not ideal.
for (i in 1:nrow(eg)) {
  # Download row i's URL to row i's destination file
  curl::curl_download(eg$url[i], eg$dest[i])
}
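Note that a station file may be missing for some months, and curl_download() raises an error when a request fails; left as-is, one failure stops the whole loop. Here is a minimal, more defensive sketch, assuming the eg data frame built above:
for (i in seq_len(nrow(eg))) {
  tryCatch(
    # Attempt the download; report and skip any URL that fails
    curl::curl_download(eg$url[i], eg$dest[i]),
    error = function(e) message("Skipping ", eg$url[i], ": ", conditionMessage(e))
  )
}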
Alternative Solution Using xml2 and read.csv()
To overcome the limitations of the previous approaches, we can use the xml2 package to parse the HTML structure of the webpage and extract the CSV file URLs. We then download each file with curl and load it with the read.csv() function.
First, install the required packages:
install.packages(c("xml2", "curl"))
Then, load the packages:
library(xml2)
library(curl)
Next, use the following code to extract the CSV file URLs, download each file, and read it with read.csv():
# Load necessary libraries
library(xml2)
library(curl)

# Set base URL
base_url <- "http://tiservice.hii.or.th/opendata/data_catalog/daily_rain/"

# Parse the HTML directory listing and pull the href attribute of every link
page  <- read_html(base_url)
hrefs <- xml_attr(xml_find_all(page, ".//a"), "href")

# Keep only links that point to CSV files
csv_hrefs <- hrefs[grepl("\\.csv$", hrefs)]

# Build full URLs and matching local file names in a temporary directory
csv_urls  <- paste0(base_url, csv_hrefs)
csv_dests <- file.path(tempdir(), basename(csv_hrefs))

# Download each CSV file, load it, and print a preview of its contents
for (i in seq_along(csv_urls)) {
  curl::curl_download(csv_urls[i], csv_dests[i])
  data <- read.csv(csv_dests[i])
  print(head(data))
}
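In practice, a listing page like base_url may link to year and month subfolders rather than to CSV files directly, so the same parsing step has to be applied recursively. Below is a minimal sketch under the assumption that each subfolder also serves an HTML index with anchor links; follow_csv_links is a hypothetical helper name, not part of xml2:
# Hypothetical recursive crawler: descend into subfolder links, collect CSV URLs
follow_csv_links <- function(url) {
  hrefs <- xml_attr(xml_find_all(read_html(url), ".//a"), "href")
  hrefs <- hrefs[!is.na(hrefs) & !hrefs %in% c("../", "/")]  # skip parent/root links
  full  <- paste0(url, hrefs)
  csvs  <- full[grepl("\\.csv$", full)]                      # CSV files at this level
  dirs  <- full[grepl("/$", full)]                           # subfolders to descend into
  c(csvs, unlist(lapply(dirs, follow_csv_links)))
}

csv_urls <- follow_csv_links(base_url)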
This solution uses the xml2 package to extract the CSV file URLs from the HTML structure of the webpage. It then downloads each file with curl::curl_download(), reads it with read.csv(), and prints a preview of its contents.
Conclusion
In this blog post, we explored how to download all CSV files from folders and subfolders using the curl package in R. We discussed several approaches: the dir() function, the expand.grid() function, the curl::curl_download() function, and an alternative approach using xml2 together with read.csv().
We hope that this post has provided you with a better understanding of how to tackle similar problems in the future. Remember to always explore different approaches and consider factors such as efficiency, scalability, and maintainability when solving complex data analysis problems.
Last modified on 2024-07-30