Parallelizing R Code on a Server with mclapply and Lattice Plotting Issues
As the demand for data analysis and visualization grows, it becomes increasingly important to optimize computational performance. One way to achieve this is to parallelize code with the mclapply function from the parallel package in R. In this article, we will explore how to use mclapply on a server with an HPC (High-Performance Computing) setup and investigate the issues that arise when working with Lattice plotting.
Background
The problem at hand is as follows: an external function creates a series of plots. For each plot it reads a raster sum that was computed in advance, plots this raster, and then draws a vector layer from a shapefile on top of it. The code uses mclapply to parallelize the process, but when running on a server or HPC, the Lattice plotting fails with an “error packet 1: object not found” error.
What is mclapply?
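The mclapply function in R is a parallel version of lapply: it applies one function to each element of a list or vector, spreading the calls across multiple cores. It works by forking the current R process into a specified number of worker processes, each of which handles part of the input. The results are then collected and returned as a list, in the same order as the input. Because it relies on forking, mclapply runs in parallel on Linux and macOS but not on Windows, where mc.cores must be 1.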
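A minimal sketch of the call shape described above. The input 1:4 and the squaring function are placeholders chosen for illustration, and the worker count of 2 is an assumption, not a recommendation.

```r
library(parallel)

# Forking is unavailable on Windows, where mc.cores must be 1
n.workers <- if (.Platform$OS.type == "windows") 1L else 2L

# Apply one function to each element of the input; the result is a list
squares <- mclapply(1:4, function(i) i^2, mc.cores = n.workers)

str(squares)
```

Calling unlist(squares) turns the list into a plain vector when every element has length one.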
How to Set Up an mclapply Job
To set up an mclapply job on a server or HPC, we need to follow these steps:
Set up the Server/HPC Environment
- Install R with the required packages (e.g., parallel, lattice) and configure the environment for parallel processing.
Detect the Available Cores
- Use the detectCores() function to determine the number of cores on the server/HPC. On a shared HPC node this is the number of cores in the machine, which may be larger than the number the scheduler has allocated to the job.
Set the Number of Worker Processes
- Set the number of worker processes using the mc.cores argument in mclapply.
Create a Cluster and Export Variables
- Create a cluster using the makeCluster() function and export variables to the workers with clusterExport(). This step applies to cluster-based functions such as parLapply(). mclapply does not use a cluster object: it forks the current R process, so its workers already see the objects of the parent session.
Run the mclapply Job
- Run the mclapply job using the specified number of worker processes.
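The steps above can be sketched as follows. The object offset and the choice of half the cores are hypothetical, picked only to show which workers need an explicit export; a socket cluster is included for contrast because it starts fresh R sessions.

```r
library(parallel)

# Step 2: cores reported for the machine (can be NA on some platforms)
n.cores <- detectCores()
if (is.na(n.cores)) n.cores <- 1L

# Step 3: leave some cores free; forking is unavailable on Windows
n.workers <- if (.Platform$OS.type == "windows") 1L else max(1L, n.cores %/% 2L)

offset <- 100  # hypothetical object that the workers need

# Step 5: forked workers see 'offset' without any export
res.fork <- mclapply(1:3, function(i) i + offset, mc.cores = n.workers)

# Step 4 only matters for socket clusters, whose workers are fresh R sessions
cl <- makeCluster(2)
clusterExport(cl, varlist = "offset", envir = environment())
res.sock <- parLapply(cl, 1:3, function(i) i + offset)
stopCluster(cl)
```

Both calls return the same list. The difference is only in how the workers come to know offset.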
Using Lattice for Plotting
Lattice is a popular data visualization package in R that provides a range of plotting functions, including levelplot. In this section, we will discuss how to use Lattice for plotting and potential issues that may arise when running on a server/HPC.
Understanding Lattice
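Lattice implements Trellis graphics for R and is designed to be highly extensible: users can build custom plots by combining existing panel functions. One of its high-level functions is levelplot, which draws a false-color level plot of a surface, for example the values of a matrix, and the rasterVis package extends it to raster objects. Lattice functions return a trellis object rather than drawing immediately, so inside a function or a loop the object has to be passed to print() for the plot to appear on the device.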
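As an illustration, the sketch below draws the built-in volcano matrix with levelplot from inside a function. The helper name save.levelplot is hypothetical, and a PDF device is used only because it is available on every R installation; png() with type = "cairo" can be swapped in where the server supports it.

```r
library(lattice)

# Hypothetical helper: draw a matrix as a level plot and save it to a file
save.levelplot <- function(z, file) {
  p <- levelplot(z, col.regions = terrain.colors(100),
                 xlab = "row", ylab = "column")
  pdf(file)            # open the graphics device
  on.exit(dev.off())   # close the device even if printing fails
  print(p)             # explicit print is required inside a function
  invisible(p)
}

out.file <- tempfile(fileext = ".pdf")
p <- save.levelplot(volcano, out.file)
```

Without the print() call the function would still return a trellis object, but the file would contain no plot.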
Issues with Lattice on Server/HPC
When running Lattice plotting on a server/HPC, several issues can arise:
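- Object Not Found Error: Lattice runs the panel code only when the plot is printed, and by default it catches an error raised there and writes it into the panel as “Error using packet 1” followed by the message. Here the message says that a name used while drawing could not be resolved in the environment where the drawing code ran. In the provided code snippet, this issue occurs when trying to plot the vector layer. A likely cause is that the layer object is referred to by name instead of being passed into the plotting call.
- Memory Issues: Running Lattice plotting on a server/HPC can consume significant amounts of memory, especially with large datasets, because every worker process opens its own graphics device and may hold its own copy of the data being plotted.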
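Because the default handler only writes the message into the image, a batch job can finish without any R error. The sketch below reproduces that behavior with a deliberately undefined name (missing.outline, hypothetical) and then uses the panel.error argument of the trellis print method to turn the problem into a regular error.

```r
library(lattice)

# A panel function that refers to an object that does not exist
bad.plot <- levelplot(volcano, panel = function(...) {
  panel.levelplot(...)
  panel.lines(missing.outline)   # hypothetical name, deliberately undefined
})

pdf(NULL)  # null device, nothing is written to disk

# Default: the error is caught and written into the panel, so print() succeeds
default.ok <- !inherits(try(print(bad.plot), silent = TRUE), "try-error")

# With panel.error = stop the same problem becomes a regular R error
strict <- try(print(bad.plot, panel.error = stop), silent = TRUE)
dev.off()

inherits(strict, "try-error")
```

Using the strict form while debugging makes the failing worker visible in the result of mclapply instead of only in the saved image.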
Troubleshooting and Workarounds
To troubleshoot issues with Lattice plotting on a server/HPC, try the following workarounds:
Use get() to Retrieve Objects
- Use the get() function to retrieve objects within the plot function. This can help resolve object not found errors.
Read Vector Shapefile Inside Plot Function
- Read the vector shapefile inside the plot function on each call, so that the layer exists in the function's own environment, instead of referring to an outside object by a hardcoded name.
Hardcode Name in Plot Function
- Hardcode the name of the shapefile object in the plot function to avoid issues with dynamic variable names.
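Export Objects to the Workers
- When using a socket cluster created with makeCluster(), export the required objects to the workers using clusterExport(), so that every worker has them defined. This addresses object not found errors, not memory use. Workers forked by mclapply inherit the objects of the parent session and need no export.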
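A sketch of the first workarounds combined, using only lattice and parallel. The environment store, the matrix outline, and the function plot.one are hypothetical stand-ins for the precomputed rasters, the building layer, and the plot function. The overlay is passed as an argument and drawn by a custom panel function, which is one way to keep it visible at print time.

```r
library(parallel)
library(lattice)

# Hypothetical stand-ins for the precomputed rasters and the building outline
store <- new.env()
for (i in 1:3) assign(paste0("sum.rast.", i), volcano * i, envir = store)
outline <- cbind(x = c(20, 60, 60, 20, 20), y = c(10, 10, 40, 40, 10))

out.dir <- tempfile("maps")
dir.create(out.dir)

plot.one <- function(i, overlay, env, dir) {
  # Look the object up by name in one known environment only
  field <- get(paste0("sum.rast.", i), envir = env, inherits = FALSE)
  # 'overlay' is a local argument, so the panel function finds it lexically
  p <- levelplot(field, panel = function(...) {
    panel.levelplot(...)
    panel.lines(overlay[, "x"], overlay[, "y"], col = "black")
  })
  file <- file.path(dir, sprintf("map_%02d.pdf", i))
  pdf(file)
  on.exit(dev.off())
  print(p)
  file
}

n.workers <- if (.Platform$OS.type == "windows") 1L else 2L
files <- mclapply(1:3, plot.one, overlay = outline, env = store,
                  dir = out.dir, mc.cores = n.workers)
```

Passing the environment and the overlay explicitly means the function does not depend on what happens to exist in the global workspace of a worker.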
Code Optimization
To optimize Lattice plotting code for performance, follow these best practices:
Use Efficient Data Structures
- Use efficient data structures like matrices instead of data frames.
Optimize Plotting Parameters
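- Optimize plotting parameters to reduce computational load, for example the size and resolution of the output device, the maxpixels argument of levelplot for raster objects, and the number of breaks in the color scale.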
Minimize Memory Usage
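- Minimize memory usage by loading only the necessary libraries, removing large intermediate objects once they are no longer needed, and writing each plot straight to a file instead of keeping plot objects in memory.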
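To make the first two points concrete, the sketch below compares the size of one grid stored as a matrix and as a long data frame, then thins the grid before plotting. The built-in volcano matrix stands in for real data, and plain subsampling is an assumption made for brevity: it discards cells rather than averaging them.

```r
# The same 87 x 61 grid stored as a matrix and as a long data frame
m <- volcano
df <- data.frame(x = as.vector(row(m)), y = as.vector(col(m)), z = as.vector(m))

size.matrix <- as.numeric(object.size(m))
size.frame  <- as.numeric(object.size(df))

# Keep every second row and column: roughly a quarter of the cells remain
m.coarse <- m[seq(1, nrow(m), by = 2), seq(1, ncol(m), by = 2)]

c(matrix = size.matrix, data.frame = size.frame)
dim(m.coarse)
```

The data frame is larger because it stores the row and column index of every cell in addition to the value.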
Example Use Case
To demonstrate the use of mclapply with Lattice plotting on a server/HPC, consider the following example code:
# Define external function to create series of plots
conc.plot <- function(i, main.list.con.file, path, dupl.sources = FALSE, tm.series = tm, bldng.shp = "buildings.vector", color.scale.type = "macc"){
  # Load required libraries and data
  library(raster)
  library(rasterVis)
  library(grid)
  library(lattice)
  library(sp)
  library(latticeExtra)
  library(rgdal)
  # Retrieve objects using get()
  conc.field <- get(paste0("sum.rast.",i))
  # Define plotting parameters
  if(color.scale.type == "arbitrary"){
    scale.tick <- seq(1,211,2)
    scale.label <- c("very low", "low", "medium", "high", "very high")
    scale.label.at <- c(10,40,80,150,200)
    scale.col <- colorRampPalette(rev(c('#a50026','#d73027','#f46d43','#fdae61','#fee090','#ffffbf','#e0f3f8','#abd9e9','#74add1','#4575b4','#313695','#a1d99b')))
  } else {
    # Only the "arbitrary" scale is defined here; fail early for anything else
    stop("color.scale.type '", color.scale.type, "' is not defined in this example")
  }
  # Extract the five-digit time step from the file name
  time.step <- as.integer(sub(".*\\b(\\d{5})\\b.*", "\\1", main.list.con.file[i]))
  # Plot raster data; print() is needed because levelplot returns a trellis object
  png(filename = paste(path, "conc_map_lev_",sprintf("%04d",time.step), ".png", sep=""), width = 300*7, height = 300*5, res=300, pointsize = 12, type="cairo")
  print(rasterVis::levelplot(conc.field, margin=FALSE, maxpixels=1e12,
    main = format(tm.series$date[time.step],"%B %d, %H:%M %Z", tz="Europe/Rome"),
    col.regions = scale.col, at = scale.tick, colorkey = list(at = scale.tick, labels = list(at = scale.label.at, labels = scale.label), col = scale.col)))
  dev.off()
  # Print message indicating successful plotting
  message(paste0("Saved concentration map for time step ", time.step,", i.e. ",format(tm.series$date[time.step],"%B %d, %H:%M", tz="Europe/Rome")))
}
# Use half of the detected cores plus one as the number of worker processes
mc <- round(parallel::detectCores() * 0.5) + 1
# No makeCluster() or clusterExport() call is needed: mclapply forks the current
# session, so the workers already see buildings.vector and the sum.rast.* objects
# Run mclapply job with plotting function
list.conc <- which(dupl.sg)
parallel::mclapply(list.conc, function(i) conc.plot(i, main.list.con.file = list.conc, path = conc.file.path, bldng.shp = "buildings.vector", color.scale.type = "arbitrary"), mc.cores = mc, mc.preschedule = FALSE)
In this example code, we define an external function conc.plot that creates a series of plots using Lattice. We then set the number of worker processes to half of the detected cores plus one and run the mclapply job with the plotting function. Because mclapply forks the R session, objects such as buildings.vector are visible to the workers without a cluster. makeCluster() and clusterExport() are only required for cluster-based functions such as parLapply().
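A failure in one worker does not stop an mclapply job, so it is worth checking the returned list before trusting the output files. The sketch below catches errors inside the worker and reports which inputs failed. The function risky and its error message are hypothetical and only simulate a plot that cannot find its vector layer.

```r
library(parallel)

n.workers <- if (.Platform$OS.type == "windows") 1L else 2L

# Hypothetical task that fails for one input
risky <- function(i) {
  if (i == 2) stop("object 'buildings.vector' not found")
  sprintf("conc_map_lev_%04d.png", i)
}

# Catch the error inside the worker and return it as a value
safe <- function(i) tryCatch(risky(i), error = function(e) e)

res <- mclapply(1:3, safe, mc.cores = n.workers, mc.preschedule = FALSE)

# Flag the elements that came back as error conditions
failed <- vapply(res, inherits, logical(1), what = "error")
which(failed)
```

The messages of the failed elements can then be read with conditionMessage() to see which object was missing.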
By following these best practices and workarounds, you can optimize your R code for parallel processing on a server/HPC and resolve issues with Lattice plotting.
Last modified on 2024-06-03