Parallelizing R Code on a Server with mclapply and Lattice Plotting Issues: Optimization Strategies for High-Performance Computing

As the demand for data analysis and visualization grows, optimizing computational performance becomes increasingly important. One way to do this in R is to parallelize code with the mclapply function from the parallel package. In this article, we explore how to use mclapply on a server or HPC (High-Performance Computing) system and investigate the issues that arise when combining it with Lattice plotting.

Background

The problem at hand is as follows: an external function creates a series of plots. For each plot, it reads a raster object (a raster sum computed in advance), plots this raster, and then draws a vector layer from a shapefile on top of it. The code uses mclapply to parallelize the process, but when it runs on a server or HPC, the Lattice plotting fails with an “error packet 1: object not found” error.

What is mclapply?

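The mclapply function is a parallel version of lapply: it applies one function to each element of a list or vector and spreads the calls across multiple cores. It works by forking the current R session into worker processes, whose number is set by mc.cores, and by default it divides the input among them in advance. The results are collected and returned as a list, in the same order as the input. Because it relies on forking, mclapply runs in parallel only on Unix-like systems such as Linux and macOS; on Windows, mc.cores must be 1.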
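As a minimal sketch of this behavior, the snippet below runs the same toy function through lapply and mclapply and compares the results. The function square and the worker count of 2 are illustrative assumptions, not part of the original problem.

```r
library(parallel)

# Hypothetical toy task standing in for real work
square <- function(x) x^2

# Forking is not available on Windows, so fall back to a single worker there
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

serial.result   <- lapply(1:6, square)
parallel.result <- mclapply(1:6, square, mc.cores = n.cores)

# mclapply returns a list, in input order, just like lapply
is.list(parallel.result)
identical(serial.result, parallel.result)
```

Because the two calls are interchangeable, it is usually easiest to debug a function with lapply first and switch to mclapply once it works serially.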

How to Set Up an mclapply Job

To set up an mclapply job on a server or HPC, we need to follow these steps:

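  1. Set up the Server/HPC Environment

    • Install R with the required packages (e.g., lattice; the parallel package ships with R) and configure the environment for parallel processing.
  2. Detect the Available Cores

    • Use the detectCores() function to determine the number of cores on the server/HPC node. It reports the cores of the machine, which can be more than the number a job scheduler has allocated to your job.
  3. Set the Number of Worker Processes

    • Set the number of worker processes using the mc.cores argument in mclapply.
  4. Make Sure the Workers Can See the Data

    • mclapply forks the current session, so the workers inherit every object that exists when it is called, and no cluster is required. makeCluster() and clusterExport() are needed only for cluster-based functions such as parLapply().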
  5. Run the mclapply Job

    • Run the mclapply job using the specified number of worker processes.
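The core-detection, worker-count, and run steps can be sketched as follows. This is only an illustration: the object scale.factor, the cap of 2 workers, and the toy computation are assumptions made for the example. Since mclapply forks the session, the sketch uses no cluster and no export step.

```r
library(parallel)

# Count the cores R can see on this machine (can be NA on some platforms)
available <- detectCores()
if (is.na(available)) available <- 1L

# Choose a worker count; capped at 2 here purely for the demonstration,
# and forced to 1 on Windows, where forking is unavailable
n.workers <- if (.Platform$OS.type == "windows") 1L else max(1L, min(2L, available))

# An object created in the parent session before the workers are forked
scale.factor <- 10

# Forked workers read scale.factor directly from the inherited session
scaled <- mclapply(1:5, function(i) i * scale.factor, mc.cores = n.workers)
unlist(scaled)
```

On a shared HPC node, the worker count should normally follow the scheduler's allocation rather than the raw detectCores() value.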

Using Lattice for Plotting

Lattice is a popular data visualization package in R that provides a range of plotting functions, including levelplot. In this section, we discuss how to use Lattice for plotting and the issues that may arise when running it on a server/HPC.

Understanding Lattice

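Lattice implements Trellis graphics and is designed to be highly extensible: users can build custom plots by combining its high-level functions with panel functions, and packages such as latticeExtra and rasterVis build on it. The function that matters most for this article is levelplot, which draws a false-color level plot of a surface, for example from a matrix. A Lattice function returns a trellis object rather than drawing directly, so inside a function or a worker process the plot appears only when that object is passed to print().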
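A short sketch of a level plot written to a file is given here. It uses the volcano matrix that ships with R as stand-in data and a temporary PDF file as the output; both are assumptions for illustration. The pdf() device is used because it needs no display, which suits a headless server.

```r
library(lattice)

# volcano is a numeric matrix from the datasets package, used as stand-in data
demo.plot <- levelplot(volcano, col.regions = terrain.colors(100),
                       xlab = "row", ylab = "column")

# levelplot() only builds a trellis object; nothing has been drawn yet
class(demo.plot)

# Draw it into a file: open the device, print the object, close the device
out.file <- tempfile(fileext = ".pdf")
pdf(out.file)
print(demo.plot)
dev.off()

file.exists(out.file)
```

Forgetting the print() call inside a function is a common reason for empty output files.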

Issues with Lattice on Server/HPC

When running Lattice plotting on a server/HPC, several issues can arise:

  • Object Not Found Error: This error means that the plotting code refers to an object by name that R cannot find in the environment where the plot is drawn. In the provided code snippet, this issue occurs when trying to plot the vector layer.
  • Memory Issues: Running Lattice plotting on a server/HPC can consume significant amounts of memory, especially with large datasets, because every worker process builds its own plot in memory.
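When such an error happens inside a worker, it helps to see which input failed and why. One possible approach, sketched below, is to catch the error inside the worker and return the condition object. The function risky.task and its deliberately failing input are hypothetical and only mimic an object not found failure.

```r
library(parallel)

# Hypothetical task that fails for one input, mimicking a worker-side error
risky.task <- function(i) {
  if (i == 3) stop("object 'missing.object' not found")
  i^2
}

# Forking is not available on Windows, so fall back to a single worker there
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Catch the error in the worker and return the condition instead of a value
results <- mclapply(1:4, function(i) {
  tryCatch(risky.task(i), error = function(e) e)
}, mc.cores = n.cores, mc.preschedule = FALSE)

# Find the failed inputs and read their messages in the parent session
failed <- vapply(results, inherits, logical(1), what = "error")
which(failed)
conditionMessage(results[[3]])
```

With the failing index and message in hand, the same input can be rerun serially with lapply for closer inspection.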

Troubleshooting and Workarounds

To troubleshoot issues with Lattice plotting on a server/HPC, try the following workarounds:

  1. Use get() to Retrieve Objects

    • Use the get() function to retrieve objects by name within the plot function, before the plot is built. This can help resolve object not found errors.
  2. Read Vector Shapefile Inside Plot Function

    • Read the vector shapefile inside the plot function on each call, instead of referring to it by a hardcoded object name.
  3. Hardcode Name in Plot Function

    • Hardcode the name of the shapefile object in the plot function to avoid issues with dynamic variable names.
  4. Export Objects to the Workers

    • If you use a cluster created with makeCluster() (for example with parLapply) instead of mclapply, export the required objects to the workers with clusterExport() so that they exist there.
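The first workaround can be sketched with base R alone. The helper fetch.layer, the environment data.env, and the object names are hypothetical; the point is only that looking an object up by name in an explicit environment, and failing early with a clear message, is easier to debug than an error raised deep inside the plotting code.

```r
# Hypothetical stand-in for the session that holds the spatial objects
data.env <- new.env()
assign("buildings.demo", c(12.5, 7.0, 3.2), envir = data.env)

# Look an object up by name in a known environment and fail early if absent
fetch.layer <- function(obj.name, envir) {
  if (!exists(obj.name, envir = envir, inherits = FALSE)) {
    stop("layer '", obj.name, "' not found; create or load it before plotting")
  }
  get(obj.name, envir = envir, inherits = FALSE)
}

# Succeeds: the object exists in data.env
layer.ok <- fetch.layer("buildings.demo", data.env)

# Fails with a descriptive message instead of a bare "object not found"
layer.missing <- tryCatch(fetch.layer("roads.demo", data.env),
                          error = function(e) conditionMessage(e))
layer.missing
```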

Code Optimization

To optimize Lattice plotting code for performance, follow these best practices:

  1. Use Efficient Data Structures

    • Use efficient data structures like matrices instead of data frames.
  2. Optimize Plotting Parameters

    • Optimize plotting parameters such as resolution and color scales to reduce computational load.
  3. Minimize Memory Usage

    • Minimize memory usage by loading only the necessary libraries and by writing intermediate results to temporary files rather than keeping them all in memory.
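One way to keep memory use low in a parallel plotting job is to let each worker write its plot straight to a file and return only the file path. The sketch below assumes the volcano matrix as stand-in data, a temporary output directory, and a hypothetical helper plot.one; none of these come from the original code.

```r
library(parallel)
library(lattice)

# Temporary output directory, used here only for the demonstration
out.dir <- tempfile("maps_")
dir.create(out.dir)

# Hypothetical helper: draw one map to a PDF file and return just its path
plot.one <- function(i) {
  field <- volcano * i                      # stand-in for a precomputed raster
  out.file <- file.path(out.dir, sprintf("map_%04d.pdf", i))
  pdf(out.file)
  on.exit(dev.off())                        # close the device even on error
  print(levelplot(field, main = paste("Step", i)))
  out.file
}

# Forking is not available on Windows, so fall back to a single worker there
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Only short character strings travel back from the workers
saved <- unlist(mclapply(1:3, plot.one, mc.cores = n.cores, mc.preschedule = FALSE))
all(file.exists(saved))
```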

Example Use Case

To demonstrate the use of mclapply with Lattice plotting on a server/HPC, consider the following example code:

# Define external function to create series of plots
conc.plot <- function(i, main.list.con.file, path, dupl.sources = FALSE, tm.series = tm, bldng.shp = "buildings.vector", color.scale.type = "macc"){

    # Load required libraries and data
    library(raster)
    library(rasterVis)
    library(grid)
    library(lattice)
    library(sp)
    library(latticeExtra)
    library(rgdal)

    # Retrieve objects using get()
    conc.field <- get(paste0("sum.rast.",i))

    # Define plotting parameters
    if(color.scale.type == "arbitrary"){
        scale.tick <- seq(1,211,2)
        scale.label <- c("very low", "low", "medium", "high", "very high")
        scale.label.at <- c(10,40,80,150,200)
        scale.col <- colorRampPalette(rev(c('#a50026','#d73027','#f46d43','#fdae61','#fee090','#ffffbf','#e0f3f8','#abd9e9','#74add1','#4575b4','#313695','#a1d99b')))
    } else {
        # Only the "arbitrary" scale is defined in this example; fail early for any other type
        stop("Unsupported color.scale.type: ", color.scale.type)
    }

    # Plot raster data
    time.step <- as.integer(sub(".*\\b(\\d{5})\\b.*", "\\1", main.list.con.file[i]))
    # Retrieve the vector layer by name here; layer() evaluates its expression only at print time,
    # so the object is handed over explicitly through the data argument
    bldng <- get(bldng.shp)
    png(filename = paste(path, "conc_map_lev_",sprintf("%04d",time.step), ".png", sep=""), width = 300*7, height = 300*5, res=300, pointsize = 12, type="cairo")
    print(rasterVis::levelplot(conc.field, margin=FALSE, maxpixels=1e12,
                               main = format(tm.series$date[time.step],"%B %d, %H:%M %Z", tz="Europe/Rome"),
                               col.regions = scale.col, at = scale.tick, colorkey = list(at = scale.tick, labels = list(at = scale.label.at, labels = scale.label), col = scale.col)) +
          latticeExtra::layer(sp::sp.polygons(bldng), data = list(bldng = bldng)))
    dev.off()

    # Print message indicating successful plotting
    message(paste0("Saved concentration map for time step ", time.step,", i.e. ",format(tm.series$date[time.step],"%B %d, %H:%M", tz="Europe/Rome")))
}

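# Use about half of the detected cores, plus one, as worker processes
mc <- round(parallel::detectCores() * 0.5) + 1

# No cluster or clusterExport() call is needed: mclapply forks the current session,
# so each worker already sees buildings.vector and the sum.rast.* objects

# Run mclapply job with plotting function
list.conc <- which(dupl.sg)
# Assumption: main.list.con.file is the character vector of concentration file names
# from which conc.plot extracts the five-digit time step
parallel::mclapply(list.conc,
                   function(i) conc.plot(i, main.list.con.file = main.list.con.file, path = conc.file.path,
                                         bldng.shp = "buildings.vector", color.scale.type = "arbitrary"),
                   mc.cores = mc, mc.preschedule = FALSE)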

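In this example code, we define an external function conc.plot that creates a series of plots using Lattice. We then derive the number of worker processes from detectCores() and run the mclapply job with the plotting function. Because mclapply forks the R session, the workers inherit the objects defined in the parent session, so a cluster created with makeCluster() plays no part in the mclapply call.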

By following these best practices and workarounds, you can optimize your R code for parallel processing on a server/HPC and resolve issues with Lattice plotting.


Last modified on 2024-06-03