Parallel Computing in R Using Future Package and PuTTY for High-Performance Computing

Introduction to Parallel Computing with R and Future Package
===========================================================

In today’s world of big data and high-performance computing, parallel processing has become an essential technique for accelerating computational tasks. In this article, we will explore how to use the parallel and future packages in R to run code on a cluster of machines, with PuTTY's command-line tool plink acting as the SSH client.

Background and Prerequisites


Before diving into the code, it’s essential to understand the basics of parallel computing and the tools involved.

  • Parallel Library: The parallel package ships with R and lets us run tasks on several worker processes at once, which can significantly speed up computations that split into independent pieces.
  • PuTTY: PuTTY is a free SSH client for Windows. Its command-line tool, plink, lets R open secure connections to remote machines.
  • Future Package: The future package provides a unified interface for parallel and distributed evaluation in R. It can use a cluster of workers as its backend and assigns tasks to free workers automatically.
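As a quick orientation, the following sketch uses only the base parallel package to report how many cores the local machine exposes. The variable name n_cores is illustrative, and detectCores() may return NA on platforms where the count cannot be determined.

```r
library(parallel)

# Number of logical cores R can see on this machine (may be NA)
n_cores <- detectCores()

if (is.na(n_cores)) {
    message("Core count could not be determined on this platform")
} else {
    message("This machine reports ", n_cores, " logical cores")
}
```

The ncore values used for each machine later in the article should not exceed what that machine reports here.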

Setting Up the Environment


To start with parallel computing in R, we need to install and load the required packages. We’ll also set up our environment by configuring PuTTY and setting up SSH key authentication.
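Before creating a cluster, it may be worth confirming from R that plink can be found and that the private key file exists. The sketch below is only a sanity check under stated assumptions: the key path is the one used in the cluster code in this article, and the variable names plink_path and key_file are illustrative.

```r
# Locate the plink executable on the PATH; the result is "" when it is not found
plink_path <- unname(Sys.which("plink"))

# Private key file in PuTTY format (path assumed; adjust to your setup)
key_file <- "C:/Users/james/.ssh/putty.ppk"

if (!nzchar(plink_path)) {
    message("plink was not found on the PATH; install PuTTY or add it to the PATH")
} else if (!file.exists(key_file)) {
    message("plink found at ", plink_path, " but the key file is missing: ", key_file)
} else {
    message("plink and the key file both appear to be in place")
}
```

This check does not test the SSH login itself; it only catches the most common setup mistakes early.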

Installing Required Packages

# Install required packages
install.packages("future")

Loading Packages

# Load the necessary packages
library(future)
library(parallel)

Creating a Cluster of Machines


To create a cluster of machines, we’ll use the makeClusterPSOCK function from the parallelly package, which is installed as a dependency of future. It starts PSOCK workers, which talk to the main R session over sockets; the SSH client (here plink) is used to launch the remote workers and, by default, to tunnel the connection. We’ll provide the machine addresses, repeated once per core, along with the user name.

Sample Code

# Define machine addresses and user information
primary <- '171.27.27.190'
machineAddresses <- list(
    list(host = primary, user = 'james', ncore = 2),
    list(host = '173.29.50.45', user = 'james', ncore = 4)
)

# Repeat each host name once per core so that one worker starts per core
spec <- unlist(lapply(machineAddresses, function(machine) {
    rep(machine$host, times = machine$ncore)
}))

# Create a cluster of machines (both machines use the same user name)
cl <- parallelly::makeClusterPSOCK(
    workers = spec,
    port = 11671,
    user = 'james',
    rshcmd = c("plink", "-ssh", "-i", "C:/Users/james/.ssh/putty.ppk"),
    homogeneous = FALSE,
    verbose = TRUE
)
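Before connecting to remote machines, it can help to try the same workflow on a local cluster, where no SSH setup is involved. The following is a minimal sketch that assumes only the base parallel package; the names local_cl and worker_pids are illustrative.

```r
library(parallel)

# Start two R worker processes on the local machine (no SSH involved)
local_cl <- makePSOCKcluster(2)

# Ask every worker for its process ID to confirm that both are alive
worker_pids <- unlist(clusterCall(local_cl, Sys.getpid))
print(worker_pids)

# Always shut the workers down when finished
stopCluster(local_cl)
```

If this works locally but the remote cluster does not start, the problem most likely lies in the SSH connection rather than in the R code.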

Running Tasks in Parallel


Once we have created the cluster, we can run a task on every worker with the clusterCall function from the parallel package. We’ll provide a simple R function that each worker executes and whose result is returned to the main session.

Sample Code

# Define a task to be executed on each worker
task <- function() {
    # Report which machine and R process ran this call
    paste(Sys.info()[["nodename"]], Sys.getpid())
}

# Run the task once on every worker and collect the results
results <- clusterCall(cl, task)
print(results)
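Running one function on every worker is useful as a check, but real workloads usually split a set of inputs across the workers. A minimal sketch of that pattern with parLapply from the parallel package follows; it uses a local two-worker cluster as a stand-in for the remote machines, and the square function is a hypothetical workload.

```r
library(parallel)

# Local stand-in for the remote cluster
local_cl <- makePSOCKcluster(2)

# Hypothetical workload: square each input value
square <- function(x) x^2

# parLapply splits the inputs into chunks and sends one chunk to each worker
squares <- unlist(parLapply(local_cl, 1:8, square))
print(squares)

stopCluster(local_cl)
```

The same call should work with a cluster object that points at remote machines, since parLapply only needs a cluster to dispatch to.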

Verbose Output and Task Allocation


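The future package provides additional features like verbose output and automatic task allocation. Verbose messages are switched on with the future.debug option, and registering the cluster with plan lets the package send each future to a free worker.

Sample Code

# Turn on verbose debug messages from the future package
options(future.debug = TRUE)

# Register the cluster so that futures go to free workers automatically
plan(cluster, workers = cl)

# Submit the task as a future and collect its value
f <- future(task())
value(f)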
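The idea of automatic task allocation can be illustrated without any remote machines. The sketch below uses clusterApplyLB from the base parallel package, chosen here as a stand-in for illustration rather than as the future package's own scheduler. The delays vector is a hypothetical workload with uneven run times.

```r
library(parallel)

# Local stand-in for the remote cluster
local_cl <- makePSOCKcluster(2)

# Hypothetical tasks with uneven run times (in seconds)
delays <- c(0.3, 0.1, 0.1, 0.1)

# clusterApplyLB hands the next task to whichever worker becomes free first
finished_by <- unlist(clusterApplyLB(local_cl, delays, function(d) {
    Sys.sleep(d)
    Sys.getpid()
}))

# Count how many tasks each worker process completed
print(table(finished_by))

stopCluster(local_cl)
```

With uneven tasks like these, the split between workers depends on timing, so the counts can differ from run to run.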

Conclusion


In this article, we have explored how to use the parallel and future packages in R to run code on a cluster of machines using PuTTY and SSH. We’ve covered setting up the environment, creating a cluster of machines, running tasks in parallel, and configuring verbose output and task allocation. With these tools, you can speed up computations that divide into independent tasks by spreading them across your machine cluster.

Step-by-Step Solution


Here is the complete code for the step-by-step solution:

# Install required packages
install.packages("future")

# Load the necessary packages
library(future)
library(parallel)

# Define machine addresses and user information
primary <- '171.27.27.190'
machineAddresses <- list(
    list(host = primary, user = 'james', ncore = 2),
    list(host = '173.29.50.45', user = 'james', ncore = 4)
)

# Repeat each host name once per core so that one worker starts per core
spec <- unlist(lapply(machineAddresses, function(machine) {
    rep(machine$host, times = machine$ncore)
}))

# Create a cluster of machines (both machines use the same user name)
cl <- parallelly::makeClusterPSOCK(
    workers = spec,
    port = 11671,
    user = 'james',
    rshcmd = c("plink", "-ssh", "-i", "C:/Users/james/.ssh/putty.ppk"),
    homogeneous = FALSE,
    verbose = TRUE
)

# Define a task to be executed on each worker
task <- function() {
    # Report which machine and R process ran this call
    paste(Sys.info()[["nodename"]], Sys.getpid())
}

# Run the task once on every worker and collect the results
results <- clusterCall(cl, task)
print(results)

# Turn on verbose debug messages from the future package
options(future.debug = TRUE)

# Register the cluster so that futures go to free workers automatically
plan(cluster, workers = cl)

# Submit the task as a future and collect its value
f <- future(task())
value(f)

# Reset the plan and shut down the workers when finished
plan(sequential)
stopCluster(cl)

This code creates a cluster of machines using PuTTY and SSH, runs a simple task on each worker, and provides verbose output for easier debugging.


Last modified on 2023-12-10