How to use Mathematica in a high-performance computing (HPC) environment


I'm going to show you how you can use existing functionality to run a Mathematica script across a managed HPC cluster. Before I start, I must be upfront with you: though the individual commands are documented, this method as a whole is not. Support for this procedure is therefore outside the scope of Wolfram Technical Support. However, I'm hoping that once the groundwork has been laid, Wolfram Community members can work together to fill in the missing details.

My assumptions:

  1. Mathematica is installed and properly licensed on the managed cluster
  2. once your job has been allocated resources, you can freely SSH between the allocated nodes

(1) This is up to your local cluster's System Admin to figure out by talking with their organization and a Wolfram Sales Representative, and possibly Wolfram Technical Support (support.wolfram.com). (2) Again, this is up to your local SysAdmin to ensure; in practice it means setting up passwordless SSH with a public/private key pair between the nodes.
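A quick way to check assumption (2) is to request a small interactive job and attempt a passwordless hop between two of the allocated nodes (the node name below is hypothetical):

    # should print the remote node's name without prompting for a password
    ssh node02 hostname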

In the following, I'm assuming the cluster uses Torque (Torque SysAdmin Guide), but in principle other managers can be used. A generic Mathematica script job submission may look like the following:

#!/bin/bash
#PBS -N Job_name
#PBS -l walltime=10:30
#PBS -l nodes=4:ppn=6
#PBS -m be

# Torque starts jobs in $HOME; move to the directory the job was submitted from
cd "$PBS_O_WORKDIR"
math -script hpc.wl

In this example,

  • the job is called "Job_name"
  • the job is given a wall-clock limit of 10 and a half minutes (Torque reads walltime as [[HH:]MM:]SS)
  • it requests 4 nodes with 6 processors per node, for a total of 24 CPU cores
  • an email will be sent to the account associated with the username when the job (b)egins and when it (e)nds

If you are not familiar with job submissions to a managed HPC cluster, I suggest you read any guides provided by your organization.
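For reference, if the script above were saved as, say, submit.pbs (the file name is arbitrary), it would be handed to the scheduler with:

    qsub submit.pbs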

The Wolfram Language script "hpc.wl" does the rest of the work. It generically follows this order:

  1. gather the environment variables associated with the list of provided resources
  2. launch remote subkernels for each CPU core
  3. do the parallel computations
  4. close the subkernels
  5. end the job

    (*get association of resources, name of local host, and remove local host from available resources*)
    hosts = Counts[ReadList[Environment["PBS_NODEFILE"], "String"]];
    local = First[StringSplit[Environment["HOSTNAME"], "."]];
    hosts[local]--; (*reserve one core for the controlling kernel*)
    
    (*launch subkernels and connect them to the controlling Wolfram Kernel*)
    Needs["SubKernels`RemoteKernels`"];
    Map[If[hosts[#] > 0, LaunchKernels[RemoteMachine[#, hosts[#]]]] &, Keys[hosts]];
    
    (* ===== regular Wolfram Language code goes here ===== *)
    ParallelEvaluate[Print[{$MachineName, $KernelID}]] (*each subkernel reports its host and kernel ID*)
    (* ===== end of Wolfram Language program ===== *)
    
    CloseKernels[];
    Quit[]
    

On Torque, the environment variable "PBS_NODEFILE" (Torque environment variables) names a file listing the different nodes that are provided to the job. It is my understanding that each node's name is repeated once per CPU core allocated on it; that is why a simple Counts of the node list tells us everything we need. The other piece of information, which is probably not necessary, is "HOSTNAME": this is where the controlling Wolfram Kernel is running. In the code above we decrement its core count by one so the controlling kernel has a core to itself, but I don't believe this is strictly necessary. According to the Torque documentation, this value may also be available as "PBS_O_HOSTNAME".
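To make this concrete: for a hypothetical request of nodes=2:ppn=3, the file named by PBS_NODEFILE would contain something like

    node01
    node01
    node01
    node02
    node02
    node02

and Counts would turn that list into the association <|"node01" -> 3, "node02" -> 3|>, i.e. a per-node core count.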

The Mathematica script should not need to change, save for the code between the commented lines. I'm also assuming that $RemoteCommand (provided by SubKernels`RemoteKernels`) is the same on each node. This is usually the case, as most clusters use a cloned file system.
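If the kernel binary is not on the PATH of a non-interactive SSH session, one workaround is to override $RemoteCommand with an explicit path. This is only a sketch: the installation path is hypothetical, and I'm relying on the placeholder conventions from the SubKernels`RemoteKernels` documentation, where `1` is the remote host, `2` the link name, `3` the user name, and `4` the link protocol options:

    Needs["SubKernels`RemoteKernels`"];
    (*hypothetical kernel location; adjust to your cluster's installation*)
    $RemoteCommand = "ssh -x -f -l `3` `1` /usr/local/Wolfram/Mathematica/13.2/Executables/wolfram -wstp -linkmode Connect `4` -linkname '`2`' -subkernel";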

SLURM should be very similar, except that the environment variables will be different. It is my understanding that

    headNode = Environment["SLURMD_NODENAME"];
    nodes = ReadList["!scontrol show hostname $SLURM_NODELIST", String];

provide the head node and the list of allocated nodes.
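Building on that, here is a minimal sketch of how the resource-gathering step of "hpc.wl" might be adapted for SLURM. Unlike PBS_NODEFILE, scontrol lists each node only once, so the core count must come from elsewhere; the sketch assumes a homogeneous allocation and reads it from SLURM_CPUS_ON_NODE:

    (*SLURM analogue of the resource-gathering step; assumes every node provides the same number of cores*)
    headNode = Environment["SLURMD_NODENAME"];
    nodes = ReadList["!scontrol show hostname $SLURM_NODELIST", String];
    cpusPerNode = ToExpression[Environment["SLURM_CPUS_ON_NODE"]];
    hosts = AssociationMap[cpusPerNode &, nodes];
    hosts[headNode]--; (*reserve one core for the controlling kernel*)

The rest of the script (launching, computing, and closing the subkernels) should carry over unchanged.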

I encourage discussion.

2 Replies

Congratulations! This post is now a Staff Pick! Thank you for your wonderful contributions. Please, keep them coming!

I can't comment on how the Mathematica book (do you mean An Elementary Introduction to the Wolfram Language?) presents using the command line, but yes, this method has been around since Mathematica's inception.

I have seen the method presented above carried out on a SLURM-managed cluster, and I used a Torque/PBS-managed cluster at my previous employment. From a user's point of view, I am familiar with how managed clusters work. However, I have not used every cluster manager, which is why I asked for community support. My goal is to have instructions for using Mathematica with the most common cluster managers documented in one place.

As a follow-up, one common-use scenario (sketched just after the list below) is to

  1. log in to the cluster's head node with X11 forwarding enabled
  2. submit an "interactive session" to the cluster
  3. wait for the resources to be provided and control of the terminal returned to the client
  4. launch a Mathematica front end
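On a Torque cluster, steps 1, 2, and 4 might look like the following (the host name and resource values are illustrative):

    # log in with X11 forwarding (hypothetical host)
    ssh -X username@cluster.example.org
    # request an interactive session, forwarding X11 from the compute node
    qsub -I -X -l nodes=1:ppn=6,walltime=2:00:00
    # once a prompt appears on the compute node, start the front end
    mathematica &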

Keep in mind, though, that all graphics and commands must travel back and forth between the remote cluster and your screen, which can introduce a pretty annoying delay. The benefit is that most clusters have nodes with more CPU cores than your desktop, and each core may be more powerful than those in your desktop.

The downside (besides the lag) is that if you've requested multiple nodes, you don't know in advance which node will act as the head node. And if you simply call LaunchKernels[], only the local kernels (on the head node) will launch. Because interactive jobs are intended as debugging sessions, one approach is to just request a single node.

Alternatively, you can query the environment variables of the job session and launch remote kernels as I showed earlier.
