How to use Mathematica in a high-performance computing (HPC) environment

Posted 5 years ago

I'm going to show you how you can use existing functionality to run a Mathematica script across a managed HPC cluster. Before I start, I must be upfront with you: though the individual commands are documented, this method as a whole is not. Support for this procedure is therefore outside the scope of Wolfram Technical Support. However, I'm hoping that once the groundwork has been laid, Wolfram Community members can work together to fill in the missing details.

My assumptions:

  1. Mathematica is installed and properly licensed on the managed cluster
  2. once your job has been given resources, you can freely SSH between them

(1) This is up to your local cluster's system administrator to figure out by talking with their organization and a Wolfram sales representative, and possibly Wolfram Technical Support (support.wolfram.com). (2) Again, this is up to your local sysadmin to ensure; in practice it usually means a public/private SSH key pair shared between nodes.

In the following, I'm assuming the cluster uses Torque (Torque SysAdmin Guide), but in principle other managers can be used. A generic Mathematica script job submission may look like the following:

#PBS -N Job_name
#PBS -l walltime=10:30
#PBS -l nodes=4:ppn=6
#PBS -m be

math -script hpc.wl

In this example,

  • the job is called "Job_name"
  • the job is allotted a maximum wall time of 10 minutes and 30 seconds
  • it is requesting 4 nodes with 6 processors-per-node, for a total of 24 resources (CPU cores)
  • an email will be sent to the account associated with the username when the job (b)egins and when it (e)nds

If you are not familiar with job submissions to a managed HPC cluster, then I suggest you read any guides provided by your organization.

The Wolfram Language script "hpc.wl" does the rest of the work. It generically follows this order:

  1. gather the environment variables associated with the list of provided resources
  2. launch remote subkernels for each CPU core
  3. do the parallel computations
  4. close the subkernels
  5. end the job

    (*get association of resources, name of local host, and remove local host from available resources*)
    hosts = Counts[ReadList[Environment["PBS_NODEFILE"], "String"]];
    local = First[StringSplit[Environment["HOSTNAME"],"."]];
    hosts[local]--;
    
    (*launch subkernels and connect them to the controlling Wolfram Kernel*)
    Needs["SubKernels`RemoteKernels`"];
    Map[If[hosts[#] > 0, LaunchKernels[RemoteMachine[#, hosts[#]]]]&, Keys[hosts]];
    
    (* ===== regular Wolfram Language code goes here ===== *)
    Print[ {$MachineName, $KernelID} ]
    (* ===== end of Wolfram Language program ===== *)
    
    CloseKernels[];
    Quit[]
    

On Torque there is the environment variable "PBS_NODEFILE" (Torque environment variables) that lists the different nodes provided to the job. It is my understanding that each node's name is repeated once per CPU core. That's why a simple Counts of the node list tells us everything. The other piece of information, which is probably not necessary, is "HOSTNAME": this is where the controlling Wolfram Kernel is running. In the code above, we remove it from the list of available resources, but I don't believe this is necessary. According to the documentation ([3]), this may be known as "PBS_O_HOSTNAME".
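To make the counting concrete, here is a plain-shell sketch of what the node file looks like and what the Counts call computes. The file name and node names below are made up for illustration:

```shell
# Build a sample node file shaped like Torque's $PBS_NODEFILE:
# one line per granted CPU core, so a node's name repeats once per core.
cat > sample_nodefile <<'EOF'
node01
node01
node01
node02
node02
EOF

# Counting the repeated lines mirrors what Counts[ReadList[...]] does
# in the Wolfram Language script: cores per node, keyed by node name.
sort sample_nodefile | uniq -c
# prints counts like "3 node01" and "2 node02"
```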

The Mathematica script should not need to change, save for the code between the commented lines. I'm also assuming that $RemoteCommand (provided by SubKernels`RemoteKernels`) is the same on each node. This is usually the case, as most clusters use a cloned file system.

SLURM should be very similar except that the environment variables will be different. It is my understanding that

    headNode = Environment["SLURMD_NODENAME"];
    nodes = ReadList["!scontrol show hostname $SLURM_NODELIST",String]; 

provides the headnode and list of resources.
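For reference, a hypothetical SLURM submission script mirroring the Torque example above might look like the following. The directive values are my assumptions carried over from the earlier example; check them against your site's documentation:

```shell
#!/bin/bash
# Hypothetical SLURM counterpart of the Torque script above (untested sketch):
# same job name, wall time, node count, tasks per node, and begin/end mail.
#SBATCH --job-name=Job_name
#SBATCH --time=00:10:30
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=6
#SBATCH --mail-type=BEGIN,END

math -script hpc.wl
```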

I encourage discussion.

13 Replies

Congratulations! This post is now a Staff Pick! Thank you for your wonderful contributions. Please, keep them coming!

I can't comment on how the Mathematica book (do you mean Elementary Introduction to the Wolfram Language?) presents how to use the command line, but yes, this method has been around since Mathematica's inception.

I have seen the method I presented above carried out on a SLURM-managed cluster, and I used a Torque/PBS-managed cluster in my previous employment. From a user's point of view I am familiar with how managed clusters work. However, I have not used all cluster managers, which is why I requested community support. My goal was to have instructions on how to use Mathematica with the most common cluster managers, documented in one place.

As a follow up, one common-use scenario is to

  1. log in to the cluster's head node with X-windows forwarding enabled
  2. submit an "interactive session" to the cluster
  3. wait for the resources to be provided and control of the terminal returned to the client
  4. launch a Mathematica front end

Keep in mind, though, that all commands must be sent back-and-forth from the remote cluster to your screen. This can have a pretty annoying delay. The benefit is that most clusters have nodes with more CPU cores than your desktop, and each CPU core may be more powerful than those in your desktop.

The downside (besides the lag) is that if you've requested multiple nodes, then you don't know which node is acting as the head node. And if you use LaunchKernels[], then only local kernels (on the head node) will launch. Because interactive jobs are intended as debugging sessions, one approach is to simply request a single node.
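For illustration, single-node interactive requests might look like the following. The exact flags are assumptions on my part and vary by site; --x11 in particular requires SLURM's X11 support to be enabled:

```shell
# Torque/PBS: interactive (-I) session with X forwarding (-X) on one node
qsub -I -X -l nodes=1:ppn=8,walltime=01:00:00

# SLURM: interactive shell with X11 forwarding on one node
srun --x11 --nodes=1 --ntasks-per-node=8 --time=01:00:00 --pty bash
```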

Alternatively, you can query the environment variables of the job session and launch remote kernels like I showed earlier.

People who are looking at this might find this independent implementation, made for SGE and a specific HPC cluster, useful: https://bitbucket.org/szhorvat/crc/src

An issue specific to this system was that ssh would not work, and rsh (a specific version of rsh!) had to be used.

Posted 1 year ago

Hi Kevin,

I'm using a Linux cluster at my institute. We recently installed gridMathematica on it, and I'm pretty sure the first assumption you mentioned is well satisfied. However, our cluster has a dedicated login node as well as about 40 dedicated computing nodes; we users are only allowed to SSH to the login node and don't have the privilege to SSH to the computing nodes, so the second assumption you mentioned is not satisfied in our case. The cluster is managed with SLURM, and MPI-based parallelization works well without needing free SSH among the nodes. So I just wonder if Mathematica has some mechanism to support large-scale parallelization when free SSH is not allowed.

Many thanks for your help!

Zhe

In practice, you only need to log in to the head node via SSH. Having no direct SSH access to the compute nodes from outside the cluster is a good thing!

My understanding is that once a job is running on the cluster, that job's resources (the compute nodes provided to the job) can freely communicate with one another. Given that MPI parallelization works on the cluster, I would be surprised if you could not also use my described method to run Mathematica on the SLURM cluster.

You first need to properly test SSH before going any further. I suggest starting an interactive session so you have access to the command line, but request more than one compute node. In the following, $> represents the terminal prompt. Echo the cluster manager's environment variables, e.g.

$> scontrol show hostname $SLURM_JOB_NODELIST
$> echo $SLURM_TASKS_PER_NODE

Then try a simple SSH command from your main node to one of the other compute nodes you've requested. Something like

$> ssh remote-compute-node-name pwd

If password-less SSH is available then this will display the remote compute node's working directory (via the 'pwd' command). If the nodes use a cloned file system then the working directory should be your home directory on the cluster. Resolve any issues before moving on.

Posted 1 year ago

Hi Kevin,

Thank you for your prompt reply! The attached file contains some of my previous test results. Since the cluster is very busy now, I will test the commands you suggest later on.

I confirmed with our cluster manager, and was told that normal users cannot SSH to the computing nodes, only root users can.

As to why MPI could still work in our cluster, I found some relevant information from the official website of SLURM: https://slurm.schedmd.com/quickstart.html.

It is stated that:

MPI use depends upon the type of MPI being used. There are three fundamentally different modes of operation used by these various MPI implementations.

Slurm directly launches the tasks and performs initialization of communications through the PMI2 or PMIx APIs. (Supported by most modern MPI implementations.)

Slurm creates a resource allocation for the job and then mpirun launches tasks using Slurm's infrastructure (older versions of OpenMPI).

Slurm creates a resource allocation for the job and then mpirun launches tasks using some mechanism other than Slurm, such as SSH or RSH. These tasks are initiated outside of Slurm's monitoring or control. Slurm's epilog should be configured to purge these tasks when the job's allocation is relinquished. The use of pam_slurm_adopt is also strongly recommended.

I think your implementation of Mathematica working with SLURM follows the third approach, while the MPICH we are using on the cluster follows the first approach and does not require SSH. Now I wonder if Mathematica could somehow be configured to work with SLURM using the first two approaches so that SSH is not necessary.
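For illustration, one direction I wonder about (untested; the flags and placeholder names here are my assumptions): inside a SLURM allocation, srun can start a process on a specific allocated node without SSH, so perhaps the ssh launch template used by SubKernels`RemoteKernels` could be replaced with something along these lines:

```shell
# Hypothetical replacement for the ssh launch step: start a Wolfram subkernel
# on one allocated node via srun instead of SSH. TARGET_NODE and LINKNAME are
# placeholders that the launching side would have to fill in.
srun --nodes=1 --ntasks=1 --nodelist="$TARGET_NODE" \
    wolfram -wstp -linkmode Connect -linkname "$LINKNAME" -subkernel -noinit
```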

Your further comments and suggestions are highly appreciated!

Best,

Zhe

Posted 4 months ago

The link you shared is broken. Could you share it again? It might be helpful to me.

Posted 4 months ago

Hi,

I was granted SSH access by the cluster administrator and successfully ran parallel computations through SLURM. The attached file is copied from my test notes at the time.

Note: at the time I was trying to solve this issue, Mathematica working with SLURM followed the approach "Slurm creates a resource allocation for the job and then mpirun launches tasks using some mechanism other than Slurm, such as SSH or RSH"; alternative approaches that do not require SSH access were not supported.

I think supporting approaches that work with SLURM without requiring SSH access would be a very nice addition to Mathematica's capabilities, as is already available in MATLAB.

Hopefully, this could be helpful to you.

Posted 3 months ago

@zhe duan, Hi, thanks for replying. I will take a look at the attachment and get back to you.

Posted 3 months ago

Kindly Help:

I access Mathematica (through MobaXterm), which is installed in my user area on a Torque-managed HPC cluster, to connect to the multiple nodes assigned to my job submitted to the queue.

PROBLEM

Following is the program to connect 2 nodes; each node has 40 cores.

Everything looks fine, except for the time taken by ParallelTable at the end of the program, which is much higher than the time taken on a single node (without launching remote kernels).

Note that the two allotted nodes, atulya095 and atulya049, each appear 40 times in the output.

PROGRAM

nodes=ReadList[Environment["PBS_NODEFILE"], "String"];
Print["alloted node are  ", nodes];    (*get association of resources, name of local host and remove local host from available resources*)
hosts = Counts[nodes];                                             
local = First[StringSplit[Environment["HOSTNAME"],"."]];
Print["local node is ", local];
hosts[local]--;
Needs["SubKernels`RemoteKernels`"];               
Map[If[hosts[#] > 0, LaunchKernels[RemoteMachine[#, "ssh -x -f -l `3` `1` wolfram  -wstp -linkmode Connect `4` -linkname '`2`' -subkernel -noinit", hosts[#]]]]&,    Keys[hosts]];                
Print["kernel count is  ",$KernelCount];           
Print[" machine name is  ", ParallelEvaluate[$MachineName]];
Print[" kernel id is  ",ParallelEvaluate[$KernelID]];
Print["processor count is  ",$ProcessorCount];
Print[AbsoluteTiming[ParallelTable[Exp[Sin[x]]^Sin[x],{x,0.1,200000,0.0001}];][[1]]];   
CloseKernels[];
Quit[];

OUTPUT

alloted node are  {atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049}
local node is atulya095
kernel count is  79
machine name is  {atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya095, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049, atulya049}
kernel id is  {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79}
processor count is  40
189.309181

I will be thankful to your help here.

My understanding is that $ProcessorCount returns the local number of available cores. The value of 40 indicates that on local node atulya095 there are 40 processors. We don't know how many processors are on atulya049, but I'm assuming 40 as well.

The reason your parallel calculation shows no speedup is that you are doing a relatively cheap calculation, Exp[Sin[x]]^Sin[x], while there is communication overhead in using ParallelTable. Each compute kernel sends information back and forth with the main kernel, and that communication takes more time than the calculation itself.

If you instead mimic an expensive calculation, like Pause[3], then you should see a noticeable speedup because the inter-kernel communication is no longer the bottleneck in the parallel computation.

Posted 3 months ago

Dear Kevin,

Thanks for your reply.

Yes, there are 40 cores on atulya049 as well.

Ok, I will do one heavy calculation and get back to you.

With thanks, Sachin.

PS: In your first post, you suggested removing the Wolfram controlling kernel from the list of available resources. So in my program, out of the 80 requested cores, 79 go into the calculation.

Could you tell us why do we need to remove the controlling Wolfram kernel ?

Could you tell us why do we need to remove the controlling Wolfram kernel ?

Removing it is not necessary. My thought was that the CPU core on which the main kernel runs would be too busy managing inter-kernel communication to do meaningful computation of its own.
