Hi Kevin,
Thank you for your prompt reply! The attached file contains some of my previous test results. The cluster is very busy at the moment, so I will test the commands you suggested later.
I checked with our cluster manager and was told that normal users cannot SSH into the compute nodes; only root can.
As for why MPI still works on our cluster, I found some relevant information on the official SLURM website: https://slurm.schedmd.com/quickstart.html.
It is stated that:
MPI use depends upon the type of MPI being used. There are three fundamentally different modes of operation used by these various MPI implementations.
1. Slurm directly launches the tasks and performs initialization of communications through the PMI2 or PMIx APIs. (Supported by most modern MPI implementations.)
2. Slurm creates a resource allocation for the job and then mpirun launches tasks using Slurm's infrastructure (older versions of OpenMPI).
3. Slurm creates a resource allocation for the job and then mpirun launches tasks using some mechanism other than Slurm, such as SSH or RSH. These tasks are initiated outside of Slurm's monitoring or control. Slurm's epilog should be configured to purge these tasks when the job's allocation is relinquished. The use of pam_slurm_adopt is also strongly recommended.
I think your implementation of Mathematica working with SLURM follows the third approach, while the MPICH build we use on the cluster follows the first approach and does not require SSH. I wonder whether Mathematica could be configured to work with SLURM using either of the first two approaches, so that SSH is not necessary.
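To make the first approach concrete, here is a minimal Python sketch of the kind of SSH-free launch I have in mind. It only builds the srun command line and checks whether it is running inside a SLURM job. The program name ./my_mpi_program and both helper names are hypothetical, and --mpi=pmi2 assumes that plugin is installed on the cluster (srun --mpi=list shows which ones are available). I have not tried this with Mathematica.

```python
import os
import shlex


def srun_command(program, ntasks, mpi_plugin="pmi2"):
    """Build an srun command line so that Slurm itself starts the tasks.

    `program` is a placeholder for whatever executable should run on the
    compute nodes; `mpi_plugin` is assumed to be available on the cluster.
    """
    if ntasks < 1:
        raise ValueError("ntasks must be at least 1")
    return ["srun", f"--mpi={mpi_plugin}", "-n", str(ntasks), program]


def inside_allocation(env=None):
    """Return True when running inside a Slurm job.

    Slurm sets SLURM_JOB_ID in the environment of every job step.
    """
    env = os.environ if env is None else env
    return "SLURM_JOB_ID" in env


if __name__ == "__main__":
    cmd = srun_command("./my_mpi_program", ntasks=4)
    # Prints: srun --mpi=pmi2 -n 4 ./my_mpi_program
    print(shlex.join(cmd))
    print("inside a Slurm allocation:", inside_allocation())
```

Because srun starts the tasks through SLURM's own daemons on the compute nodes, no SSH login from the user's side is involved, which is why this route works for normal users on our cluster.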
Any further comments or suggestions would be highly appreciated!
Best,
Zhe
Attachments: