Speed of Jobs on AWS with RemoteBatchSubmit

Posted 1 month ago
Hello,

I have been playing with the new remote batch computation capabilities in Mathematica and have come across speed issues that I do not understand. My calculation involves thousands of Fourier transforms on complex lists with tens of thousands of elements per "step".

It runs reasonably fast on my local computer with a seven year old mobile i7 (4 cores, 8 threads) but much slower on AWS. Also the speed on AWS seems to be affected by parameters that (in my understanding) should not affect it. To rule out that the jobs run on slower hardware, I created a stack with only "c5a" as allowed instance type.

All tests are performed with one master kernel and zero subkernels. On my local machine CPU load is 50 %, so the calculation uses all cores. First I tested the performance of sequential single jobs with RemoteBatchSubmit:

[Image: single-job speed results]

The achieved speeds are not great, but still something I could work with. Array batch jobs, however, are both more confusing and slower.

When running an array job (RemoteBatchMapSubmit, 16 evaluations, 16 jobs, 1 vCPU each), every child job sees just one CPU but 64 GB of memory. The speed started at 0.8 steps/s while only one child was running, but dropped to 0.05 steps/s once several were running concurrently. Should different child jobs influence each other like this?
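In outline, the array submission looked like this (simplified; runSteps stands in for my actual Fourier loop and env is the submission environment from my stack):

    (* simplified sketch: env is my AWS Batch submission environment,
       runSteps stands in for the actual per-evaluation Fourier workload *)
    job = RemoteBatchMapSubmit[env, runSteps, Range[16],
      RemoteProviderSettings -> <|"VCPUCount" -> 1|>]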

To test this, I started a single job with "VCPUCount" -> 2. This job reported $SystemMemory of 128 GB and ran at 0.9 steps/s. I then started a second, identical single job, and the speed of both running jobs dropped to 0.6 steps/s. After starting a third identical job, the speed of all three dropped to 0.3 steps/s.
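Schematically, the single-job variant of the test (again simplified, with runSteps standing in for the real workload):

    (* simplified: one single job requesting 2 vCPUs *)
    job = RemoteBatchSubmit[env, runSteps[],
      RemoteProviderSettings -> <|"VCPUCount" -> 2|>]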

Has anyone else experienced these issues, or can anyone explain what is happening? I would love to use this feature, but as of now there is no benefit; the calculation runs much faster locally.

Thank you!

3 Replies

Hi Stefan, I don't have an immediate explanation for this problem, so it would be great if you could provide an example of code that runs with the performance profile you've shown so that I can investigate in detail. In lieu of that, here are some loose notes about RemoteBatchSubmit that may help you investigate the performance issue yourself:

  • The "VCPUCount" provider setting does not directly determine the number of logical processor cores visible to the operating system inside a job. The "VCPUCount" setting has the following roles:
    • Job scheduling: AWS Batch's scheduler considers the vCPU count requirement specified in a job when it schedules that job to an EC2 instance. AWS Batch will not schedule an N-vCPU job to an instance with fewer than N vCPUs, and it will not schedule more jobs to an instance (based on their "VCPUCount" settings) than fit in the vCPU count of that instance.
    • OS scheduler: The vCPU count is used as a weighting input to the Linux kernel process scheduling algorithm (via Docker). For example, if a 12-vCPU job and a 4-vCPU job are assigned to a 16-vCPU instance, the Docker container holding the 12-vCPU job will get approximately 12/16 of the instance's total CPU time, while the 4-vCPU job will get approximately 4/16 of total CPU time. This does not mean that 12 distinct logical processor cores are dedicated to the 12-vCPU job, and if a process inside the job container inspects its environment it will see all 16 logical processor cores. (But not via $ProcessorCount, as explained below.) This is relevant to your question about multiple array child jobs influencing each other - if multiple child jobs (or non-array jobs, for that matter) are running on a single instance, they do share the same set of processor cores and the same OS scheduler, so they can influence each other's performance.
    • $ProcessorCount: The value of $ProcessorCount inside a job container is specifically overridden by the RemoteBatchSubmit system to reflect the job's "VCPUCount" setting. This is primarily so that parallel functions will auto-launch a number of subkernels equal to this setting. You can always override this yourself by calling LaunchKernels[n] with the number n of subkernels that you want to run. More on why you might want to do this later.
  • The "Memory" provider setting is treated similarly in some ways to "VCPUCount". Like "VCPUCount", "Memory" acts as an input to the AWS Batch scheduler (i.e. AWS Batch won't assign more jobs to an instance than fit in the physical memory of that instance). "Memory" also behaves as a hard limit on the memory consumption of a job - if a running job exceeds its memory limit, it will be killed by AWS Batch. However, unlike $ProcessorCount, $SystemMemory does not reflect the value of "Memory" within a job. That symbol returns the underlying EC2 instance's total memory.
  • Many EC2 instance types, including the c5a family you mention, have hyperthreading enabled by default. (You can confirm this in the EC2 Instance Types console - click an instance type and you'll see a "threads per core" value.) On such instances, one vCPU is equivalent to one hardware thread, with two hardware threads per physical core. This means that if you submit a job with e.g. "VCPUCount" -> 16, and this job gets scheduled to a 16-vCPU instance (hence a 100% share of CPU time), the processes within it (1 master kernel + 16 subkernels) will technically be contending for only 8 physical processor cores. If your workload is highly CPU-bound, this may have a negative performance impact. For parallelized code, an easy workaround is to manually launch n/2 parallel subkernels for your "VCPUCount" -> n job by calling LaunchKernels[n/2] at the beginning of your job code, before any parallelization functions. (For RemoteBatchMapSubmit jobs, put this in the Initialization option.) This doesn't affect hyperthreading, but limits the number of parallel subkernels to the number of physical processor cores. If you want to completely disable hyperthreading on the instance you can do so by modifying the instance launch template, but I believe doing so effectively limits you to using a single instance type per compute environment. I can give instructions on this if needed.
  • The c5a instance family is based on AMD processors. The Wolfram Language kernel uses optimized libraries for certain numerics functionality, some of which perform better on Intel processors. I don't have the knowledge to say whether this is relevant for your Fourier transform-centric workload, but if you haven't already you might try some jobs on Intel-based c5 instances to compare.
  • You can monitor per-instance overall CPU utilization from the AWS console via CloudWatch - see instructions here.
  • For finer-grained metrics, you can connect to an instance in your compute environment via SSH. To do this, your compute environment must be created with an EC2 SSH key pair assigned - there's an option to specify a key pair at the bottom of the CloudFormation template. Once your CloudFormation stack has been created, find the WolframEC2SecurityGroup security group listed as an output of the stack and edit it to allow ingress on TCP port 22 for SSH. Then, after submitting a job and waiting for it to transition to the "Running" state, you can find the running instance in your EC2 console, copy the hostname, and connect to it using an SSH client with username ec2-user. Once connected to the instance, you can view the process list with top or see the resource utilization of running Docker containers (each corresponding to a job) with docker stats.
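To make the hyperthreading workaround above concrete, here is a minimal sketch for a "VCPUCount" -> 16 job (recall that $ProcessorCount inside the job is overridden to 16, so $ProcessorCount/2 corresponds to the physical core count):

    (* at the very start of the job code, before any parallel functions: *)
    LaunchKernels[$ProcessorCount/2]  (* one subkernel per physical core *)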

Regarding the slowdowns you saw with multiple concurrent jobs, it occurred to me that although $ProcessorCount is overridden to reflect the "VCPUCount" of each job, the processor counts used for internal multithreaded computation will still reflect the host instance's total physical core count. This would cause excessive contention with multiple concurrent jobs, where e.g. an N-physical-core instance has X jobs each running N software threads, resulting in N * X total threads contending for only N physical cores. You can change this internal processor count by evaluating at the start of your job code (or for array batch jobs, in the Initialization option):

SetSystemOptions["ParallelOptions" -> {
    "MKLThreadNumber" -> $ProcessorCount,
    "ParallelThreadNumber" -> $ProcessorCount
}]
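For array batch jobs, the same settings can go in the Initialization option, along these lines (env, f, and inputs are stand-ins):

    RemoteBatchMapSubmit[env, f, inputs,
     Initialization :> SetSystemOptions["ParallelOptions" -> {
         "MKLThreadNumber" -> $ProcessorCount,
         "ParallelThreadNumber" -> $ProcessorCount}]]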

Let me know if this has any effect for your application. This is probably something RemoteBatchSubmit should be doing automatically; I will look into having it do so.

Posted 1 month ago

Hi Jesse, thank you very much for the detailed description and the suggestions. I am working on a different project right now, but will try out your points as soon as I get to work on this some more.

I think setting the thread numbers might do the trick, as my calculation is not using any subkernels, but clearly the Mathematica functions I use are using multiple threads (e.g. 4 threads on my local machine with 4 physical cores). I will test this as well as comparing the speed with Intel based instances and limiting the threads to half the hardware threads to achieve one thread per physical core.
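For reference, the current values of these settings can be checked with:

    SystemOptions["ParallelOptions"]
    (* lists "MKLThreadNumber", "ParallelThreadNumber", and related settings *)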

A lot of what you wrote here would be good to have in the documentation. Even though I read early on that vCPUs do not correspond to physical cores, it still took a while to understand how all the numbers are connected. For example, it was not clear to me in the beginning whether the vCPU count given to RemoteBatchMapSubmit applies per child job or to the whole array.

Thank you so much for your help!
