WOLFRAM COMMUNITY

Using VCPUCount in AWS RemoteBatchSubmit

John Snyder

Posted 4 years ago

I've been using RemoteBatchSubmit on AWS and I am very happy with this new feature. What I don't understand is why my jobs seem to get stuck in Runnable if I request VCPUCount->96. I think the default permissions don't typically allow a vCPU count this high. Do I need to request a higher vCPU limit from AWS? I notice that in the AWS console, on the EC2 Dashboard under Limits, you can request higher vCPU limits. I don't understand all these different limits well enough to know for which, if any, I should request an increase. For example, on page 5 of the Dashboard under "Running On-Demand All Standard (A, C, D, H, I, M, R, T, Z) Instances" I can request a higher vCPU limit. Is this what I need to do, or is there some other reason high-vCPU-count jobs seem to get stuck in Runnable? In general, how can I get VCPUCount->96 jobs to run on AWS?

POSTED BY: John Snyder

4 Replies

John Snyder

Posted 4 years ago

Thanks Jesse. I submitted a request and Amazon increased my maximum vCPUs to 164. Now I can run jobs in the Wolfram batch stack using 96 vCPUs. But this has raised yet another question: can I use all of the 164 vCPUs Amazon has allotted to me? I tried rebuilding the batch stack with the maximum number of vCPUs in the Wolfram template set to 164. Unfortunately, I found that a test job using 128 vCPUs would not move out of Runnable status even after an hour, so I killed the job. Is it possible for me to use more than 96 vCPUs, or does something in Wolfram's setup prevent the use of more than 96 vCPUs in any event? If it is possible to use more than 96 vCPUs, how do I set up the batch stack to allow it? Thanks!

POSTED BY: John Snyder

Jesse Friedman, Wolfram Research

Posted 4 years ago

A single batch job (submitted with RemoteBatchSubmit) runs on a single compute instance, so it's limited to the vCPU count of the largest available instance type. For the `c5`, `r5`, and `m5` families, this is currently the `24xlarge`-size instance types with 96 vCPUs.* To use more vCPUs concurrently, you can submit multiple single batch jobs to run at the same time. If the "Maximum vCPUs" template parameter is >= 192 and your account vCPU limit is high enough, then you can submit two jobs with `"VCPUCount" -> 96` and they will run concurrently on two `24xlarge`-size instances.

Array batch jobs (submitted with RemoteBatchMapSubmit), on the other hand, can take advantage of multiple running instances simultaneously by splitting a computation into several independent "child" jobs, in a similar manner to how ParallelMap distributes a series of computations across multiple processor cores. We've recently been running some research experiments with array batch jobs using around 900 active cores. Like ParallelMap, RemoteBatchMapSubmit effectively requires you to structure your program as a single, large Map operation.

* The `x1` and `x1e` families support up to 128 vCPUs, but these are specialty instance types with very large amounts of memory and so have a much higher per-vCPU cost than the more general-purpose `c5`, `r5`, and `m5` families. You can see the full list of instance types (not all of which are usable with AWS Batch) in the Instance Types section of the EC2 console or on the EC2 pricing page.
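A minimal sketch of the two approaches described in this reply, for illustration only. The ARNs and bucket name in the environment are hypothetical placeholders for the outputs of your own batch stack, and the prime-counting workload is an arbitrary stand-in for a real computation. It assumes the "Maximum vCPUs" template parameter and the account quota are both at least 192.

```wolfram
(* Hypothetical placeholders: replace with the outputs of your own CloudFormation stack. *)
env = RemoteBatchSubmissionEnvironment["AWSBatch", <|
    "JobQueue" -> "arn:aws:batch:us-east-1:123456789012:job-queue/ExampleJobQueue",
    "JobDefinition" -> "arn:aws:batch:us-east-1:123456789012:job-definition/ExampleJobDefinition:1",
    "IOBucket" -> "example-io-bucket"
  |>];

(* Approach 1: two independent single jobs, each requesting a full 96-vCPU instance.
   Each job runs on its own instance; the expression is an arbitrary stand-in workload. *)
singleJobs = Table[
   RemoteBatchSubmit[env,
     ParallelTable[PrimePi[10^9 + i], {i, 96}],
     RemoteProviderSettings -> <|"VCPUCount" -> 96|>],
   {2}];

(* Check where each job is in the queue, e.g. "Runnable" or "Running". *)
#["JobStatus"] & /@ singleJobs

(* Approach 2: one array job. The computation is written as a single Map of a
   function over a list, and each element is handled by an independent child job. *)
arrayJob = RemoteBatchMapSubmit[env, PrimePi, 10^9 + Range[192]];
arrayJob["JobStatus"]
```

In the first approach the work inside each job still has to be parallelized explicitly (here with ParallelTable), while in the second the batch system distributes the list elements across however many instances the compute environment is allowed to start.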

POSTED BY: Jesse Friedman

Tsai Ming-Chou, National Defense Medical Center

Posted 3 years ago

This is my first attempt to run code on AWS. The job I run locally is the following code:

`deviceS=Flatten@ParallelTable[dection@deviceS[[i]], {i, 1, Length@deviceS}];`

I rewrote it in the following way, hoping to execute it on AWS:

`deviceS=RemoteBatchMapSubmit[env, Flatten@ParallelTable[dection@deviceS[[i]], {i, 1, Length@deviceS}], RemoteProviderSettings -> <|"VCPUCount" -> 8, "Memory" -> Quantity[32, "Gibibytes"]|>, LicensingSettings -> <|Method -> "OnDemand"|>];`

There has been no response for a long time. Can you help me see what is wrong? Also, my original data (deviceS) is a huge file (I saved it as an .mx file). Would it be better to upload it to AWS first?
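One possible restructuring, offered as a hedged sketch rather than a confirmed fix: RemoteBatchMapSubmit takes a function and a list as separate arguments (as noted elsewhere in this thread, the program has to be structured as a single Map), whereas the call above passes an already-written ParallelTable expression in the function position and no list. The definitions of `env`, `dection`, and `deviceS` below are hypothetical stand-ins, and the `"EvaluationResult"` property name in the final comment is an assumption to check against the documentation.

```wolfram
(* Hypothetical stand-ins so the sketch is self-contained; use your own definitions. *)
env = RemoteBatchSubmissionEnvironment["AWSBatch", <|
    "JobQueue" -> "arn:aws:batch:us-east-1:123456789012:job-queue/ExampleJobQueue",
    "JobDefinition" -> "arn:aws:batch:us-east-1:123456789012:job-definition/ExampleJobDefinition:1",
    "IOBucket" -> "example-io-bucket"
  |>];
dection[x_] := {x, x^2};   (* placeholder for the real per-element function *)
deviceS = Range[100];      (* placeholder for the real data *)

(* Pass the function and the list separately; each element becomes a child job.
   The option values are taken unchanged from the post above. *)
job = RemoteBatchMapSubmit[env, dection, deviceS,
   RemoteProviderSettings -> <|"VCPUCount" -> 8, "Memory" -> Quantity[32, "Gibibytes"]|>,
   LicensingSettings -> <|Method -> "OnDemand"|>];

(* The call returns a job object immediately, not the results, so poll the status. *)
job["JobStatus"]

(* Once the job has succeeded, collect and flatten the results.
   The property name "EvaluationResult" is an assumption, not confirmed by this thread. *)
(* result = Flatten[job["EvaluationResult"]] *)
```

Note that the submission returns a job object rather than the computed list, so assigning it directly to `deviceS` would overwrite the input data with the job handle.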

POSTED BY: Tsai Ming-Chou

Jesse Friedman, Wolfram Research

Posted 4 years ago

Hi John, I think you're likely on the right track looking at EC2 quotas. Assuming you left the "Available instance types" setting in the CloudFormation template at the default value "`c5, m5, r5, p3`", the "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) Instances" quota you found is indeed what will limit the number of concurrent instances (measured in terms of vCPUs) that can run out of the `c5`, `m5`, and `r5` instance type families. (`p3` is for GPU computation and has its own quota.) If that quota setting is below 96, you won't be able to start a 96-core instance (`[c5,m5,r5].24xlarge` types), so your `"VCPUCount" -> 96` jobs won't get launched. You can request a quota increase in the AWS console on the page for that quota (direct link). In my experience AWS processes quota increase requests very quickly, often within minutes - I think the process is partially automated. Let me know if this doesn't solve your problem.
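As a rough way to check whether a quota increase has taken effect, one could submit a trivial 96-vCPU test job and watch its status. This is only a sketch: the environment values are hypothetical placeholders for the outputs of your own batch stack, and the test expression is an arbitrary choice.

```wolfram
(* Hypothetical placeholders: replace with the outputs of your own CloudFormation stack. *)
env = RemoteBatchSubmissionEnvironment["AWSBatch", <|
    "JobQueue" -> "arn:aws:batch:us-east-1:123456789012:job-queue/ExampleJobQueue",
    "JobDefinition" -> "arn:aws:batch:us-east-1:123456789012:job-definition/ExampleJobDefinition:1",
    "IOBucket" -> "example-io-bucket"
  |>];

(* A trivial job that only reports the processor count seen on the remote instance. *)
testJob = RemoteBatchSubmit[env, $ProcessorCount,
   RemoteProviderSettings -> <|"VCPUCount" -> 96|>];

(* Re-evaluate periodically. A job that stays in "Runnable" is still waiting for an
   instance to start, which is what a vCPU quota below 96 would produce. *)
testJob["JobStatus"]
```

If the status moves on from Runnable to Running shortly after the quota increase is granted, the quota was the limiting factor.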

POSTED BY: Jesse Friedman
