RemoteBatchSubmit job on AWS with GPUs stuck "Runnable"

Posted 5 years ago

I was positively surprised by RemoteBatchSubmit in v12.2: finally a productised, supported way to run jobs on cloud platforms! I have been waiting for this for a long time.

Initial trials went well: the provided workflow guide was easy to follow and my first tests worked nicely. Unfortunately, I was less thrilled when, after repeated attempts, I was unable to reproduce the example in the 12.2 release announcement blog post:

RemoteBatchSubmit[env,
 NetTrain[NetModel["LeNet"], "MNIST", TargetDevice -> "GPU"],
 RemoteProviderSettings -> <|"GPUCount" -> 1|>]

No matter how I attempt this (I've created environments in multiple AWS regions, including us-east-1 as used in the blog post, and tinkered with the included instance types, all to no avail), adding the "GPUCount" -> 1 setting leaves the job's "JobStatus" property permanently stuck at "Runnable" (until aborted, of course).

I believe the ability to perform GPU-based training jobs on AWS is a major attraction for Mathematica 12.2 users. Please provide a working example where these jobs don't get stuck in the queue forever...

POSTED BY: Jari Kirma
13 Replies

Great, glad that got it working for you!

Regarding the issue with larger instances, is it possible you're running into the EC2 instance type quota on your account? My recollection is that the default P-type instance quota is fairly low; possibly not enough to launch the 8-GPU p3.16xlarge instance type. Check the "Running On-Demand P instances"* quota in the AWS Service Quotas console. (This deep link may or may not get you to the right page.) The units here are vCPUs; the p3.16xlarge type has 64 vCPUs, so if your quota is less than that you won't be able to launch that instance type. You can submit a quota increase request from the same page; in my experience they're processed extremely quickly.
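For what it's worth, the quota can also be checked from the AWS CLI - a sketch, assuming configured credentials; the L-417A185B quota code is my assumption for the "Running On-Demand P instances" limit, so verify it with the first command:

```shell
# List the P-instance-related quotas for EC2 to confirm the quota code.
aws service-quotas list-service-quotas --service-code ec2 \
    --query "Quotas[?contains(QuotaName, 'P instances')].[QuotaCode,QuotaName,Value]" \
    --output table

# Fetch the current value (in vCPUs). A p3.16xlarge needs 64.
aws service-quotas get-service-quota \
    --service-code ec2 --quota-code L-417A185B \
    --query "Quota.Value"
```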

The AWS support article you referenced earlier does contain some fairly buried instructions on checking whether an AWS Batch compute environment is failing to scale because of quota issues; it looks like you have to inspect the EC2 Auto Scaling group underpinning the compute environment.
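Roughly, that inspection can be done from the CLI as well - a sketch, where <group-name> is a placeholder for the Auto Scaling group created by the Batch stack:

```shell
# Find the Auto Scaling group backing the Batch compute environment.
aws autoscaling describe-auto-scaling-groups \
    --query "AutoScalingGroups[].AutoScalingGroupName"

# Failed scale-up attempts (including quota errors) show up as
# scaling activities with an error status and message.
aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name <group-name> \
    --query "Activities[].[StatusCode,StatusMessage]"
```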

* The name is misleading: EC2 quotas used to count numbers of instances, but about a year ago they were changed to count numbers of vCPUs. Apparently the names have yet to be updated.

POSTED BY: Jesse Friedman

Great, glad that worked for you. I've updated the template linked from the documentation to this new version.

Perhaps you'd like to cross-post your M.SE question to Wolfram Community to get more eyes on it.

POSTED BY: Jesse Friedman

Hi Philipp, I've posted an updated template here, which should fix the problem. If you decide to try it, I'd very much appreciate if you let me know your results. (An explanation of the underlying problem and an initial workaround are here.)

POSTED BY: Jesse Friedman

Since instance limits are now given in vCPUs, you'd have to check what that limit value is. The p3.16xlarge has 64 vCPUs, so if your limit is less than that you won't be able to run an 8-GPU job on such an instance. My default limit in a new account was 32 vCPUs.

I'm afraid I don't have any insights as to your issue with NetTrain, sorry.

I've come up with a fix for the original problem of AZ allocation; the CloudFormation template is here. (You can paste the URL into the Create Stack page in the CloudFormation console.) The template now uses the existing default subnets from your default VPC instead of creating a new VPC and subnet from scratch. If you decide to give it a try I'd very much appreciate if you let me know your results.
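For anyone curious, the default subnets the updated template would pick up can be listed with the AWS CLI - a sketch, assuming configured credentials:

```shell
# List the default subnets of the default VPC (one per availability
# zone); these are what the updated template reuses instead of
# creating a new VPC and subnet from scratch.
aws ec2 describe-subnets \
    --filters Name=default-for-az,Values=true \
    --query "Subnets[].[SubnetId,AvailabilityZone]" \
    --output table
```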

POSTED BY: Jesse Friedman
Posted 5 years ago

Jesse,

Great to hear of this workaround! I had a similar suspicion building up in this regard, but didn't care to create a sufficient number of test stacks to figure it out stochastically. That is, I suspected that there's some sort of per-stack AZ assignment causing all this trouble...

EDIT: It works! (Specifying the subnet in the template, to be precise.) At least with one GPU, that is. I tried to request an instance with eight GPUs, and it was stuck in the "Runnable" state (without any indication of errors) for an hour before I aborted it. I wonder what was going on there; spot prices in the AZ seemed low for the instance type...

POSTED BY: Jari Kirma

I couldn't get the example in the blog post to work either. GPU-based training jobs would be very interesting, especially for Mac users. Maybe somebody from WRI can look into the issue and provide a working example or fix.

POSTED BY: Philipp Winkler

I checked and AWS indeed allocates ZERO vCPUs to p3 instances (or anything else with an NVIDIA GPU so far as I can see). You have to ask for a quota increase. So, I've made the request. I have not heard anything back from AWS. Assuming they grant my humble request, I will see whether the AWS quota was the culprit. Thanks so much for your suggestion.
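For reference, the increase request can also be filed from the CLI - a sketch; the L-417A185B quota code is my assumption for "Running On-Demand P instances", so confirm it first with list-service-quotas:

```shell
# Request an increase to 64 vCPUs for P-type instances, enough for
# one p3.16xlarge (8 GPUs, 64 vCPUs). The units are vCPUs, not
# instance counts.
aws service-quotas request-service-quota-increase \
    --service-code ec2 --quota-code L-417A185B \
    --desired-value 64
```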

POSTED BY: Seth Chandler
Posted 5 years ago
POSTED BY: Jari Kirma
Posted 5 years ago

As far as I remember, my quota for those instances is one - enough for a home user. I'm not too concerned about it, though; I'm more puzzled by NetTrain failing to train on a GPU... even a single GPU, that is.

POSTED BY: Jari Kirma
Posted 5 years ago

Maybe it's good to see that I'm not alone with this issue. I really suspect the problem lies in the AWS Batch stack template, or in implicit assumptions it makes about the environment where it's set up, but that's not a small file to debug.

I believe the following pages could be of help, but unfortunately the iteration cycle of instantiating an environment for blind experiments is not really short enough to make experimentation a particularly great pastime: GPU Jobs and Why is my AWS Batch job stuck in RUNNABLE status?
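One shortcut that avoids a full re-instantiation cycle: the AWS CLI can interrogate a stuck job and its compute environment directly - a sketch, assuming configured credentials, where <job-id> is a placeholder for the Batch job ID visible in the AWS Batch console:

```shell
# The statusReason field sometimes explains why a job can't be placed.
aws batch describe-jobs --jobs <job-id> \
    --query "jobs[].[status,statusReason]"

# The compute environment's own status can also carry an error reason.
aws batch describe-compute-environments \
    --query "computeEnvironments[].[computeEnvironmentName,status,statusReason]"
```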

POSTED BY: Jari Kirma