

RemoteBatchSubmit job on AWS with GPUs stuck "Runnable"

Posted 10 months ago
6144 Views | 10 Replies | 15 Total Likes

I was positively surprised by RemoteBatchSubmit in v12.2 - finally a productised, supported way to run jobs on cloud platforms! I have been waiting for this for a long time.

Initial trials went well - the provided workflow guide was easy to follow and my first tests worked nicely. Unfortunately, I was less thrilled when, after repeated attempts, I was unable to reproduce the example from the 12.2 release announcement blog post:

RemoteBatchSubmit[env,
 NetTrain[NetModel["LeNet"], "MNIST", TargetDevice -> "GPU"],
 RemoteProviderSettings -> <|"GPUCount" -> 1|>]

No matter how I attempt this (I've created environments in multiple AWS regions - including us-east-1, used in the blog post - tinkering with the included instance types, etc., to no avail), including the "GPUCount" -> 1 setting results in the "JobStatus" property of the job being permanently stuck at "Runnable" (until aborted, of course).

I believe the ability to perform GPU-based training jobs on AWS is a major attraction for Mathematica 12.2 users. Please provide a working example where these jobs don't get stuck in the queue forever...

10 Replies
Posted 10 months ago

I couldn't get the example in the blog post to work either. GPU-based training jobs would be very interesting, especially for Mac users. Maybe somebody from WRI can look into the issue and provide a working example or a fix.

Hi Philipp, I've posted an updated template here, which should fix the problem. If you decide to try it, I'd very much appreciate if you let me know your results. (An explanation of the underlying problem and an initial workaround are here.)

Posted 10 months ago

In a way, it's good to see that I'm not alone with this issue. I suspect the problem lies in the AWS Batch stack template, or in implicit assumptions it makes about the environment where it's set up - but that's not a small file to debug.

I believe the following pages could be of help, but unfortunately the iteration cycle of instantiating an environment for blind experiments is not short enough to make experimentation a particularly great pastime: GPU Jobs and Why is my AWS Batch job stuck in RUNNABLE status?

Hi Jari, thanks for posting about this. I suspect you may be running into an issue I've encountered sporadically in testing, whereby the AWS VPC subnet created by the CloudFormation stack template gets assigned to an availability zone (AZ) that doesn't support the p3 GPU instance family.

The template as currently configured creates a single subnet with no AZ specified, causing AWS to assign it to a random AZ in your current region. In us-east-1, for example, the p3 family is supported only in the us-east-1c, us-east-1d, and us-east-1f AZs*, so if the luck of the draw drops your subnet in us-east-1b, the AWS Batch compute environment won't be able to spin up p3 instances. You can confirm whether this is indeed the issue in the AWS Batch console under "Compute environments" - when a job is stuck in "Runnable", the compute environment associated with your CloudFormation stack will show "INVALID" under "Status", and if you click it you'll see a message like:

INVALID - CLIENT_ERROR - You must use a valid fully-formed launch template. Your requested instance type (p3.2xlarge) is not supported in your requested Availability Zone (us-east-1b). Please retry your request by not specifying an Availability Zone or choosing us-east-1c, us-east-1d, us-east-1f.
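If you'd rather check this from code than in the console, the same status is visible via the AWS Batch API. Here's an illustrative Python sketch over a hand-written sample response - the dictionary shape mirrors what boto3's `describe_compute_environments` returns, but the environment name and reason text below are made up for the example:

```python
# Illustrative sketch: detect an INVALID AWS Batch compute environment.
# In practice you'd fetch the response with:
#   boto3.client("batch").describe_compute_environments()
# The sample below is hand-written, mirroring the API's response shape;
# the environment name is hypothetical.
sample_response = {
    "computeEnvironments": [
        {
            "computeEnvironmentName": "WolframBatchComputeEnvironment",
            "status": "INVALID",
            "statusReason": (
                "CLIENT_ERROR - You must use a valid fully-formed launch template. "
                "Your requested instance type (p3.2xlarge) is not supported in "
                "your requested Availability Zone (us-east-1b)."
            ),
        }
    ]
}

def broken_environments(response):
    """Return (name, reason) for each compute environment marked INVALID."""
    return [
        (ce["computeEnvironmentName"], ce.get("statusReason", ""))
        for ce in response["computeEnvironments"]
        if ce["status"] == "INVALID"
    ]

for name, reason in broken_environments(sample_response):
    print(f"{name}: {reason}")
```

A healthy environment reports "VALID" here, in which case the job being stuck has some other cause (such as the quota issue discussed further down).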

This is a known issue and I hope to be able to push a fix out to our template in the near future. This fix will likely involve either A) creating one subnet per AZ instead of only one subnet total, or B) using the default subnets in the user's default VPC instead of auto-creating a new VPC. (Unfortunately, CloudFormation does not make it easy to do either of these things.)

At present, the easiest workaround is to manually specify a subnet when you create a CloudFormation stack from the template. If your AWS account was created after late 2013, you should already have a default VPC in each region, with one default subnet for each AZ**. You can list these subnets in the console here: https://console.aws.amazon.com/vpc/home#subnets:DefaultForAz=Yes. Pick a subnet from this list that's in an AZ that supports the p3 instance family* and paste the "Subnet ID" into the "VPC subnet" field in the CloudFormation stack creation form (replacing the default text "AutoCreateNewVPC").
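The subnet-picking step above can be sketched in code. This is a minimal illustration over made-up data: in practice the supported-AZ set comes from the EC2 console page mentioned in the footnote, and the subnet list from the VPC console's default-subnet view; the subnet IDs here are hypothetical.

```python
# Illustrative: choose a default subnet in an AZ that supports the p3 family.
# Both data sets are hypothetical examples for us-east-1 - look up the real
# values in the EC2 and VPC consoles for your region.
p3_supported_azs = {"us-east-1c", "us-east-1d", "us-east-1f"}

default_subnets = [  # (Subnet ID, availability zone) - made-up IDs
    ("subnet-0aaa111", "us-east-1a"),
    ("subnet-0bbb222", "us-east-1b"),
    ("subnet-0ccc333", "us-east-1c"),
]

def pick_subnet(subnets, supported_azs):
    """Return the first default subnet whose AZ supports the instance family."""
    for subnet_id, az in subnets:
        if az in supported_azs:
            return subnet_id
    return None  # no usable default subnet: create one in a supported AZ

# Paste the returned Subnet ID into the "VPC subnet" field of the stack
# creation form, replacing the default text "AutoCreateNewVPC".
print(pick_subnet(default_subnets, p3_supported_azs))
```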

Let me know if you're able to work around the problem with the instructions above. My apologies for the inconvenience; I hope to have the underlying issue fixed soon.

* You can check this for the region of your choosing in the AWS console under EC2 > Instance Types. Check this page (https://console.aws.amazon.com/ec2/v2/home#InstanceTypeDetails:instanceType=p3.2xlarge) for the "p3.2xlarge" instance type, scroll down to the "Networking" section, and find the "Availability zones" list. See also this page for AWS's instructions on the same.
** If you don't have a default VPC, you can create one with the instructions here.

Posted 10 months ago

Jesse,

Great to hear of this workaround! I had a similar suspicion building up, but didn't care to create a sufficient number of test stacks to figure it out stochastically. That is, I suspected there's some sort of per-stack AZ assignment causing all this trouble...

EDIT: It works! (Specifying the subnet in the template, to be precise.) At least with one GPU, that is. I tried to request an instance with eight GPUs, and it was stuck in the "Runnable" state (without any indication of errors) for an hour before I aborted it. I wonder what was going on there - spot prices in the AZ seemed low for the instance type...

Great, glad that got it working for you!

Regarding the issue with larger instances, is it possible you're running into the EC2 instance type quota on your account? My recollection is that the default P-type instance quota is fairly low; possibly not enough to launch the 8-GPU p3.16xlarge instance type. Check the "Running On-Demand P instances"* quota in the AWS Service Quotas console. (This deep link may or may not get you to the right page.) The units here are vCPUs; the p3.16xlarge type has 64 vCPUs, so if your quota is less than that you won't be able to launch that instance type. You can submit a quota increase request from the same page; in my experience they're processed extremely quickly.
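The arithmetic here is simple enough to sketch. The vCPU counts below are the published values for the p3 family; the quota values are examples (32 vCPUs matches my recollection of a fresh account's default):

```python
# Illustrative quota check: EC2 "Running On-Demand P instances" quotas are
# denominated in vCPUs, so compare the instance type's vCPU count against
# the quota value shown in the Service Quotas console.
P3_VCPUS = {  # published vCPU counts for the p3 family
    "p3.2xlarge": 8,    # 1 GPU
    "p3.8xlarge": 32,   # 4 GPUs
    "p3.16xlarge": 64,  # 8 GPUs
}

def can_launch(instance_type, quota_vcpus):
    """True if a single instance of this type fits within the vCPU quota."""
    return P3_VCPUS[instance_type] <= quota_vcpus

# With a hypothetical 32-vCPU quota, a 1-GPU job fits, but an 8-GPU
# p3.16xlarge job would sit in "Runnable" indefinitely:
print(can_launch("p3.2xlarge", 32))   # True
print(can_launch("p3.16xlarge", 32))  # False
```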

The AWS support article you referenced earlier does contain some fairly buried instructions on checking whether an AWS Batch compute environment is failing to scale because of quota issues - it looks like you have to inspect the EC2 Auto Scaling group underpinning the compute environment.

* The name is misleading: EC2 quotas used to count numbers of instances, but about a year ago they were changed to count numbers of vCPUs. Apparently the names have yet to be updated.

Posted 10 months ago

As far as I remember, my quota for those instances is one - enough for a home user. I'm not too concerned about it, though; I'm more puzzled by NetTrain failing to train on the GPU... even a single GPU, that is.

Since instance limits are now given in vCPUs, you'd have to check what that limit value is. The p3.16xlarge has 64 vCPUs, so if your limit is less than that you won't be able to run an 8-GPU job on such an instance. My default limit in a new account was 32 vCPUs.

I'm afraid I don't have any insights as to your issue with NetTrain, sorry.

I've come up with a fix for the original problem of AZ allocation; the CloudFormation template is here. (You can paste the URL into the Create Stack page in the CloudFormation console.) The template now uses the existing default subnets from your default VPC instead of creating a new VPC and subnet from scratch. If you decide to give it a try I'd very much appreciate if you let me know your results.

Posted 10 months ago

Thanks for the revised template! A quick test with it succeeds in running a GPU job (the MNIST NetTrain example from the release blog post) instead of getting stuck at "Runnable" as it did earlier.

I'll have to look at my large-instance quotas - your analysis points to the likely cause. Unfortunately, the odd fact that some networks seem to just fail at getting useful training results on GPUs may make this less useful for me, though.

Great, glad that worked for you. I've updated the template pointed to from the documentation to this new version.

Perhaps you'd like to cross-post your M.SE question to Wolfram Community to get more eyes on it.
