RemoteBatchSubmit job on AWS with GPUs stuck "Runnable"

Posted 3 years ago

I was positively surprised by RemoteBatchSubmit in v12.2 - finally a productised, supported way to run jobs on cloud platforms! I have been waiting for this for a long time.

Initial trials went well - the provided workflow guide was easy to follow and my first tests worked nicely. Unfortunately, I was less thrilled when, after repeated attempts, I was unable to reproduce the example from the 12.2 release announcement blog post:

RemoteBatchSubmit[env,
 NetTrain[NetModel["LeNet"], "MNIST", TargetDevice -> "GPU"],
 RemoteProviderSettings -> <|"GPUCount" -> 1|>]

No matter how I attempt this (I've created environments in multiple AWS regions - including us-east-1, used in the blog post - and tinkered with the included instance types, all to no avail), including the "GPUCount" -> 1 setting results in the "JobStatus" property of the job being permanently stuck as "Runnable" (until aborted, of course).
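For reference, this is roughly how I've been submitting and watching the job (a minimal sketch; the 30-second refresh interval is arbitrary):

job = RemoteBatchSubmit[env,
  NetTrain[NetModel["LeNet"], "MNIST", TargetDevice -> "GPU"],
  RemoteProviderSettings -> <|"GPUCount" -> 1|>];
(* poll the "JobStatus" property; in my case it never advances past "Runnable" *)
Dynamic[Refresh[job["JobStatus"], UpdateInterval -> 30]]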

I believe the ability to perform GPU-based training jobs on AWS is a major attraction for Mathematica 12.2 users. Please provide a working example where these jobs don't get stuck in the queue forever...

POSTED BY: Jari Kirma
13 Replies

Thanks for this helpful discussion.

I could use some advice on how to use AWS from Wolfram more effectively. I've managed, with some struggle, to create the necessary permissions, and when I use the (updated) CloudFormation link referenced in the documentation to obtain a vanilla environment, I can do modest computations just fine. That is, I was delighted when RemoteBatchSubmit[env, blahblah] returned EvaluationResults in reasonable time. But ...

I now need to do some neural network training and evaluation that essentially requires a GPU. So, my strategy was to put just "p3" in the Available Instance Types field of the CloudFormation form. I did so because p3 instances appear to have GPUs available to them whereas some of the other default Available Instance Types do not. I also set the Default GPU Environment field in the CloudFormation form to 1. Here's the RemoteBatchSubmissionEnvironment that was returned to me.

env2 = RemoteBatchSubmissionEnvironment["AWSBatch", <|
  "JobQueue" -> "arn:aws:batch:us-east-1:347566773302:job-queue/WolframJobQueue-d3B4hun0Mh0A8IKa",
  "JobDefinition" -> "arn:aws:batch:us-east-1:347566773302:job-definition/WolframJobDefinition-cf703d5831a3a0c:1",
  "IOBucket" -> "gpuneeded-wolframjobdatabucket-1nr93xjmeu4g0"|>]

I then did what others have done to assess use of AWS. I ran the example in the Wolfram Documentation:

job=RemoteBatchSubmit[env2, 
 nt = NetTrain[NetModel["LeNet"], "MNIST", TargetDevice -> "GPU"], 
 RemoteProviderSettings -> <|"GPUCount" -> 1|>]

It's been about half an hour and perhaps I am too impatient, but I am still getting the sad "Runnable" when I evaluate the following code:

job["JobStatus"]

There is no indication from AWS as to when (if ever) my submission will actually evaluate. So, the basic question is: am I doing something wrong?
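For what it's worth, here's roughly how I've been waiting on it rather than re-evaluating by hand (a minimal sketch; the 30-minute timeout is an arbitrary choice, and RemoteBatchJobAbort is the documented way to cancel a job):

(* poll once a minute for up to 30 minutes; abort if the job is still only "Runnable" *)
TimeConstrained[
 While[job["JobStatus"] === "Runnable", Pause[60]],
 1800,
 RemoteBatchJobAbort[job]]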

Things that perhaps I have screwed up:

  1. Do I need a paid account? Would a paid account help accelerate the process?
  2. Did I make a mistake by limiting myself to p3 machines in the CloudFormation template? What would be better?
  3. Other

A further note: I think there is a challenge for people (like me) who use Wolfram precisely because they are NOT computer scientists but find Wolfram both extraordinarily easy to use and exceedingly well documented (with an emphasis on examples). Occasionally, though, we need to leave the friendly, well-documented Wolfram Universe and move to other terrains. And often, I find, the documentation there is either poor or makes a huge number of unfounded assumptions about the knowledge and vocabulary of the user. It's often non-conceptual recipes that fail to generalize or verbal descriptions without any examples. While this may be fine when, for example, computer scientists use AWS, it is a real challenge when a person who has lived in the Wolfram Universe needs to go outside it. So, I very much appreciate Jesse Friedman's efforts to start bridging the gap.

Oh, and one more thing. I wonder if some amount of the desire to use AWS stems from the fact that -- still -- Mac users cannot (to my knowledge) easily use a GPU to perform operations like NetTrain. I have been told for years that this is some limitation due to MXNet and that it might (or might not) go away. My feeling is that in 2022, Wolfram needs to figure out a way to unleash GPU performance for its many users doing Machine Learning on a Mac. (Or, if it can be done, let users know how to do it.)

All help appreciated!

POSTED BY: Seth Chandler

Hi Seth,

I would guess that you might be running into an issue with EC2 instance quotas like I described in a comment above (starting "Regarding the issue with larger instances..."). Your account may have a limit on the number of P3 instances that you can launch; if the limit is zero, then you won't be able to launch any P3 instances at all. You can check the current limit in the "Service Quotas" section of the AWS console (look for the "Running On-Demand P instances" quota), and you can request a quota increase using the instructions here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-on-demand-instances.html#vcpu-limits-request-increase. The quota is given in units of vCPUs, and p3 instances start at 8 vCPUs per instance, so requesting an increase to 8 vCPUs is enough to run one instance at a time.
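To make the arithmetic concrete, here's a rough sizing sketch (the vCPU counts are the ones AWS lists for the p3 sizes; the quota value of 8 is just an example):

(* vCPUs per p3 instance size, and how many instances a given vCPU quota allows *)
vcpus = <|"p3.2xlarge" -> 8, "p3.8xlarge" -> 32, "p3.16xlarge" -> 64|>;
quota = 8;
Floor[quota/#] & /@ vcpus
(* -> <|"p3.2xlarge" -> 1, "p3.8xlarge" -> 0, "p3.16xlarge" -> 0|> *)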

Let me know if this works for you!

POSTED BY: Jesse Friedman

I checked and AWS indeed allocates ZERO vCPUs to p3 instances (or anything else with an NVIDIA GPU so far as I can see). You have to ask for a quota increase. So, I've made the request. I have not heard anything back from AWS. Assuming they grant my humble request, I will see whether the AWS quota was the culprit. Thanks so much for your suggestion.

POSTED BY: Seth Chandler
Posted 3 years ago

As far as I remember, my quota for those instances is one - enough for a home user. I'm not too concerned about that, though; I'm more puzzled by NetTrain failing to train on a GPU... even a single GPU, that is.

POSTED BY: Jari Kirma

Since instance limits are now given in vCPUs, you'd have to check what that limit value is. The p3.16xlarge has 64 vCPUs, so if your limit is less than that you won't be able to run an 8-GPU job on such an instance. My default limit in a new account was 32 vCPUs.

I'm afraid I don't have any insights as to your issue with NetTrain, sorry.

I've come up with a fix for the original problem of AZ allocation; the CloudFormation template is here. (You can paste the URL into the Create Stack page in the CloudFormation console.) The template now uses the existing default subnets from your default VPC instead of creating a new VPC and subnet from scratch. If you decide to give it a try, I'd very much appreciate it if you let me know your results.

POSTED BY: Jesse Friedman
Posted 3 years ago

Thanks for the revised template! A quick test with it succeeds in running a GPU job (the MNIST NetTrain example from the release blog post) instead of getting stuck as "Runnable" as it did earlier.
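Once the status reaches "Succeeded", the trained net can be pulled back with the job's result property (a minimal sketch):

job["JobStatus"]                   (* now eventually returns "Succeeded" *)
trained = job["EvaluationResult"]  (* the result of the NetTrain evaluation *)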

I still have to look at my large-instance quotas - your analysis points to the likely cause. Unfortunately, the odd fact that some networks seem to fail at producing useful training results on GPUs may make this less useful for me, though.

POSTED BY: Jari Kirma

Great, glad that worked for you. I've updated the template pointed to from the documentation to this new version.

Perhaps you'd like to cross-post your M.SE question to Wolfram Community to get more eyes on it.

POSTED BY: Jesse Friedman

Hi Jari, thanks for posting about this. I suspect you may be running into an issue I've encountered sporadically in testing whereby the AWS VPC subnet created by the CloudFormation stack template gets assigned to an availability zone that doesn't support the p3 GPU instance family.

The template as currently configured creates a single subnet with no AZ specified, causing AWS to assign it to a random AZ in your current region. In us-east-1, for example, the p3 family is supported only in the us-east-1c, us-east-1d, and us-east-1f AZs*, so if the luck of the draw drops your subnet in us-east-1b, the AWS Batch compute environment won't be able to spin up p3 instances. You can confirm whether this is indeed the issue in the AWS Batch console under "Compute environments" - when a job is stuck in "Runnable", the compute environment associated with your CloudFormation stack will show "INVALID" under "Status", and if you click it you'll see a message like:

INVALID - CLIENT_ERROR - You must use a valid fully-formed launch template. Your requested instance type (p3.2xlarge) is not supported in your requested Availability Zone (us-east-1b). Please retry your request by not specifying an Availability Zone or choosing us-east-1c, us-east-1d, us-east-1f.

This is a known issue and I hope to be able to push a fix out to our template in the near future. This fix will likely involve either A) creating one subnet per AZ instead of only one subnet total, or B) using the default subnets in the user's default VPC instead of auto-creating a new VPC. (Unfortunately, CloudFormation does not make it easy to do either of these things.)

At present, the easiest workaround is to manually specify a subnet when you create a CloudFormation stack from the template. If your AWS account was created after late 2013, you should already have a default VPC in each region, with one default subnet for each AZ**. You can list these subnets in the console here: https://console.aws.amazon.com/vpc/home#subnets:DefaultForAz=Yes. Pick a subnet from this list that's in an AZ that supports the p3 instance family* and paste the "Subnet ID" into the "VPC subnet" field in the CloudFormation stack creation form (replacing the default text "AutoCreateNewVPC").

Let me know if you're able to work around the problem with the instructions above. My apologies for the inconvenience; I hope to have the underlying issue fixed soon.

* You can check this for the region of your choosing in the AWS console under EC2 > Instance Types. Check this page (https://console.aws.amazon.com/ec2/v2/home#InstanceTypeDetails:instanceType=p3.2xlarge) for the "p3.2xlarge" instance type, scroll down to the "Networking" section, and find the "Availability zones" list. See also this page for AWS's instructions on the same.
** If you don't have a default VPC, you can create one with the instructions here.

POSTED BY: Jesse Friedman
Posted 3 years ago

Jesse,

Great to hear of this workaround! I had a similar suspicion building up in this regard, but didn't care to create a sufficient number of test stacks to figure it out stochastically. That is, I suspected there was some sort of per-stack AZ assignment causing all this trouble...

EDIT: It works! (Specifying the subnet in the template, to be precise.) At least with one GPU, that is. I tried to request an instance with eight GPUs and it was stuck in the "Runnable" state (without any indication of errors) for an hour before I aborted it. I wonder what was going on there; spot prices in the AZ seemed low for the instance type...
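For reference, the eight-GPU attempt was essentially the same submission with a larger "GPUCount" (a sketch of what I tried):

(* same MNIST example, but requesting eight GPUs - this one stayed "Runnable" *)
RemoteBatchSubmit[env,
 NetTrain[NetModel["LeNet"], "MNIST", TargetDevice -> "GPU"],
 RemoteProviderSettings -> <|"GPUCount" -> 8|>]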

POSTED BY: Jari Kirma

Great, glad that got it working for you!

Regarding the issue with larger instances, is it possible you're running into the EC2 instance type quota on your account? My recollection is that the default P-type instance quota is fairly low; possibly not enough to launch the 8-GPU p3.16xlarge instance type. Check the "Running On-Demand P instances"* quota in the AWS Service Quotas console. (This deep link may or may not get you to the right page.) The units here are vCPUs; the p3.16xlarge type has 64 vCPUs, so if your quota is less than that you won't be able to launch that instance type. You can submit a quota increase request from the same page; in my experience they're processed extremely quickly.

The AWS support article you referenced earlier does contain some fairly buried instructions on checking if an AWS Batch compute environment is failing to scale because of quota issues - looks like you have to inspect the EC2 autoscaling group underpinning the compute environment.

* The name is misleading: EC2 quotas used to count numbers of instances, but about a year ago they were changed to count numbers of vCPUs. Apparently the names have yet to be updated.

POSTED BY: Jesse Friedman
Posted 3 years ago

Maybe it's good to see that I'm not alone with this issue. I really suspect the problem lies in the AWS Batch stack template or in implicit assumptions it makes about the environment where it's set up, but that's not a small file to debug.

I believe the following pages could be of help, but unfortunately the iteration cycle of instantiating an environment for blind experiments is not short enough to make such experimentation a particularly great pastime: GPU Jobs and Why is my AWS Batch job stuck in RUNNABLE status?

POSTED BY: Jari Kirma

I couldn't get the example in the blog post to work either. GPU-based training jobs would be very interesting, especially for Mac users. Maybe somebody from WRI can look into the issue and provide a working example or fix.

POSTED BY: Philipp Winkler

Hi Philipp, I've posted an updated template here, which should fix the problem. If you decide to try it, I'd very much appreciate it if you let me know your results. (An explanation of the underlying problem and an initial workaround are here.)

POSTED BY: Jesse Friedman