Hi Jari, thanks for posting about this. I suspect you may be running into an issue I've encountered sporadically in testing whereby the AWS VPC subnet created by the CloudFormation stack template gets assigned to an availability zone that doesn't support the p3 GPU instance family.
The template as currently configured creates a single subnet with no AZ specified, causing AWS to assign it to a random AZ in your current region. In us-east-1, for example, the p3 family is supported only in the us-east-1c, us-east-1d, and us-east-1f AZs*, so if the luck of the draw drops your subnet in us-east-1b, the AWS Batch compute environment won't be able to spin up p3 instances. You can confirm whether this is indeed the issue in the AWS Batch console under "Compute environments" - when a job is stuck in "Runnable", the compute environment associated with your CloudFormation stack will show "INVALID" under "Status", and if you click it you'll see a message like:
INVALID - CLIENT_ERROR - You must use a valid fully-formed launch
template. Your requested instance type (p3.2xlarge) is not supported
in your requested Availability Zone (us-east-1b). Please retry your
request by not specifying an Availability Zone or choosing us-east-1c,
us-east-1d, us-east-1f.
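If you'd rather check this from a script than from the console, something like the sketch below should surface the same status message. It assumes boto3 is installed and your AWS credentials are configured; the function names are my own and are not part of the template. It only looks at the first page of results, which should be plenty unless you have a very large number of compute environments.

```python
def invalid_environments(response):
    """Return (name, statusReason) for each compute environment marked INVALID.

    `response` is a dict shaped like the output of boto3's
    batch.describe_compute_environments().
    """
    return [
        (env["computeEnvironmentName"], env.get("statusReason", ""))
        for env in response.get("computeEnvironments", [])
        if env.get("status") == "INVALID"
    ]


def fetch_invalid_environments(region_name="us-east-1"):
    """Query AWS Batch and report INVALID environments.

    Requires boto3 and AWS credentials; the region default is only an example.
    """
    import boto3  # imported lazily so the helper above works without boto3

    batch = boto3.client("batch", region_name=region_name)
    return invalid_environments(batch.describe_compute_environments())
```

Calling fetch_invalid_environments() and printing each (name, reason) pair should show the same "CLIENT_ERROR" text as above if this is the problem you're hitting.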
This is a known issue and I hope to be able to push a fix out to our template in the near future. This fix will likely involve either A) creating one subnet per AZ instead of only one subnet total, or B) using the default subnets in the user's default VPC instead of auto-creating a new VPC. (Unfortunately, CloudFormation does not make it easy to do either of these things.)
At present, the easiest workaround is to manually specify a subnet when you create a CloudFormation stack from the template. If your AWS account was created after late 2013, you should already have a default VPC in each region, with one default subnet for each AZ**. You can list these subnets in the console here: https://console.aws.amazon.com/vpc/home#subnets:DefaultForAz=Yes. Pick a subnet from this list that's in an AZ that supports the p3 instance family* and paste the "Subnet ID" into the "VPC subnet" field in the CloudFormation stack creation form (replacing the default text "AutoCreateNewVPC").
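If you'd prefer not to cross-reference the two console pages by hand, here is a rough sketch that does the matching for you. It assumes boto3 is installed and credentials are configured, and the function names and the default region are just placeholders of mine. It asks EC2 which AZs offer the instance type, lists your default subnets, and keeps only the subnets in a supported AZ.

```python
def usable_default_subnets(offerings_response, subnets_response):
    """Return sorted (availability_zone, subnet_id) pairs for subnets whose AZ
    offers the instance type.

    The arguments are dicts shaped like the outputs of boto3's
    ec2.describe_instance_type_offerings() and ec2.describe_subnets().
    """
    supported_azs = {
        offering["Location"]
        for offering in offerings_response.get("InstanceTypeOfferings", [])
    }
    return sorted(
        (subnet["AvailabilityZone"], subnet["SubnetId"])
        for subnet in subnets_response.get("Subnets", [])
        if subnet["AvailabilityZone"] in supported_azs
    )


def find_subnets(instance_type="p3.2xlarge", region_name="us-east-1"):
    """Look up default subnets that can host `instance_type` (needs boto3 + credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    # Which AZs in this region offer the instance type?
    offerings = ec2.describe_instance_type_offerings(
        LocationType="availability-zone",
        Filters=[{"Name": "instance-type", "Values": [instance_type]}],
    )
    # Which subnets are the default subnet for their AZ?
    subnets = ec2.describe_subnets(
        Filters=[{"Name": "default-for-az", "Values": ["true"]}]
    )
    return usable_default_subnets(offerings, subnets)
```

Any subnet ID returned by find_subnets() should be safe to paste into the "VPC subnet" field. An empty list would suggest either that you have no default subnets in that region or that none of them sit in an AZ offering the instance type.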
Let me know if you're able to work around the problem with the instructions above. My apologies for the inconvenience; I hope to have the underlying issue fixed soon.
* You can check this for the region of your choosing in the AWS console under EC2 > Instance Types. Check this page (https://console.aws.amazon.com/ec2/v2/home#InstanceTypeDetails:instanceType=p3.2xlarge) for the "p3.2xlarge" instance type, scroll down to the "Networking" section, and find the "Availability zones" list. See also this page for AWS's instructions on the same.
** If you don't have a default VPC, you can create one with the instructions here.