Outage in AWS

Amazon Elastic Container Service - Increased Rates of Insufficient Capacity Errors

Resolved Minor

August 24, 2022 - Started almost 4 years ago - Lasted about 13 hours

Incident Report

3:30 PM PDT We can confirm increased insufficient capacity error rates for launching new ECS tasks using the Fargate launch type, starting at 1:15 PM PDT. Running tasks and tasks that utilize the EC2 launch type are not impacted. We continue to work through full resolution and will provide another update in the next 30 minutes.

4:06 PM PDT We have identified the root cause for the increase in insufficient capacity error rates for launching new Fargate tasks and pods. Customers using ECS with Fargate and EKS with Fargate are impacted, starting at 1:15 PM PDT. We continue to work through full resolution of the issue. We expect you to begin observing signs of recovery beginning at 4:20 PM PDT. Running tasks and pods are not impacted. Customers using ECS with EC2 or EKS with EC2 are not impacted by this issue. We will provide another update in the next 30 minutes.

4:41 PM PDT We have identified the root cause for the increase in insufficient capacity error rates for launching new Fargate tasks and pods. Customers using ECS with Fargate and EKS with Fargate are impacted, starting at 1:15 PM PDT. We continue to work towards full resolution of the issue. This is taking longer than expected, and we now expect it to be 5:00 PM PDT before you will begin to observe signs of recovery. Running tasks and pods are not impacted. Customers using ECS with EC2 or EKS with EC2 are not impacted by this issue. We will provide another update in the next 30 minutes.

5:20 PM PDT We have identified the root cause for the increase in insufficient capacity error rates for launching new Fargate tasks and pods. Customers using ECS with Fargate and EKS with Fargate are impacted, starting at 1:15 PM PDT. We continue to work towards full resolution of the issue; however, we are experiencing some delays with full recovery. You will still see some task launches succeeding during this event. Running tasks and pods are not impacted. Customers using ECS with EC2 or EKS with EC2 are not impacted by this issue. We will provide another update in the next 30 minutes.

5:49 PM PDT We have identified the root cause for the increase in insufficient capacity error rates for launching new Fargate tasks and pods. Customers using ECS with Fargate and EKS with Fargate are impacted, starting at 1:15 PM PDT. We continue to work towards full resolution of the issue; however, we are experiencing some delays with full recovery. We are working multiple, parallel paths to make additional capacity available. You will still see some task launches succeeding during this event. Running tasks and pods are not impacted. Customers using ECS with EC2 or EKS with EC2 are not impacted by this issue. We will provide another update in the next 30 minutes.

6:49 PM PDT We have identified the cause of the decreased capacity and understand why the Fargate task launch success rate is only 70% at this point. We are working on multiple parallel actions to address the underlying issues and have identified one area in particular that should help us make faster progress towards recovery. We have started work on this and will have an indication of progress by 7:00 PM PDT. Once we have that progress data we will be able to provide an ETA for recovery. We are also making a change to the rate at which ECS launches tasks to reduce load on Fargate and to speed up recovery. Customers with prepared and rehearsed plans for moving to a different region should consider exercising them if they are in a position to do so. Customers can also consider using EC2 with ECS and EKS as a mitigation option since ECS with EC2 and EKS with EC2 are not impacted by this event.

7:45 PM PDT We have identified the cause of the decreased capacity and understand why the Fargate task launch success rate is only 70% at this point. Our remediation actions are making slower progress than expected, so we are working on additional actions to further reduce load on Fargate. The work started in the previous update is still progressing but we do not yet have a projected ETA for when it will complete or when we will see recovery. Customers can switch to using EC2 with ECS and EKS as a mitigation since ECS with EC2 and EKS with EC2 are not impacted by this event.
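The mitigation suggested in the updates above (launching on EC2 when Fargate capacity is unavailable) could be wired into a launcher roughly as follows. This is a minimal sketch, not AWS's guidance in code form: `InsufficientCapacityError` and the `launch` callable are hypothetical stand-ins for whatever client and error your tooling surfaces, and it assumes the cluster already has EC2 container instances registered and a task definition compatible with both launch types. `FARGATE` and `EC2` are the launch type names used by ECS.

```python
from typing import Callable, Dict, Sequence


class InsufficientCapacityError(Exception):
    """Hypothetical error: stands in for a capacity failure reported by your client."""


def launch_with_fallback(
    launch: Callable[[str], str],
    launch_types: Sequence[str] = ("FARGATE", "EC2"),
) -> Dict[str, str]:
    """Try each launch type in order and return the first one that succeeds.

    `launch` is a hypothetical callable that takes a launch type and returns
    a task identifier, raising InsufficientCapacityError on a capacity failure.
    """
    last_error = None
    for launch_type in launch_types:
        try:
            task_id = launch(launch_type)
            return {"launch_type": launch_type, "task_id": task_id}
        except InsufficientCapacityError as err:
            # Remember the failure and move on to the next launch type.
            last_error = err
    raise RuntimeError("all launch types failed") from last_error
```

Only capacity errors trigger the fallback here; any other exception propagates, so a misconfigured task definition is not silently retried on a different launch type.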

8:46 PM PDT The current state is that more than 50% of Fargate task launches in the US-EAST-1 region are succeeding. For the tasks that are failing, we have identified that this is due to a large amount of compute capacity, managed by Fargate, that is in a stuck state. We call these leaked instances. This results in customer tasks and pods not being able to be started, impacting both ECS on Fargate and EKS on Fargate. We are driving multiple parallel efforts to address this. First, we're taking action to make sure no additional instances are leaked by reducing the call rates to Fargate, and second we are working to free up these leaked instances so they can be used to run customer tasks. As stated in previous updates, the recovery actions are making slower progress than we expected which is preventing us from providing an ETA for recovery at this point. Right now we expect multiple hours before we see recovery. Customers who already have Fargate tasks or pods running are recommended to not scale down until we have recovery. Customers can switch to using EC2 with ECS and EKS as a mitigation since ECS with EC2 and EKS with EC2 are not impacted by this event.

9:19 PM PDT To help speed up recovery, we have temporarily disabled Fargate task and pod launches in the US-EAST-1 region. While we are in this state, you will see all Fargate task and pod launches fail. Disabling Fargate task launches will free the service up to process and release the leaked instances, and we are already seeing faster progress towards this. Once we have released all the leaked instances and are seeing recovery in our other metrics, we will steadily ramp up Fargate task and pod launches and enable normal operations. We don't yet have an ETA, but will communicate one as soon as we have an ETA we feel confident about. In the meantime, customers are recommended to avoid scaling down tasks and pods as they will not be able to launch new tasks until we re-enable Fargate task and pod launches. Customers can also switch to using EC2 with ECS and EKS as a mitigation since ECS with EC2 and EKS with EC2 are not impacted by this event.

9:55 PM PDT To help speed up recovery, we have temporarily disabled Fargate task and pod launches in the US-EAST-1 region. While we are in this state, you will see all Fargate task and pod launches fail. Disabling Fargate task launches will free the service up to process and release the leaked instances. We estimate that we have released 50% of the leaked instances and at the current rate that all leaked instances will have been released by 12:00 AM PDT. Once we have released all the leaked instances and are seeing recovery in our other metrics, we will steadily ramp up Fargate task and pod launches and enable normal operations. We expect to be able to communicate an ETA for recovery soon after we complete releasing all leaked instances. Once this happens, our best estimate is an additional 1 to 2 hours for recovery. In the meantime, customers are recommended to avoid scaling down tasks and pods as they will not be able to launch new tasks until we re-enable Fargate task and pod launches. Customers can also switch to using EC2 with ECS and EKS as a mitigation since ECS with EC2 and EKS with EC2 are not impacted by this event.

10:26 PM PDT To help speed up recovery, we have temporarily disabled Fargate task and pod launches in the US-EAST-1 region. While we are in this state, you will see all Fargate task and pod launches fail. Disabling Fargate task launches will free the service up to process and release the leaked instances. We estimate that we have released 93% of the leaked instances and at the current rate that all leaked instances will have been released by 10:45 PM PDT. Once we have released all the leaked instances and are seeing recovery in our other metrics, we will slowly ramp up Fargate task and pod launches and enable normal operations. We expect to be able to communicate an ETA for recovery soon after we complete releasing all leaked instances. Once this happens, our best estimate is an additional 1 to 2 hours for recovery. In the meantime, customers are recommended to avoid scaling down tasks and pods as they will not be able to launch new tasks until we re-enable Fargate task and pod launches. Customers can also switch to using EC2 with ECS and EKS as a mitigation since ECS with EC2 and EKS with EC2 are not impacted by this event.
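For readers who want to sanity-check a "current rate" estimate like the ones in the 9:55 PM and 10:26 PM updates, the sketch below does a naive linear extrapolation from the two published data points (50% released at 9:55 PM, 93% at 10:26 PM). It assumes a constant release rate between and after those points, which the updates do not claim; the function name is illustrative and this is not AWS's estimation method.

```python
from datetime import datetime, timedelta


def estimate_completion(
    t1: datetime, pct1: float, t2: datetime, pct2: float
) -> datetime:
    """Linearly extrapolate when progress reaches 100% from two observations."""
    if t2 <= t1 or pct2 <= pct1:
        raise ValueError("observations must show forward progress over time")
    # Percentage points released per second between the two observations.
    rate = (pct2 - pct1) / (t2 - t1).total_seconds()
    remaining_seconds = (100.0 - pct2) / rate
    return t2 + timedelta(seconds=remaining_seconds)


# Data points taken from the two updates (times in PDT, August 24, 2022).
first = datetime(2022, 8, 24, 21, 55)
second = datetime(2022, 8, 24, 22, 26)
eta = estimate_completion(first, 50.0, second, 93.0)
print(eta.strftime("%H:%M"))
```

With these inputs the extrapolation lands at roughly 10:31 PM PDT, a little earlier than the 10:45 PM PDT stated in the update. The 11:42 PM update is the first to report that effectively all leaked instances had been released, so a straight-line estimate of this kind is best treated as a lower bound.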

11:42 PM PDT To help speed up recovery, we have temporarily disabled Fargate task and pod launches in the US-EAST-1 region. While we are in this state, you will see all Fargate task and pod launches fail. Disabling Fargate task launches will free the service up to process and release the leaked instances. We have now released effectively all leaked instances and are now starting the process to re-enable Fargate task launches. We will start by enabling Fargate task launches for a small number of accounts and then increase that as we see success. As the pace of recovery is dependent on how fast we increase task launches, it is still too early to provide an ETA for full recovery, but we still expect hours before we are fully recovered. In the meantime, customers are recommended to avoid scaling down tasks and pods as they will not be able to launch new tasks until we re-enable Fargate task and pod launches. Customers can also switch to using EC2 with ECS and EKS as a mitigation since ECS with EC2 and EKS with EC2 are not impacted by this event.

Aug 25, 12:46 AM PDT We have started to enable Fargate task launches in the US-EAST-1 Region again. We are seeing successful task launches and continue to slowly increase the number of tasks being launched. Most customers will still see their task launches failing until we see enough progress to broadly enable task launches. We expect to have enough data to further increase task launches by 1:10 AM PDT. The progress at that point will help inform the ETA for recovery, which we still estimate to be hours out. In the meantime, customers are recommended to avoid scaling down tasks and pods as they will not be able to launch new tasks until we re-enable Fargate task and pod launches. Customers can also switch to using EC2 with ECS and EKS as a mitigation since ECS with EC2 and EKS with EC2 are not impacted by this event.

Aug 25, 1:55 AM PDT We continue making progress with enabling Fargate task launches in the US-EAST-1 Region. At this point, all accounts can launch tasks, but at a lower task launch rate than usual. The lower-than-normal task launch rate means customers can still see task launches failing due to attempting to launch tasks at a higher rate than currently allowed. The impact of this is that scaling of services and deployments will take longer than usual. We will incrementally raise task launch rates back to normal levels as we monitor service recovery. We are no longer seeing elevated task launch failures and our current estimate for full recovery is 4:00 AM PDT. We still recommend that customers avoid scaling down tasks and pods as they will not be able to launch new tasks until we re-enable Fargate task and pod launches. Customers can also switch to using EC2 with ECS and EKS as a mitigation since ECS with EC2 and EKS with EC2 are not impacted by this event.
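While launch rates are reduced, a client that launches tasks faster than currently allowed will see failures, as described above. One common way to cope is to retry with exponential backoff and jitter. The sketch below is a generic illustration under stated assumptions: `LaunchThrottledError` is a hypothetical exception standing in for whatever rate-limit error your client raises, and the default delays are arbitrary rather than values recommended by AWS.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class LaunchThrottledError(Exception):
    """Hypothetical error: a launch rejected for exceeding the allowed rate."""


def retry_with_backoff(
    attempt: Callable[[], T],
    max_attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call `attempt` until it succeeds, sleeping a random, growing delay between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    for n in range(max_attempts):
        try:
            return attempt()
        except LaunchThrottledError:
            if n == max_attempts - 1:
                raise  # Out of attempts: surface the last throttling error.
            # Full jitter: pick a delay uniformly between 0 and the capped exponential.
            cap = min(max_delay, base_delay * (2 ** n))
            sleep(rng.uniform(0.0, cap))
    raise AssertionError("unreachable")
```

The random jitter spreads retries from many clients over time, which matters during a recovery like this one where synchronized retries would add load to an already constrained service.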

Aug 25, 3:08 AM PDT We continue making progress with enabling Fargate task launches in the US-EAST-1 Region. We have just completed a change that increases the task launch rate for tasks running as part of an ECS service. This will reduce the time required both for service deployments and scaling up services. As a reminder, all customers are unblocked from launching Fargate tasks at this point, although still at a rate lower than usual. The lower than normal task launch rate means customers can still see task launches failing due to attempting to launch tasks at a higher rate than currently allowed. We will continue to incrementally raise the task launch rate back to normal levels as we monitor service recovery. We are no longer seeing elevated task launch failures and our current estimate for full recovery is 5:00 AM PDT. We still recommend that customers avoid scaling down tasks and pods as they will not be able to launch new tasks until we re-enable Fargate task and pod launches. Customers can also switch to using EC2 with ECS and EKS as a mitigation since ECS with EC2 and EKS with EC2 are not impacted by this event.

Aug 25, 4:03 AM PDT We continue making progress with enabling Fargate task launches in the US-EAST-1 Region. In addition to the earlier change to increase the task launch rate for tasks running as part of an ECS service, we have also increased the task launch rate for the RunTask API from 1 task per second to 2 tasks per second. As a reminder, all customers are unblocked from launching Fargate tasks at this point, although still at a rate lower than usual. The lower than normal task launch rate means customers can still see task launches failing due to attempting to launch tasks at a higher rate than currently allowed. We will continue to incrementally raise the task launch rate back to normal levels as we monitor service recovery. We are no longer seeing elevated task launch failures and our current estimate for full recovery is 5:00 AM PDT. We still recommend that customers avoid scaling down tasks and pods as they will not be able to launch new tasks until we re-enable Fargate task and pod launches. Customers can also switch to using EC2 with ECS and EKS as a mitigation since ECS with EC2 and EKS with EC2 are not impacted by this event.
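Rather than retrying after a rejection, a client can also pace its own launches to stay at or below the currently allowed rate, such as the 2 tasks per second mentioned in this update. The sketch below is one simple way to do that; it assumes the limit behaves like a steady per-account ceiling, which the updates do not spell out, and `launch_paced` and its parameters are illustrative names rather than part of any AWS SDK.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def launch_paced(
    launches: Iterable[Callable[[], T]],
    tasks_per_second: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Run each launch callable in order, spacing calls at least 1/rate seconds apart."""
    if tasks_per_second <= 0:
        raise ValueError("tasks_per_second must be positive")
    interval = 1.0 / tasks_per_second
    results: List[T] = []
    next_slot = clock()  # Earliest time the next launch may start.
    for launch in launches:
        wait = next_slot - clock()
        if wait > 0:
            sleep(wait)
        results.append(launch())
        # Schedule the following launch one interval after this one started
        # (or after now, if this launch ran late).
        next_slot = max(next_slot, clock()) + interval
    return results
```

The `clock` and `sleep` parameters exist so the pacing can be exercised in a test without real waiting; in normal use the defaults are enough.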

Aug 25, 4:34 AM PDT We continue making progress with enabling Fargate task launches in the US-EAST-1 Region. The task launch rate for the ECS RunTask API has now been increased from 2 tasks per second to 5 tasks per second and we are seeing a corresponding reduction in task launch failures due to customers exceeding the maximum task launch rate. We are continuing to monitor recovery and will keep incrementally raising the task launch rate until we return to the default of 20 tasks per second. As a reminder, all customers are unblocked from launching Fargate tasks at this point, although still at a rate lower than usual. The lower-than-normal task launch rate means customers can still see task launches failing due to attempting to launch tasks at a higher rate than currently allowed. We are no longer seeing elevated task launch failures. It is, however, likely to be past 5:00 AM PDT before we have returned to the default task launch rate of 20 tasks per second. We still recommend that customers avoid scaling down tasks and pods as they will not be able to launch new tasks until we re-enable Fargate task and pod launches. Customers can also switch to using EC2 with ECS and EKS as a mitigation since ECS with EC2 and EKS with EC2 are not impacted by this event.

Aug 25, 4:57 AM PDT We continue making progress with enabling Fargate task launches in the US-EAST-1 Region. All customers now have a task launch rate for the ECS RunTask API of 10 tasks per second and we see further reduction in task launch failures due to customers exceeding the maximum task launch rate. We are continuing to monitor recovery and will keep incrementally raising the task launch rate until we return to the default of 20 tasks per second. As a reminder, all customers are unblocked from launching Fargate tasks at this point, although still at a rate lower than usual. The lower-than-normal task launch rate means customers can still see task launches failing due to attempting to launch tasks at a higher rate than currently allowed. We are no longer seeing elevated task launch failures. It is, however, likely to be past 5:00 AM PDT before we have returned to the default task launch rate of 20 tasks per second. We still recommend that customers avoid scaling down tasks and pods as they will not be able to launch new tasks until we re-enable Fargate task and pod launches. Customers can also switch to using EC2 with ECS and EKS as a mitigation since ECS with EC2 and EKS with EC2 are not impacted by this event.
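To put the RunTask rates from the last three updates in perspective (1, 2, 5 and 10 tasks per second during recovery, against a default of 20), the arithmetic below gives the minimum time needed to launch a batch of tasks at each rate. The batch size of 1,000 is an arbitrary example, and the calculation assumes launches are limited only by the per-second rate, ignoring retries and task startup time.

```python
import math

# RunTask launch rates (tasks per second) mentioned in the updates above;
# 20 is the default rate the service was returning to.
RATES = [1, 2, 5, 10, 20]


def min_launch_seconds(task_count: int, tasks_per_second: int) -> int:
    """Lower bound on the seconds needed to launch task_count tasks at a fixed rate."""
    if task_count < 0 or tasks_per_second <= 0:
        raise ValueError("task_count must be >= 0 and tasks_per_second > 0")
    return math.ceil(task_count / tasks_per_second)


for rate in RATES:
    print(f"{rate} tasks/s: at least {min_launch_seconds(1000, rate)} s for 1000 tasks")
```

Under these assumptions, 1,000 tasks need at least 1,000 seconds at 1 task per second but only 50 seconds at the default 20, which is why the updates warn that deployments and scaling take longer than usual while the rate is reduced.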

Components affected

Amazon ECS (us-east-1)
