Outage in SMRT Systems

8-24-22 AWS Outage Causing Racking and Other Issues

Resolved Minor
August 25, 2022 - Started over 2 years ago

Outage Details

An AWS ECS outage is currently causing SMRT queue delays. Users may experience delays with racking, Metal Progetti Assembly, and other areas of the system that rely on our queue system. You can continue to use the system, but be aware that all events that use our queue system will be delayed. AWS has identified the issue and is working to resolve it. Once they have resolved it, our queues will resume processing. This means that racking events, MP unloads and events, ready order texts, etc. will process after the issue is resolved.
Latest Updates (sorted most recent first)
RESOLVED over 2 years ago - at 08/25/2022 03:16AM

AWS has yet to announce that the issue is resolved. However, our queues are back to normal levels and everything should be functioning normally.

This is the result of reduced system load due to the late hour and the 70% recovery of AWS ECS. If you experience any issues tomorrow morning, please reach out to support. Our dev team will be standing by and monitoring the queues even though we are back to normal.

Thanks for your patience,
SMRT Systems

MONITORING over 2 years ago - at 08/25/2022 02:57AM

New update from AWS.

7:45 PM PDT We have identified the cause of the decreased capacity and understand why the Fargate task launch success rate is only 70% at this point. Our remediation actions are making slower progress than expected, so we are working on additional actions to further reduce load on Fargate. The work started in the previous update is still progressing, but we do not yet have a projected ETA for when it will complete or when we will see recovery. Customers can switch to using EC2 with ECS and EKS as a mitigation, since ECS with EC2 and EKS with EC2 are not impacted by this event.

Our queue backlog is now less than one-fifth of its peak from a couple of hours ago. There's still a moderate backlog, but most of the queue has been processed. We should be back to normal within the next hour if this pace holds. Thank you for your patience.
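
As a rough illustration of what the 70% Fargate task launch success rate in the AWS note above can mean for a job that retries, here is a minimal sketch. It assumes each launch attempt succeeds independently with the same probability, which neither AWS nor SMRT states and which is unlikely to hold exactly during a capacity shortage. The function names are hypothetical and are not part of any AWS or SMRT API. Retries also add load, which AWS was actively trying to reduce, so this is arithmetic for intuition and not a recommended retry policy.

```python
def launch_success_probability(per_attempt_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` launches succeeds.

    Assumes independent attempts with a fixed success rate (an assumption,
    not something stated in the incident updates).
    """
    if not 0.0 <= per_attempt_rate <= 1.0:
        raise ValueError("per_attempt_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # All attempts fail with probability (1 - rate) ** attempts.
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


def attempts_for_target(per_attempt_rate: float, target: float,
                        max_attempts: int = 50) -> int:
    """Smallest number of attempts whose combined success probability
    reaches `target`, searched up to `max_attempts`."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    for n in range(1, max_attempts + 1):
        if launch_success_probability(per_attempt_rate, n) >= target:
            return n
    raise ValueError("target not reachable within max_attempts")


if __name__ == "__main__":
    rate = 0.70  # launch success rate quoted in the AWS update
    for n in range(1, 5):
        print(n, round(launch_success_probability(rate, n), 4))
    print("attempts for 99%:", attempts_for_target(rate, 0.99))
```

Under these assumptions, a 70% per-attempt rate needs four attempts before the combined chance of a successful launch reaches 99%, which is consistent with a backlog that drains slowly while capacity is degraded.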

MONITORING over 2 years ago - at 08/25/2022 02:01AM

New update from AWS.

6:49 PM PDT We have identified the cause of the decreased capacity and understand why the Fargate task launch success rate is only 70% at this point. We are working on multiple parallel actions to address the underlying issues and have identified one area in particular that should help us make faster progress towards recovery. We have started work on this and expect to have an indication of progress by 7:00 PM PDT. Once we have that progress data, we will be able to provide an ETA for recovery. We are also making a change to the rate at which ECS launches tasks as part of ECS services to reduce load on Fargate and to speed up recovery. Customers with prepared and rehearsed plans for moving to a different region should exercise those plans if they are in a position to do so. Customers can also switch to using EC2 with ECS and EKS as a mitigation, since ECS with EC2 and EKS with EC2 are not impacted by this event.

We have seen a significant decline in the job count in our queues. That said, they are still backlogged and we'll continue to provide updates until the issue is fully resolved.

MONITORING over 2 years ago - at 08/25/2022 12:55AM

There was a recent update from AWS.

Amazon Elastic Container Service - Increased Rates of Insufficient Capacity Errors
5:49 PM PDT We have identified the root cause for the increase in insufficient capacity error rates for launching new Fargate tasks and pods. Customers using ECS with Fargate and EKS with Fargate are impacted, starting at 1:15 PM PDT. We continue to work towards full resolution of the issue; however, we are experiencing some delays with full recovery. We are working multiple, parallel paths to make additional capacity available. You will still see some task launches succeeding during this event. Running tasks and pods are not impacted. Customers using ECS with EC2 or EKS with EC2 are not impacted by this issue. We will provide another update in the next 30 minutes.

Here's the link to their status page (see the second issue, "Operational issue - Amazon Elastic Container Service (N. Virginia)"):
https://health.aws.amazon.com/health/status

MONITORING over 2 years ago - at 08/25/2022 12:48AM

An AWS ECS outage is currently causing SMRT queue delays. Users may experience delays with racking, Metal Progetti Assembly, and other areas of the system that rely on our queue system. You can continue to use the system, but be aware that all events that use our queue system will be delayed.

AWS has identified the issue and is working to resolve it. Once they have resolved it, our queues will resume processing.

This means that racking events, MP unloads and events, ready order texts, etc. will process after the issue is resolved.
