Outage in AWS

Increased Error Rates and Delays - N. Virginia

Resolved - Minor
September 28, 2023 - Lasted about 8 hours

Outage Details

We are investigating increased error rates and increased delays in network propagation for newly launched EC2 instances in a single Availability Zone (use1-az2) in US-EAST-1 Region. EC2 System Status Check metrics may also be reporting “INSUFFICIENT_DATA” state for the affected instances. The issue does not affect networking connectivity for existing instances and we are working to resolve the issue.
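
The note above mentions EC2 System Status Check metrics reporting "INSUFFICIENT_DATA". As a rough sketch that is not part of the AWS notice, the following Python/boto3 snippet lists instances whose system status checks are in that state; it assumes credentials for the affected account and the us-east-1 region are already configured.

    # Sketch only: list instances whose EC2 system status checks report
    # "insufficient-data" in us-east-1. Assumes boto3 credentials are set up.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # IncludeAllInstances=True also returns instances that are not yet in the
    # "running" state, so newly launched instances are covered.
    paginator = ec2.get_paginator("describe_instance_status")
    for page in paginator.paginate(IncludeAllInstances=True):
        for status in page["InstanceStatuses"]:
            if status["SystemStatus"]["Status"] == "insufficient-data":
                print(status["InstanceId"], status["AvailabilityZone"])
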
Latest Updates (sorted newest to oldest)
UPDATE - 09/29/2023 02:56AM

Most customers should begin to see recovery from the increased error rates and delays in network propagation for newly launched EC2 instances in the single affected Availability Zone (use1-az2). We have been able to bring additional healthy capacity online, which has allowed us to make progress on the backlog of network mappings. Our estimated time to recovery of the underlying subsystem is within the next hour. Once the main event has recovered, we will work to recover the PrivateLink data plane in the affected Availability Zone, which we expect to take several hours. We will also need to shift traffic back in for any services that are affected. While we are seeing the beginning signs of recovery, we recommend keeping traffic shifted out of the affected Availability Zone for now.

UPDATE - 09/29/2023 02:17AM

We have been continuing to work on resolving the error rates and increased delays in network propagation for newly launched EC2 instances in a single Availability Zone (use1-az2). Since the last update, we have been able to work around the incorrect data in the mapping subsystem and have brought some of the capacity back online. The majority of affected customers are still unlikely to see recovery of their applications, but we are starting to see the first signs of customer-visible recovery since the start of the event. As we recover additional healthy capacity, we expect more customers to see new network configurations updated on impacted instances in the affected Availability Zone.

Where possible we have already shifted traffic out of the affected Availability Zone. For example, between 11:54 AM and 12:01 PM PDT, Elastic Load Balancing (ELB) shifted traffic away from the impacted Availability Zone. However, it is important that you ensure you have sufficient capacity in other Availability Zones to handle your use case. With that said, we continue to recommend that you shift away from the affected Availability Zone (use1-az2) if you are able, because it may help to mitigate the impact as we work to resolve the issue. For example, if you are still experiencing issues with Elastic Load Balancing or Amazon Transit Gateway, you can remove the subnets for the affected Availability Zone to prevent traffic from traversing it. We will provide additional updates by 8:00 PM PDT, as well as an update on resuming routing to the affected Availability Zone.
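
As a hedged illustration of the subnet-removal suggestion above (this is not AWS-provided tooling), the Python/boto3 sketch below drops the affected zone's subnet from an Application Load Balancer. The load balancer ARN is a placeholder, zone IDs are used because they are consistent across accounts, and an ALB must keep at least two Availability Zones after the change.

    # Illustrative sketch: call SetSubnets with only the subnets that are NOT
    # in the affected Availability Zone (use1-az2).
    import boto3

    AFFECTED_ZONE_ID = "use1-az2"
    LOAD_BALANCER_ARN = "arn:aws:elasticloadbalancing:..."  # placeholder ARN

    ec2 = boto3.client("ec2", region_name="us-east-1")
    elbv2 = boto3.client("elbv2", region_name="us-east-1")

    # Find the subnets the load balancer currently uses, then drop the ones
    # that sit in the affected zone.
    lb = elbv2.describe_load_balancers(
        LoadBalancerArns=[LOAD_BALANCER_ARN]
    )["LoadBalancers"][0]
    subnet_ids = [az["SubnetId"] for az in lb["AvailabilityZones"]]
    subnets = ec2.describe_subnets(SubnetIds=subnet_ids)["Subnets"]
    remaining = [s["SubnetId"] for s in subnets
                 if s["AvailabilityZoneId"] != AFFECTED_ZONE_ID]

    # An ALB requires at least two subnets in distinct Availability Zones.
    if len(remaining) >= 2 and len(remaining) < len(subnet_ids):
        elbv2.set_subnets(LoadBalancerArn=LOAD_BALANCER_ARN, Subnets=remaining)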

UPDATE - 09/29/2023 02:00AM

Customers using PrivateLink to connect to a service or host a service in the affected Availability Zone (use1-az2) are now experiencing connectivity issues. Given the duration of the event, new and existing network mappings are impacted. While we are actively looking for other paths to recovery, our current path is to wait for the main event to resolve and then restart the PrivateLink data plane in the single affected Availability Zone.

Again, if you are able to shift away from the affected Availability Zone (use1-az2), it may help to mitigate the impact as we work to resolve the issue. For example, if you are using Amazon Transit Gateway, you can remove the subnets for the affected Availability Zone to prevent traffic from traversing it.
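
For the Transit Gateway suggestion, a minimal Python/boto3 sketch (the attachment ID is a placeholder, and this is an illustration rather than AWS-provided tooling) could remove the affected zone's subnets from a Transit Gateway VPC attachment like this:

    # Sketch: stop a Transit Gateway VPC attachment from using subnets in the
    # affected Availability Zone (use1-az2).
    import boto3

    AFFECTED_ZONE_ID = "use1-az2"
    TGW_ATTACHMENT_ID = "tgw-attach-0123456789abcdef0"  # placeholder

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Which of this attachment's subnets sit in the affected zone?
    attachment = ec2.describe_transit_gateway_vpc_attachments(
        TransitGatewayAttachmentIds=[TGW_ATTACHMENT_ID]
    )["TransitGatewayVpcAttachments"][0]
    subnets = ec2.describe_subnets(SubnetIds=attachment["SubnetIds"])["Subnets"]
    to_remove = [s["SubnetId"] for s in subnets
                 if s["AvailabilityZoneId"] == AFFECTED_ZONE_ID]

    if to_remove:
        ec2.modify_transit_gateway_vpc_attachment(
            TransitGatewayAttachmentId=TGW_ATTACHMENT_ID,
            RemoveSubnetIds=to_remove,
        )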

UPDATE - 09/29/2023 01:01AM

Since the last update, the error rates and increased delays in network propagation for newly launched EC2 instances in a single Availability Zone (use1-az2) have worsened in some cases. Instances launched before 11:10 AM PDT continue to function correctly. However, some instances launched after 11:10 AM PDT that had recovered may have regressed. Specifically, as we restored capacity, we saw that some of that capacity became unhealthy again. The root cause was data in the mapping subsystem that the subsystem could not process correctly, which we are working to remove. Once that issue has been resolved, we will still need to restart the recovery process on some of the capacity that was impacted, process the mapping backlog, and then shift traffic back into the affected Availability Zone.

Again, if you are able to shift away from the affected Availability Zone (use1-az2), it may help to mitigate the impact as we work to resolve the issue. We will provide further updates on our progress by 7:00 PM PDT as well as an update on efforts to resume routing traffic to the affected Availability Zone.

UPDATE - 09/29/2023 12:19AM

We continue to have all resources on our Engineering teams focused on mitigating the issue resulting in the error rates and increased delays in network propagation for newly launched EC2 instances in a single Availability Zone (use1-az2). Progress has been slower than we initially anticipated. In order to ensure we are not causing additional impact beyond the current issue, we are moving as fast as we can, but at a safe pace.

At this point, instances launched prior to 11:10 AM PDT are continuing to work. However, some instances launched after 11:10 AM PDT are not yet seeing improvements in routing. While those improvements are not yet visible, we are continuing to make positive strides towards recovery. Once we recover a sufficient amount of capacity, we expect new network configurations to begin updating on impacted instances in the impacted Availability Zone (use1-az2). We expect to reach this required amount of capacity within the affected subsystem in the next hour, but expect full recovery to be a few hours away. After we have finished recovering the affected cells, we will still need to shift traffic back for affected services, like Elastic Load Balancing (ELB), so that customer traffic can resume routing to the impacted Availability Zone (use1-az2).

For most AWS services, we have worked to mitigate impact by shifting network traffic away from the affected Availability Zone (use1-az2). For customer applications, if you are able to shift away from the affected Availability Zone (use1-az2), it may help to mitigate the impact as we work to resolve the issue. We will provide further updates on our progress by 6:00 PM PDT as well as an update on efforts to resume routing traffic to the affected Availability Zone.
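
Note that use1-az2 is a zone ID, which maps to a different zone name (us-east-1a, us-east-1b, and so on) in each AWS account. A quick lookup along these lines, assuming boto3 credentials configured for us-east-1, shows which of your Availability Zones is the affected one before you shift traffic:

    # Sketch: translate the affected zone ID into this account's zone name.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    zones = ec2.describe_availability_zones(
        Filters=[{"Name": "zone-id", "Values": ["use1-az2"]}]
    )["AvailabilityZones"]
    for zone in zones:
        print(f"{zone['ZoneId']} is {zone['ZoneName']} in this account")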

UPDATE - 09/28/2023 11:14PM

We continue to work toward resolving the error rates and increased delays in network propagation for newly launched EC2 instances in a single Availability Zone (use1-az2) in US-EAST-1 Region. At this time, we are focusing on iteratively restoring healthy capacity, ensuring that we do so safely. We can confirm that our mitigations are effective at reducing the propagation delay. We will provide further updates on our progress by 5:00 PM PDT, as well as a clear indication when we start seeing recovery in the affected cells.

UPDATE - 09/28/2023 10:19PM

We continue to work on resolving the error rates and increased delays in network propagation for newly launched EC2 instances in a single Availability Zone (use1-az2) in US-EAST-1 Region. An hour ago we provided a 1 to 2 hour ETA for recovery. The deployment of the initial update to address recovery in the remaining affected cells has taken longer than we expected, so we are likely to miss that ETA. We will provide further updates on our progress, as well as a clear indication when we start seeing recovery in the affected cells. As soon as the deployment of the initial update is complete, we will be able to provide a clearer ETA for resolution. We will provide an update by 4:00 PM PDT, or earlier if we have additional information to share.

UPDATE - 09/28/2023 09:53PM

We have resolved the unrelated issue that was resulting in EC2 launch errors in the use1-az1 Availability Zone. We continue to work toward resolving the error rates and increased delays in network propagation for newly launched EC2 instances in a single Availability Zone (use1-az2) in US-EAST-1 Region. We will continue to provide updates as we progress.

UPDATE - 09/28/2023 09:38PM

We continue to work on resolving the error rates and increased delays in network propagation for newly launched EC2 instances in a single Availability Zone (use1-az2) in US-EAST-1 Region. We are also seeing an increase in error rates and latencies for new instance launches in another Availability Zone (use1-az1), which is unrelated and is an issue with creating new EBS volumes during the instance launch. We are also working to resolve that issue, but impact remains lower and unrelated to the network mapping propagation issue. On the network mapping propagation issue in the use1-az2 Availability Zone, we are in the process of deploying an update to address the slow recovery in the remaining cells.

UPDATE - 09/28/2023 09:15PM

We continue to work on resolving the error rates and increased delays in network propagation for newly launched EC2 instances in a single Availability Zone (use1-az2) in US-EAST-1 Region. We have identified the reason for the three cells that are seeing slower recovery and are currently working on resolving the issue. Once done, we should see an improvement in network mapping propagation times over the course of the next 1 to 2 hours. We’ll continue to provide regular updates.

UPDATE - 09/28/2023 08:43PM

We continue to work on resolving the error rates and increased delays in network propagation for newly launched EC2 instances in a single Availability Zone (use1-az2) in US-EAST-1 Region. The issue started at 11:10 AM PDT, when we began to see an increase in network propagation delays for newly launched EC2 instances in a single Availability Zone (use1-az2) in US-EAST-1 Region. The mapping propagation times immediately began to recover in the vast majority of the affected underlying cells, but we have three cells that we continue to work on before we have full recovery. We have confirmed that this issue only affects network mapping propagation within the affected Availability Zone (use1-az2); all other Availability Zones are operating normally. We continue to work on the remaining cells for full recovery and will keep you updated on our progress.

We also wanted to let you know that this issue is not a repeat of the networking issue that occurred on September 18th. Although both issues affected network mapping propagation times, they involved very different subsystems within the EC2 Networking Distribution Plane.

UPDATE - 09/28/2023 08:07PM

We continue to work on resolving the error rates and increased delays in network propagation for newly launched EC2 instances in a single Availability Zone (use1-az2) in US-EAST-1 Region. We have seen some improvement in mapping propagation times in the majority of the affected cells within the Availability Zone (use1-az2) but continue to work on fully understanding the root cause of the event. We continue to recommend shifting network traffic away from the affected Availability Zone (use1-az2) if you are able to do so.

UPDATE - 09/28/2023 07:50PM

We continue to work on resolving the error rates and increased delays in network propagation for newly launched EC2 instances in a single Availability Zone (use1-az2) in US-EAST-1 Region. For AWS services, we have worked to mitigate impact by shifting network traffic away from the affected Availability Zone (use1-az2). For customer applications, if you are able to shift away from the affected Availability Zone (use1-az2), it may help to mitigate the impact as we work to identify the root cause and resolve the issue.

UPDATE - 09/28/2023 07:46PM

We are investigating increased error rates and increased delays in network propagation for newly launched EC2 instances in a single Availability Zone (use1-az2) in US-EAST-1 Region. EC2 System Status Check metrics may also be reporting “INSUFFICIENT_DATA” state for the affected instances. The issue does not affect networking connectivity for existing instances and we are working to resolve the issue.
