Outage in AWS

Increased API Error Rates - N. Virginia

Resolved Minor
July 30, 2024 - Started about 2 months ago - Lasted about 6 hours

Outage Details

We are seeing increased error rates and latencies for some service APIs within the US-EAST-1 region.
Components affected
AWS Organizations, Amazon OpenSearch Service (us-east-1), AWS IAM Identity Center, AWS IAM, Amazon Managed Workflows for Apache Airflow (us-east-1), AWS IAM Identity Center (us-east-1), Amazon OpenSearch Service, Amazon Managed Workflows for Apache Airflow, Amazon Location Service (us-east-1), AWS Glue (us-east-1), Amazon Location Service, AWS Glue, AWS HealthOmics, AWS HealthOmics (us-east-1), AWS EB, AWS EB (us-east-1), Amazon DocumentDB, Amazon DocumentDB (us-east-1), AWS Outposts, AWS Outposts (us-east-1), Amazon ElastiCache, Amazon ElastiCache (us-east-1), Amazon EMR (us-east-1), AWS IoT Device Defender (us-east-1), AWS IoT Device Defender, Amazon EMR, AWS IoT TwinMaker, AWS IoT SiteWise (us-east-1), AWS IoT TwinMaker (us-east-1), Amazon EMR Serverless (us-east-1), AWS IoT SiteWise, Amazon EMR Serverless, AWS AppFabric (us-east-1), AWS AppFabric, AWS App Runner (us-east-1), AWS App Runner, Amazon Kinesis Firehose, Amazon Kinesis Data Streams (us-east-1), Amazon ECS, Amazon Kinesis Firehose (us-east-1), Amazon Kinesis Data Streams, Amazon API Gateway (us-east-1), Amazon CloudWatch, AWS CodeBuild, AWS CodeBuild (us-east-1), Amazon CloudWatch (us-east-1), Amazon ECS (us-east-1), Amazon API Gateway, Amazon Quantum Ledger Database, Amazon Quantum Ledger Database (us-east-1), Amazon SQS (us-east-1), AWS DataSync (us-east-1), AWS Application Migration Service, AWS Application Migration Service (us-east-1), Amazon SQS, AWS DataSync, Amazon Personalize, Amazon Personalize (us-east-1), AWS Client VPN, AWS Client VPN (us-east-1), AWS IAM Roles Anywhere, AWS IAM Roles Anywhere (us-east-1), Amazon S3 (us-east-1), Amazon Redshift, Amazon Redshift (us-east-1), Amazon S3, AWS CloudTrail (us-east-1), AWS CloudTrail, Amazon CloudSearch (us-east-1), AWS IoT Analytics, AWS IoT Analytics (us-east-1), Amazon CloudSearch, Amazon WorkSpaces (us-east-1), AWS CloudShell (us-east-1), Amazon WorkSpaces, AWS Cloud9, AWS Cloud9 (us-east-1), Amazon AppStream 2.0, Amazon AppStream 2.0 (us-east-1), AWS CloudShell, AWS License Manager, AWS License Manager (us-east-1), AWS Elemental (us-east-1), AWS Elemental, Amazon Managed Streaming for Apache Kafka (us-east-1), Amazon Managed Streaming for Apache Kafka, AWS Resource Groups, Multiple services (us-east-1), Multiple services, AWS Transfer Family, AWS Transfer Family (us-east-1), Amazon Connect (us-east-1), AWS IoT Events (us-east-1), Amazon Connect, AWS IoT Events, AWS Step Functions (us-east-1), AWS IoT Device Management (us-east-1), AWS Step Functions, AWS IoT Device Management, AWS CloudHSM, Amazon FSx (us-east-1), AWS CloudHSM (us-east-1), Amazon FSx, AWS Control Tower (us-east-1), AWS Control Tower, AWS CloudFormation, AWS CloudFormation (us-east-1), EC2 Image Builder, EC2 Image Builder (us-east-1), Amazon EKS, Amazon EKS (us-east-1), Amazon Managed Grafana, Amazon Managed Grafana (us-east-1), Amazon Managed Service for Prometheus, Amazon Kinesis Analytics, Amazon Kinesis Analytics (us-east-1), Amazon Managed Service for Prometheus (us-east-1), AWS Lambda, AWS Lambda (us-east-1), AWS Batch (us-east-1), AWS Batch, Amazon SageMaker (us-east-1), Amazon Bedrock (us-east-1), Amazon CloudFront, Amazon SageMaker, Amazon Bedrock
Latest Updates (sorted most recent first)
UPDATE about 2 months ago - at 07/31/2024 04:32AM

Kinesis Data Streams and CloudWatch Logs error rates have fully recovered and are operating normally within the US-EAST-1 Region. Other services, including ECS Fargate, API Gateway, and Lambda, have also recovered. While we would expect recovery for the vast majority of customer applications, we’re continuing to work towards full recovery.

UPDATE about 2 months ago - at 07/31/2024 03:01AM

We are seeing significant recovery for most AWS Services at this stage. While we are not yet fully recovered, most AWS Services are observing recovery. We are seeing full recovery for Fargate launches at this time. As we recover we expect to see new CloudWatch logs showing as they become available. We continue to work toward full recovery for remaining AWS Services. We continue to expect full recovery to be within the next 2 hours.

UPDATE about 2 months ago - at 07/31/2024 01:59AM

We continue to work toward recovery, though progress is occurring slower than originally anticipated. We are seeing some improvements internally, though they may not be visible externally. Some services (like CloudWatch Logs) may not observe recovery until we have fully resolved the underlying issue within the Kinesis subsystem. In parallel to our mitigation efforts, we are actively working to speed up the recovery process. At this time, we still expect full recovery to be 1-2 hours away. We will continue to share updates as we have additional information to share, or within the next 60 minutes.

UPDATE about 2 months ago - at 07/31/2024 01:00AM

We continue to work on resolving the increased error rates and latencies for Kinesis APIs in the US-EAST-1 Region. We wanted to provide you with more details on what is causing the issue. Starting at 2:45 PM PDT, a subsystem within Kinesis began to experience increased contention when processing incoming data. While this had limited impact for most customer workloads, it did cause some internal AWS services, including CloudWatch, ECS Fargate, and API Gateway, to experience downstream impact. Engineers have identified the root cause of the issue affecting Kinesis and are working to address the contention. While we are making progress, we expect it to take 2-3 hours to fully resolve.

As a result of this issue, CloudWatch logs is experiencing increased error rates and latencies when processing incoming logs. Any customer using the CloudWatch logs APIs may experience elevated errors. CloudWatch metrics extraction from these logs may be delayed and alarms may transition into "INSUFFICIENT_DATA" state if set on delayed metrics.
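The alarm behaviour described above can be illustrated with a toy model. This is only a sketch: `evaluate_alarm` is a hypothetical helper, not a CloudWatch API, and it simplifies how partially missing windows are judged. The policy names mirror CloudWatch's documented `TreatMissingData` options (`missing`, `notBreaching`, `breaching`, `ignore`), and `None` stands for a datapoint that has not arrived because the underlying logs are delayed.

```python
from typing import Optional, Sequence


def evaluate_alarm(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    treat_missing: str = "missing",
    current_state: str = "OK",
) -> str:
    """Toy model of a greater-than-threshold alarm over one evaluation window.

    None marks a period whose metric datapoint has not arrived yet.
    """
    if all(d is None for d in datapoints):
        # Nothing arrived in the whole window: the outcome depends only on the policy.
        return {
            "missing": "INSUFFICIENT_DATA",
            "notBreaching": "OK",
            "breaching": "ALARM",
            "ignore": current_state,
        }[treat_missing]

    # Partial data (a simplification): fill gaps according to the policy, or skip them.
    breaches = []
    for d in datapoints:
        if d is not None:
            breaches.append(d > threshold)
        elif treat_missing == "breaching":
            breaches.append(True)
        elif treat_missing == "notBreaching":
            breaches.append(False)
        # "missing" and "ignore": skip the gap and judge the datapoints that did arrive.
    return "ALARM" if all(breaches) else "OK"


delayed = [None, None, None]  # metrics extracted from logs have not arrived
print(evaluate_alarm(delayed, threshold=5))                                # INSUFFICIENT_DATA
print(evaluate_alarm(delayed, threshold=5, treat_missing="notBreaching"))  # OK
print(evaluate_alarm([1, None, 2], threshold=5))                           # OK
```

Under the default `missing` policy, a window with no datapoints yields `INSUFFICIENT_DATA`, which matches the transition the update warns about for alarms set on delayed metrics.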

ECS Fargate is experiencing failures when attempting to launch new tasks, also because of a dependency on CloudWatch logs. We are currently working on a change to remove this dependency and have also taken steps to reduce the likelihood of task retirement.

API Gateway continues to process requests correctly but is seeing errors when sending logs to CloudWatch. Some customers may also experience errors when using Lambda with API Gateway, but we believe this is related to failures within the Lambda function code itself, such as attempts to invoke CloudWatch logs APIs.
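To illustrate the failure mode mentioned above, here is a minimal sketch of a function that treats its own logging call as best-effort. Everything in it is hypothetical: `handler` is a plain Python function rather than a real Lambda entry point, and `send_log` stands in for whatever direct logging API call the function code makes. The point is only that an uncaught exception from such a call would fail the whole invocation, while a caught one would not.

```python
def handler(event, send_log):
    """Hypothetical Lambda-style handler; send_log stands in for a direct logging API call."""
    result = {"statusCode": 200, "body": "ok"}
    try:
        send_log(f"processed {event!r}")
    except Exception as exc:
        # Logging is best-effort here: note the failure and still return the response.
        result["log_error"] = type(exc).__name__
    return result


def failing_send_log(message):
    # Simulates the logging API rejecting calls during the incident.
    raise ConnectionError("logs endpoint unavailable")


print(handler({"path": "/health"}, failing_send_log))
# {'statusCode': 200, 'body': 'ok', 'log_error': 'ConnectionError'}
```

Whether swallowing a logging error is acceptable depends on the workload; for audit-critical logs, failing the request may be the intended behaviour.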

AWS Lambda continues to process invocations correctly but is unable to send logs to CloudWatch logs. As a result, customers may not be able to see the logs of their asynchronous Lambda invocations.

We have also seen periods of elevated failures with IAM Identity Center and Organizations as a result of this issue.

We will continue to provide updates every 30-60 minutes, or sooner if we have additional information to share.

UPDATE about 2 months ago - at 07/30/2024 11:58PM

We continue to work on resolving the increased error rates and latencies for Kinesis APIs in the US-EAST-1 Region. We have identified the root cause and are actively working on multiple parallel paths to mitigate the issue. As a result of this issue, CloudWatch logs continues to see delayed log delivery but metrics continue to operate normally. Some customers may also be experiencing elevated failures with IAM Identity Center and Organizations as a result of this issue. We will continue to provide updates as we make progress.

UPDATE about 2 months ago - at 07/30/2024 10:59PM

We can confirm increased error rates and latencies for Kinesis APIs within the US-EAST-1 Region. We have identified the root cause and are actively working to resolve the issue. As a result of this issue, other services, such as CloudWatch, are also experiencing increased error rates and delayed CloudWatch log delivery. We will continue to keep you updated as we make progress in resolving the issue.

UPDATE about 2 months ago - at 07/30/2024 10:40PM

We are seeing increased error rates and latencies for some service APIs within the US-EAST-1 region.
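The update timestamps above carry no time zone, while the root-cause update gives the start of impact as 2:45 PM PDT. The sketch below reconciles the two under two assumptions that the page does not state: that the update timestamps are in UTC, and that PDT is UTC-7.

```python
from datetime import datetime, timedelta, timezone

# Assumptions (not stated on the page): PDT is UTC-7, update timestamps are UTC.
PDT = timezone(timedelta(hours=-7))

impact_start = datetime(2024, 7, 30, 14, 45, tzinfo=PDT)            # "2:45 PM PDT"
first_update = datetime(2024, 7, 30, 22, 40, tzinfo=timezone.utc)   # 07/30/2024 10:40PM
last_update = datetime(2024, 7, 31, 4, 32, tzinfo=timezone.utc)     # 07/31/2024 04:32AM

start_utc = impact_start.astimezone(timezone.utc)
print(start_utc.strftime("%m/%d/%Y %H:%M UTC"))  # 07/30/2024 21:45 UTC
print(first_update - start_utc)                  # 0:55:00
print(last_update - first_update)                # 5:52:00
```

Under those assumptions, the first public update came 55 minutes after impact began, and the 5 hours 52 minutes between the first and last updates is consistent with the "about 6 hours" duration shown in the header.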

