Outage in AWS

Increased API Error Rates - N. Virginia

Resolved Minor
July 30, 2024 - Started over 1 year ago - Lasted about 6 hours

Incident Report

We are seeing increased error rates and latencies for some service APIs within the US-EAST-1 region.
Components affected
Amazon API Gateway (us-east-1), Amazon AppStream 2.0 (us-east-1), Amazon CloudFront, Amazon CloudSearch (us-east-1), Amazon CloudWatch (us-east-1), Amazon Connect (us-east-1), Amazon DocumentDB (us-east-1), Amazon ECS (us-east-1), Amazon EKS (us-east-1), Amazon EMR (us-east-1), Amazon ElastiCache (us-east-1), Amazon FSx (us-east-1), Amazon Kinesis Analytics (us-east-1), Amazon Kinesis Data Streams (us-east-1), Amazon Kinesis Firehose (us-east-1), Amazon Location Service (us-east-1), Amazon Managed Grafana (us-east-1), Amazon Managed Service for Prometheus (us-east-1), Amazon Managed Streaming for Apache Kafka (us-east-1), Amazon Managed Workflows for Apache Airflow (us-east-1), Amazon OpenSearch Service (us-east-1), Amazon Personalize (us-east-1), Amazon Quantum Ledger Database (us-east-1), Amazon Redshift (us-east-1), Amazon SageMaker (us-east-1), Amazon SQS (us-east-1), Amazon S3 (us-east-1), Amazon WorkSpaces (us-east-1), AWS App Runner (us-east-1), AWS Application Migration Service (us-east-1), AWS Batch (us-east-1), AWS Client VPN (us-east-1), AWS Cloud9 (us-east-1), AWS CloudFormation (us-east-1), AWS CloudHSM (us-east-1), AWS CloudShell (us-east-1), AWS CloudTrail (us-east-1), AWS CodeBuild (us-east-1), AWS Control Tower (us-east-1), AWS DataSync (us-east-1), AWS EB (us-east-1), AWS Elemental (us-east-1), AWS Glue (us-east-1), AWS IAM, AWS IoT Analytics (us-east-1), AWS IoT Device Defender (us-east-1), AWS IoT Device Management (us-east-1), AWS IoT Events (us-east-1), AWS IoT SiteWise (us-east-1), AWS IoT TwinMaker (us-east-1), AWS Lambda (us-east-1), AWS License Manager (us-east-1), AWS Organizations, AWS Outposts (us-east-1), AWS Resource Groups, AWS Step Functions (us-east-1), EC2 Image Builder (us-east-1), Multiple services (us-east-1), AWS AppFabric (us-east-1), Amazon Bedrock (us-east-1), Amazon EMR Serverless (us-east-1), AWS IAM Identity Center (us-east-1), AWS IAM Roles Anywhere (us-east-1), AWS HealthOmics (us-east-1), AWS Transfer Family (us-east-1)


Latest Updates (most recent first)
UPDATE over 1 year ago - at 07/31/2024 04:32AM

Kinesis Data Streams and CloudWatch Logs error rates have fully recovered, and both services are operating normally within the US-EAST-1 Region. Other services, including ECS Fargate, API Gateway, and Lambda, have also recovered. While we would expect recovery for the vast majority of customer applications, we’re continuing to work towards full recovery.

UPDATE over 1 year ago - at 07/31/2024 03:01AM

We are seeing significant recovery for most AWS Services at this stage. While we are not yet fully recovered, most AWS Services are observing recovery. We are seeing full recovery for Fargate launches at this time. As we recover we expect to see new CloudWatch logs showing as they become available. We continue to work toward full recovery for remaining AWS Services. We continue to expect full recovery to be within the next 2 hours.

UPDATE over 1 year ago - at 07/31/2024 01:59AM

We continue to work toward recovery, though progress is occurring slower than originally anticipated. We are seeing some improvements internally, though they may not be visible externally. Some services (like CloudWatch Logs) may not observe recovery until we have fully resolved the underlying issue within the Kinesis subsystem. In parallel to our mitigation efforts, we are actively working to speed up the recovery process. At this time, we still expect full recovery to be 1-2 hours away. We will continue to share updates as we have additional information to share, or within the next 60 minutes.

UPDATE over 1 year ago - at 07/31/2024 01:00AM

We continue to work on resolving the increased error rates and latencies for Kinesis APIs in the US-EAST-1 Region. We wanted to provide you with more details on what is causing the issue. Starting at 2:45 PM PDT, a subsystem within Kinesis began to experience increased contention when processing incoming data. While this had limited impact for most customer workloads, it did cause some internal AWS services, including CloudWatch, ECS Fargate, and API Gateway, to experience downstream impact. Engineers have identified the root cause of the issue affecting Kinesis and are working to address the contention. While we are making progress, we expect it to take 2-3 hours to fully resolve.

As a result of this issue, CloudWatch Logs is experiencing increased error rates and latencies when processing incoming logs. Any customer using the CloudWatch Logs APIs may experience elevated errors. CloudWatch metrics extraction from these logs may be delayed, and alarms may transition into the "INSUFFICIENT_DATA" state if set on delayed metrics.
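For readers who want to check which of their alarms are affected in this way, the following is a minimal sketch, not part of the AWS statement. It assumes a boto3 CloudWatch client is passed in; the function name `alarms_with_insufficient_data` is made up for illustration. It relies on the `StateValue` filter of the `describe_alarms` operation, and it may itself see errors while CloudWatch APIs are degraded.

```python
def alarms_with_insufficient_data(cloudwatch):
    """Return the names of metric alarms currently in INSUFFICIENT_DATA.

    `cloudwatch` is assumed to be a boto3 CloudWatch client, e.g.
    boto3.client("cloudwatch", region_name="us-east-1").
    """
    names = []
    # describe_alarms is paginated; the paginator follows NextToken for us.
    paginator = cloudwatch.get_paginator("describe_alarms")
    for page in paginator.paginate(StateValue="INSUFFICIENT_DATA"):
        # Only metric alarms are collected here; composite alarms are ignored.
        for alarm in page.get("MetricAlarms", []):
            names.append(alarm["AlarmName"])
    return names
```

The client is injected rather than created inside the function so the sketch can be exercised with a stub when no AWS credentials are available.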

ECS Fargate is experiencing failures when attempting to launch new tasks, also because of a dependency on CloudWatch logs. We are currently working on a change to remove this dependency and have also taken steps to reduce the likelihood of task retirement.

API Gateway continues to process requests correctly but is seeing errors when sending logs to CloudWatch. Some customers may also experience errors when using Lambda with API Gateway, but we believe this is related to failures within the Lambda function code itself, such as attempts to invoke CloudWatch Logs APIs.

AWS Lambda continues to process invocations correctly but is unable to send logs to CloudWatch Logs. As a result, customers may not be able to see the logs of their asynchronous Lambda invocations.
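The failure mode described above, where function code fails because it calls a logging API directly, can be illustrated with a small sketch. This is not AWS guidance; it only assumes that the log call is non-critical to the request. The names `best_effort`, `handler`, and the injected `put_log_events` callable are hypothetical stand-ins so the example runs without AWS access.

```python
import logging

logger = logging.getLogger(__name__)


def best_effort(log_call, *args, **kwargs):
    """Run a logging side effect and swallow its failure.

    Returns True if the call succeeded, False otherwise.
    """
    try:
        log_call(*args, **kwargs)
        return True
    except Exception:  # deliberately broad: logging is treated as non-critical
        logger.warning("log delivery failed; continuing", exc_info=True)
        return False


def handler(event, context, put_log_events=None):
    # Hypothetical business logic: echo the request body back.
    result = {"statusCode": 200, "body": event.get("body", "")}

    # put_log_events stands in for a direct CloudWatch Logs API call
    # (hypothetical; injected so the sketch runs without AWS access).
    if put_log_events is not None:
        best_effort(put_log_events, message=f"handled {event.get('id')}")

    return result
```

With this shape, an exception raised by the log call is recorded locally and the handler still returns its response, at the cost of silently losing that log entry.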

We have also seen periods of elevated failures with IAM Identity Center and Organizations as a result of this issue.

We will continue to provide updates every 30-60 minutes, or sooner if we have additional information to share.

UPDATE over 1 year ago - at 07/30/2024 11:58PM

We continue to work on resolving the increased error rates and latencies for Kinesis APIs in the US-EAST-1 Region. We have identified the root cause and are actively working on multiple parallel paths to mitigate the issue. As a result of this issue, CloudWatch Logs continues to see delayed log delivery, but metrics continue to operate normally. Some customers may also be experiencing elevated failures with IAM Identity Center and Organizations as a result of this issue. We will continue to provide updates as we make progress.

UPDATE over 1 year ago - at 07/30/2024 10:59PM

We can confirm increased error rates and latencies for Kinesis APIs within the US-EAST-1 Region. We have identified the root cause and are actively working to resolve the issue. As a result of this issue, other services, such as CloudWatch, are also experiencing increased error rates and delayed CloudWatch log delivery. We will continue to keep you updated as we make progress in resolving the issue.

UPDATE over 1 year ago - at 07/30/2024 10:40PM

We are seeing increased error rates and latencies for some service APIs within the US-EAST-1 region.

