TL;DR: AWS has a documented pattern of outages concentrated in us-east-1, cascade failures across dependent services, and status page updates that lag actual impact by 30–90+ minutes. Engineering teams that rely on AWS's own status page to detect outages are flying blind. Third-party monitoring closes that gap.
If you've been running production workloads on AWS for more than a year, you've felt it: the 3 am PagerDuty alert, the scramble to check the AWS console, the frantic Slack thread asking, "Is this us or is this AWS?"
And then, minutes or hours later, the AWS Service Health Dashboard finally acknowledges what your users have been experiencing all along.
It happens because AWS is the backbone of modern infrastructure: by most estimates, around a third of the internet's infrastructure runs on it. That scale makes AWS outages consequential, and it makes AWS outage history worth studying, not to bash the vendor, but to build better incident response.
The December 7, 2021, outage remains the most widely discussed AWS incident in recent memory. It started in us-east-1 and triggered cascading failures across services that most engineering teams didn't expect to be interconnected.
What went wrong: an automated scaling activity on AWS's main network triggered unexpected behavior from clients on the internal network, overwhelming the network devices connecting the two networks and creating a congestion feedback loop. Amazon's own internal tools, including the tools they use to monitor and diagnose issues, were affected, which made identifying and resolving the issue significantly slower.
The blast radius was enormous, stretching across customer-facing services and, as noted above, AWS's own internal tooling.
The AWS Service Health Dashboard was slow to reflect the full scope of impact. By the time AWS posted a comprehensive update, engineering teams had already been firefighting for hours.
The November 25, 2020, outage started with a single service: Amazon Kinesis. That alone should have been a manageable incident. Instead, it exposed how deeply Kinesis is embedded in AWS's own service mesh.
Because Kinesis is used internally by AWS for logging and event processing, the failure cascaded into services most teams consider independent, including CloudWatch itself.
Teams relying on CloudWatch to detect the issue couldn't trust their own monitoring. This is the nightmare scenario: the tool you use to detect problems is itself broken.
Look at AWS's post-incident reports over the past decade, and a pattern emerges: us-east-1 (Northern Virginia) is disproportionately affected by major outages.
| Year | Region | Primary Service Affected | Approx. Duration |
|---|---|---|---|
| 2021 | us-east-1 | Networking / Internal tooling | ~7–11 hours |
| 2020 | us-east-1 | Kinesis + cascade | ~8–20 hours |
| 2017 | us-east-1 | S3 | ~4 hours |
| 2015 | us-east-1 | DynamoDB | ~3–5 hours |
| 2012 | us-east-1 | EC2, EBS (multiple events) | varies |
us-east-1 is AWS's oldest region and houses the highest density of infrastructure. When it fails, it fails hard, and it takes a lot of other things with it.
The Hard Truth: If your workloads are exclusively in us-east-1 and you haven't rehearsed a failover to us-east-2 or eu-west-1, you're not running a resilient architecture. You're running a single point of failure with a pretty UI.
AWS's Service Health Dashboard is a useful artifact. It is not a reliable real-time detection tool.
The problem is structural. AWS is a for-profit company with legal and reputational incentives to minimize the apparent scope of outages. Status page updates are reviewed before posting. That review takes time. During major incidents, internal communication channels are overwhelmed.
The result: engineering teams observing real customer impact often see the AWS status page showing "Service is operating normally" for 30 to 90+ minutes into an active outage.
Pro-Tip: Track how often you experience AWS impact before their status page updates. In most organizations, that number is higher than leadership realizes. That gap, between real impact and official acknowledgment, is your operational blind spot.
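One lightweight way to quantify that blind spot is to log, for each incident, when your own alerting first fired versus when the vendor's status page first acknowledged impact. A minimal sketch; the incident names and timestamps below are hypothetical, not from any real incident log:

```python
from datetime import datetime

def detection_gap_minutes(internal_alert: str, vendor_acknowledged: str) -> float:
    """Minutes between your first internal alert and the vendor's
    first status-page acknowledgment (ISO 8601 timestamps)."""
    delta = datetime.fromisoformat(vendor_acknowledged) - datetime.fromisoformat(internal_alert)
    return delta.total_seconds() / 60

# Hypothetical incident log: (name, first internal alert, first vendor update).
incidents = [
    ("kinesis-cascade", "2020-11-25T13:15:00", "2020-11-25T14:35:00"),
    ("useast1-network", "2021-12-07T15:35:00", "2021-12-07T16:22:00"),
]

gaps = {name: detection_gap_minutes(alert, ack) for name, alert, ack in incidents}
```

Reviewing that number quarterly turns the "operational blind spot" from a feeling into a metric leadership can see.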
For any workload where availability matters, design for at least two regions. us-east-1 + us-west-2 is the most common pairing. If you can't go full active-active, at least have a documented failover playbook that someone has actually tested in the past 90 days.
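The failover decision itself can be scripted so the playbook is executable rather than a wiki page. A minimal sketch, assuming hypothetical per-region health endpoints; the actual traffic shift (DNS, load balancer, feature flag) is left to your own tooling:

```python
import urllib.request

# Hypothetical health endpoints per region; replace with your own.
REGION_HEALTH = {
    "us-east-1": "https://health.us-east-1.example.com/ready",
    "us-west-2": "https://health.us-west-2.example.com/ready",
}

def region_is_healthy(region: str, timeout: float = 3.0) -> bool:
    """Probe a region's health endpoint; any error counts as unhealthy."""
    try:
        with urllib.request.urlopen(REGION_HEALTH[region], timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def choose_active_region(healthy, preferred: str = "us-east-1",
                         standbys: tuple = ("us-west-2",)) -> str:
    """Prefer the primary region; fail over to the first healthy standby.

    `healthy` is injected (e.g. `region_is_healthy`) so the decision
    logic can be exercised in drills without touching the network.
    """
    if healthy(preferred):
        return preferred
    for region in standbys:
        if healthy(region):
            return region
    raise RuntimeError("no healthy region available")
```

Injecting the health check makes the failover logic itself testable, which matters if the 90-day rehearsal is going to be more than a checkbox.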
The 2020 Kinesis outage should permanently change how you think about AWS-hosted monitoring. If you're relying solely on CloudWatch alarms to detect AWS-caused problems, you have a dependency problem: when AWS breaks, your ability to detect the break breaks too.
Complement AWS-native monitoring with external observability that doesn't run on the same infrastructure it's monitoring.
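As one concrete pattern, a tiny synthetic probe run from a non-AWS host (a cheap VPS, a CI runner) gives you a detection path that survives an AWS-internal failure. The flow names and URLs below are hypothetical placeholders:

```python
import urllib.error
import urllib.request

# Hypothetical user-facing flows to probe from OUTSIDE AWS.
PROBES = {
    "login":    "https://app.example.com/healthz/login",
    "checkout": "https://app.example.com/healthz/checkout",
}

def probe(url: str, timeout: float = 5.0) -> bool:
    """True if the endpoint answers 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError, OSError):
        return False

def failing_flows(probes: dict, check=probe) -> list:
    """Names of probes currently failing; page on any non-empty result.

    `check` is injectable so the logic can be tested without a network.
    """
    return [name for name, url in probes.items() if not check(url)]
```

Run this on a one-minute cron and page when `failing_flows` is non-empty; because it exercises real user flows, it catches AWS-caused breakage even when CloudWatch itself is degraded.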
IsDown monitors AWS status and 6,000+ other vendor status pages, combining official status data with crowdsourced user reports to detect outages before vendors officially acknowledge them. Pair that with PagerDuty integration and your on-call rotation gets vendor outage alerts in the same place as infrastructure alerts.
Most engineering teams dramatically underestimate how many internal systems depend on AWS services. When Kinesis fails, does your authentication break? Does your logging break? Does your deployment pipeline break?
Build and maintain a dependency map. When an AWS outage hits, you need to know within 60 seconds which of your systems are likely affected.
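A dependency map doesn't need to be sophisticated to be useful during an incident; even a version-controlled mapping from AWS service to internal systems answers "what's likely broken?" in seconds. Service and system names here are hypothetical examples:

```python
# Hypothetical mapping: AWS service -> internal systems that depend on it.
DEPENDENCY_MAP = {
    "kinesis":    ["event-ingest", "audit-logging"],
    "cloudwatch": ["alerting", "autoscaling-signals"],
    "s3":         ["asset-serving", "deploy-pipeline", "backups"],
    "dynamodb":   ["session-store"],
}

def likely_affected(aws_services: list) -> list:
    """Internal systems likely impacted when the given AWS services degrade."""
    hit = {system
           for service in aws_services
           for system in DEPENDENCY_MAP.get(service, [])}
    return sorted(hit)
```

The payoff comes when a vendor alert names the degraded services: one lookup tells the on-call engineer where to focus and whom to notify.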
"AWS is down" and "we are down" are different problems requiring different responses. "AWS is down" means: communicate to stakeholders, execute your runbook, wait for AWS to recover. "We are down" means: debug your code. Conflating the two wastes engineering time and creates noise at exactly the worst moment.
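That triage can be encoded as a first-pass decision rule: combine an external signal about the vendor with your own internal health checks. The labels and the two boolean inputs are assumptions for illustration, not a prescription:

```python
def triage(vendor_degraded: bool, internal_healthy: bool) -> str:
    """First-pass answer to the 3 am question: is this us, or is this AWS?

    vendor_degraded: external signal (third-party monitor, status page).
    internal_healthy: result of your own internal health checks.
    """
    if vendor_degraded and internal_healthy:
        return "vendor-degraded-we-are-fine"  # communicate and watch; don't debug
    if vendor_degraded and not internal_healthy:
        return "vendor-outage"                # runbook: stakeholders, failover, wait
    if not vendor_degraded and not internal_healthy:
        return "our-outage"                   # debug our own code and config
    return "all-clear"
```

It's deliberately crude, but having even this rule written down keeps the incident channel from debating classification while the clock runs.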
The Hard Truth: AWS's SLA credits rarely cover the actual business cost of downtime. A 10% monthly credit on a $50,000 bill is $5,000. The cost of one major incident, counting engineering hours, customer churn, and SLA penalties to your own customers, often exceeds that by an order of magnitude.
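The arithmetic above generalizes into a quick back-of-the-envelope check worth running with your own numbers. Credit tiers vary by service and by the uptime actually delivered, so the 10% figure and the incident inputs below are just illustrative assumptions:

```python
def sla_credit(monthly_bill: float, credit_pct: float = 0.10) -> float:
    """Credit issued: a percentage of that service's monthly spend."""
    return monthly_bill * credit_pct

def downtime_cost(engineer_hours: float, hourly_rate: float,
                  churn_and_penalties: float) -> float:
    """Rough business cost of an incident: labor plus customer-side losses."""
    return engineer_hours * hourly_rate + churn_and_penalties

credit = sla_credit(50_000)  # the $5,000 example from the text
# Hypothetical incident: 20 engineers x 8 hours at $120/h, plus $30k churn/penalties.
cost = downtime_cost(20 * 8, 120, 30_000)
```

Even with conservative inputs, the gap between `credit` and `cost` is usually the strongest internal argument for investing in earlier detection and rehearsed failover.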
AWS experiences some form of service degradation or outage multiple times per month across its global infrastructure. Major, widely felt outages affecting core services in us-east-1 have occurred roughly once to twice per year over the past five years. Minor, service-specific incidents are significantly more frequent.
Based on the historical record of major incidents, us-east-1 carries elevated risk: it has been the primary affected region in the most consequential AWS outages. Multi-region architectures that can fail away from us-east-1 have meaningfully better real-world resilience.
Third-party monitoring services aggregate signals from AWS's status page and other sources to detect degradation earlier than AWS typically acknowledges it. Combining third-party monitoring with your own synthetic monitoring, particularly synthetics that test user flows touching AWS-backed services, gives you the earliest possible detection window.
A vendor-outage runbook should include, at minimum: a dependency map showing which internal systems rely on which AWS services, escalation contacts, customer communication templates pre-approved for vendor outages, failover procedures if applicable, and a decision tree for distinguishing "AWS is degraded but we're fine" from "AWS degradation is impacting our users." Test this runbook at least quarterly.
AWS offers SLA credits for many services when uptime falls below the SLA threshold for a billing month. However, claiming credits requires detecting the issue, documenting the impact, and filing within AWS's specified timeframe. Credits are calculated against your service spend for that period, not against the business cost of the outage.
Nuno Tomas
Founder of IsDown