TL;DR: AWS has a documented pattern of outages concentrated in us-east-1, cascade failures across dependent services, and status page updates that lag actual impact by 30–90+ minutes. Engineering teams that rely on AWS's own status page to detect outages are flying blind. Third-party monitoring closes that gap.
If you've been running production workloads on AWS for more than a year, you've felt it: the 3 am PagerDuty alert, the scramble to check the AWS console, the frantic Slack thread asking, "Is this us or is this AWS?"
And then, minutes or hours later, the AWS Service Health Dashboard finally acknowledges what your users have been experiencing all along.
It happens because AWS is the backbone of modern infrastructure: by most estimates, around a third of the internet's infrastructure runs on it. That scale makes AWS outages consequential, and it makes AWS outage history worth studying, not to bash the vendor, but to build better incident response.
The December 7, 2021, outage remains the most widely discussed AWS incident in recent memory. It started in us-east-1 and triggered cascading failures across services that most engineering teams didn't expect to be interconnected.
What went wrong: an automated scaling activity on AWS's main network triggered unexpected behavior from clients on the internal network, overwhelming the network devices connecting the two networks and creating a congestion feedback loop. Amazon's own internal tools, including the tools they use to monitor and diagnose issues, were affected, which made identifying and resolving the issue significantly slower.
The blast radius was enormous, stretching across customer-facing services and, as noted above, AWS's own internal tooling.
The AWS Service Health Dashboard was slow to reflect the full scope of impact. By the time AWS posted a comprehensive update, engineering teams had already been firefighting for hours.
The November 25, 2020, outage started with a single service: Amazon Kinesis. That alone should have been a manageable incident. Instead, it exposed how deeply Kinesis is embedded in AWS's own service mesh.
Because Kinesis is used internally by AWS for logging and event processing, the failure cascaded into services most teams consider independent, including CloudWatch itself.
Teams relying on CloudWatch to detect the issue couldn't trust their own monitoring. This is the nightmare scenario: the tool you use to detect problems is itself broken.
Look at AWS's post-incident reports over the past decade, and a pattern emerges: us-east-1 (Northern Virginia) is disproportionately affected by major outages.
| Year | Region | Primary Service Affected | Approx. Duration |
|---|---|---|---|
| 2021 | us-east-1 | Networking / Internal tooling | ~7–11 hours |
| 2020 | us-east-1 | Kinesis + cascade | ~8–20 hours |
| 2017 | us-east-1 | S3 | ~4 hours |
| 2015 | us-east-1 | DynamoDB | ~3–5 hours |
| 2012 | us-east-1 | EC2, EBS (multiple events) | varies |
us-east-1 is AWS's oldest region and houses the highest density of infrastructure. When it fails, it fails hard, and it takes a lot of other things with it.
The Hard Truth: If your workloads are exclusively in us-east-1 and you haven't rehearsed a failover to us-east-2 or eu-west-1, you're not running a resilient architecture. You're running a single point of failure with a pretty UI.
AWS's Service Health Dashboard is a useful artifact. It is not a reliable real-time detection tool.
The problem is structural. AWS is a for-profit company with legal and reputational incentives to minimize the apparent scope of outages. Status page updates are reviewed before posting. That review takes time. During major incidents, internal communication channels are overwhelmed.
The result: engineering teams observing real customer impact often see the AWS status page showing "Service is operating normally" for 30 to 90+ minutes into an active outage.
Pro-Tip: Track how often you experience AWS impact before their status page updates. In most organizations, that number is higher than leadership realizes. That gap, between real impact and official acknowledgment, is your operational blind spot.
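One lightweight way to quantify that blind spot is to log, for each incident, when your own alerting first fired versus when the vendor's status page first acknowledged impact. A minimal sketch; the incident names and timestamps below are hypothetical, not from any real incident log:

```python
from datetime import datetime

def detection_gap_minutes(internal_alert: str, vendor_acknowledged: str) -> float:
    """Minutes between your first internal alert and the vendor's
    first status-page acknowledgment (ISO 8601 timestamps)."""
    delta = datetime.fromisoformat(vendor_acknowledged) - datetime.fromisoformat(internal_alert)
    return delta.total_seconds() / 60

# Hypothetical incident log: (name, first internal alert, first vendor update).
incidents = [
    ("kinesis-cascade", "2020-11-25T13:15:00", "2020-11-25T14:35:00"),
    ("useast1-network", "2021-12-07T15:35:00", "2021-12-07T16:22:00"),
]

gaps = {name: detection_gap_minutes(alert, ack) for name, alert, ack in incidents}
```

Reviewing that number quarterly turns the "operational blind spot" from a feeling into a metric leadership can see.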
For any workload where availability matters, design for at least two regions. us-east-1 + us-west-2 is the most common pairing. If you can't go full active-active, at least have a documented failover playbook that someone has actually tested in the past 90 days.
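The failover decision itself can be scripted so the playbook is executable rather than a wiki page. A minimal sketch, assuming hypothetical per-region health endpoints; the actual traffic shift (DNS, load balancer, feature flag) is left to your own tooling:

```python
import urllib.request

# Hypothetical health endpoints per region; replace with your own.
REGION_HEALTH = {
    "us-east-1": "https://health.us-east-1.example.com/ready",
    "us-west-2": "https://health.us-west-2.example.com/ready",
}

def region_is_healthy(region: str, timeout: float = 3.0) -> bool:
    """Probe a region's health endpoint; any error counts as unhealthy."""
    try:
        with urllib.request.urlopen(REGION_HEALTH[region], timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def choose_active_region(healthy, preferred: str = "us-east-1",
                         standbys: tuple = ("us-west-2",)) -> str:
    """Prefer the primary region; fail over to the first healthy standby.

    `healthy` is injected (e.g. `region_is_healthy`) so the decision
    logic can be exercised in drills without touching the network.
    """
    if healthy(preferred):
        return preferred
    for region in standbys:
        if healthy(region):
            return region
    raise RuntimeError("no healthy region available")
```

Injecting the health check makes the failover logic itself testable, which matters if the 90-day rehearsal is going to be more than a checkbox.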
The 2020 Kinesis outage should permanently change how you think about AWS-hosted monitoring. If you're relying solely on CloudWatch alarms to detect AWS-caused problems, you have a dependency problem: when AWS breaks, your ability to detect the break breaks too.
Complement AWS-native monitoring with external observability that doesn't run on the same infrastructure it's monitoring.
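As one concrete pattern, a tiny synthetic probe run from a non-AWS host (a cheap VPS, a CI runner) gives you a detection path that survives an AWS-internal failure. The flow names and URLs below are hypothetical placeholders:

```python
import urllib.error
import urllib.request

# Hypothetical user-facing flows to probe from OUTSIDE AWS.
PROBES = {
    "login":    "https://app.example.com/healthz/login",
    "checkout": "https://app.example.com/healthz/checkout",
}

def probe(url: str, timeout: float = 5.0) -> bool:
    """True if the endpoint answers 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError, OSError):
        return False

def failing_flows(probes: dict, check=probe) -> list:
    """Names of probes currently failing; page on any non-empty result.

    `check` is injectable so the logic can be tested without a network.
    """
    return [name for name, url in probes.items() if not check(url)]
```

Run this on a one-minute cron and page when `failing_flows` is non-empty; because it exercises real user flows, it catches AWS-caused breakage even when CloudWatch itself is degraded.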
IsDown monitors AWS status and 6,000+ other vendor status pages, combining official status data with crowdsourced user reports to detect outages before vendors officially acknowledge them. Pair that with PagerDuty integration and your on-call rotation gets vendor outage alerts in the same place as infrastructure alerts.
Most engineering teams dramatically underestimate how many internal systems depend on AWS services. When Kinesis fails, does your authentication break? Does your logging break? Does your deployment pipeline break?
Build and maintain a dependency map. When an AWS outage hits, you need to know within 60 seconds which of your systems are likely affected.
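A dependency map doesn't need to be sophisticated to be useful during an incident; even a version-controlled mapping from AWS service to internal systems answers "what's likely broken?" in seconds. Service and system names here are hypothetical examples:

```python
# Hypothetical mapping: AWS service -> internal systems that depend on it.
DEPENDENCY_MAP = {
    "kinesis":    ["event-ingest", "audit-logging"],
    "cloudwatch": ["alerting", "autoscaling-signals"],
    "s3":         ["asset-serving", "deploy-pipeline", "backups"],
    "dynamodb":   ["session-store"],
}

def likely_affected(aws_services: list) -> list:
    """Internal systems likely impacted when the given AWS services degrade."""
    hit = {system
           for service in aws_services
           for system in DEPENDENCY_MAP.get(service, [])}
    return sorted(hit)
```

The payoff comes when a vendor alert names the degraded services: one lookup tells the on-call engineer where to focus and whom to notify.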
"AWS is down" and "we are down" are different problems requiring different responses. "AWS is down" means: communicate to stakeholders, execute your runbook, wait for AWS to recover. "We are down" means: debug your code. Conflating the two wastes engineering time and creates noise at exactly the worst moment.
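That triage can be encoded as a first-pass decision rule: combine an external signal about the vendor with your own internal health checks. The labels and the two boolean inputs are assumptions for illustration, not a prescription:

```python
def triage(vendor_degraded: bool, internal_healthy: bool) -> str:
    """First-pass answer to the 3 am question: is this us, or is this AWS?

    vendor_degraded: external signal (third-party monitor, status page).
    internal_healthy: result of your own internal health checks.
    """
    if vendor_degraded and internal_healthy:
        return "vendor-degraded-we-are-fine"  # communicate and watch; don't debug
    if vendor_degraded and not internal_healthy:
        return "vendor-outage"                # runbook: stakeholders, failover, wait
    if not vendor_degraded and not internal_healthy:
        return "our-outage"                   # debug our own code and config
    return "all-clear"
```

It's deliberately crude, but having even this rule written down keeps the incident channel from debating classification while the clock runs.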
The Hard Truth: AWS's SLA credits rarely cover the actual business cost of downtime. A 10% monthly credit on a $50,000 bill is $5,000. The cost of one major incident, counting engineering hours, customer churn, and SLA penalties to your own customers, often exceeds that by an order of magnitude.
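The arithmetic above generalizes into a quick back-of-the-envelope check worth running with your own numbers. Credit tiers vary by service and by the uptime actually delivered, so the 10% figure and the incident inputs below are just illustrative assumptions:

```python
def sla_credit(monthly_bill: float, credit_pct: float = 0.10) -> float:
    """Credit issued: a percentage of that service's monthly spend."""
    return monthly_bill * credit_pct

def downtime_cost(engineer_hours: float, hourly_rate: float,
                  churn_and_penalties: float) -> float:
    """Rough business cost of an incident: labor plus customer-side losses."""
    return engineer_hours * hourly_rate + churn_and_penalties

credit = sla_credit(50_000)  # the $5,000 example from the text
# Hypothetical incident: 20 engineers x 8 hours at $120/h, plus $30k churn/penalties.
cost = downtime_cost(20 * 8, 120, 30_000)
```

Even with conservative inputs, the gap between `credit` and `cost` is usually the strongest internal argument for investing in earlier detection and rehearsed failover.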
AWS experiences some form of service degradation or outage multiple times per month across its global infrastructure. Major, widely felt outages affecting core services in us-east-1 have occurred roughly once to twice per year over the past five years. Minor, service-specific incidents are significantly more frequent.
Based on the historical record of major incidents, us-east-1 carries elevated risk: it has been the primary affected region in the most consequential AWS outages. Multi-region architectures that can fail away from us-east-1 have meaningfully better real-world resilience.
Third-party monitoring services aggregate signals from AWS's status page and other sources to detect degradation earlier than AWS typically acknowledges it. Combining third-party monitoring with your own synthetic monitoring, particularly synthetics that test user flows touching AWS-backed services, gives you the earliest possible detection window.
A vendor-outage runbook should include, at minimum: a dependency map showing which internal systems rely on which AWS services, escalation contacts, customer communication templates pre-approved for vendor outages, failover procedures if applicable, and a decision tree for distinguishing "AWS is degraded but we're fine" from "AWS degradation is impacting our users." Test this runbook at least quarterly.
AWS offers SLA credits for many services when uptime falls below the SLA threshold for a billing month. However, claiming credits requires detecting the issue, documenting the impact, and filing within AWS's specified timeframe. Credits are calculated against your service spend for that period, not against the business cost of the outage.
Nuno Tomas
Founder of IsDown