
Cloud Outage History: Six Years of Recurring Failures

Published on May 13, 2026.

TL;DR: From AWS Kinesis cascading in 2020 to CrowdStrike taking down 8.5 million Windows machines in 2024, the same patterns repeat: single points of failure, status pages that lag reality by hours, and engineering teams finding out from users instead of monitoring. Here's the full timeline, what went wrong, and why independent monitoring is the only reliable fix.

Cloud infrastructure has never been more reliable in theory. In practice, the last six years of cloud outage history have delivered some of the most disruptive incidents on record. Not because cloud providers got worse, but because the systems built on top of them got larger, more interconnected, and more brittle in ways that don't show up until everything breaks at once.

The Incidents: A Timeline

November 2020 — AWS Kinesis Takes Down Half the Internet

On November 25, 2020, AWS added capacity to Kinesis Data Streams in us-east-1. The change triggered a memory pressure issue in the front-end fleet. What followed was a cascade: Kinesis is the backbone for AWS's own internal telemetry pipeline. Services that depended on Kinesis for their own health reporting couldn't report. Cognito went down. CloudWatch went down. The AWS Console became unreliable. The outage lasted approximately 17 hours. The AWS status page showed green for most of it.

What went wrong:

  • Observability as a dependency: Kinesis was a single point of failure for AWS's internal observability stack
  • Self-referential failure: The system used to detect problems was itself broken
  • No independent signal: Third-party services downstream (Roku, Adobe, Flickr, Autodesk) had no way to know what was happening

October 2021 — Facebook's BGP Withdrawal

On October 4, 2021, a configuration change during routine maintenance caused Facebook to withdraw its BGP routes from the global internet. Facebook effectively disappeared: its DNS stopped resolving, and its IP addresses vanished from routing tables. The outage lasted roughly 6.5 hours.

Facebook's internal tools also went down, including the badge systems controlling physical access to data centers. Engineers had to be dispatched on-site to fix the issue, since remote access was impossible. Instagram, WhatsApp, and Oculus went down simultaneously.

What went wrong:

  • No blast radius control: A single configuration change had global scope with no staged rollout
  • Shared fate: Internal tooling and physical security shared the same infrastructure dependency
  • No out-of-band comms: No external communication channel existed. Facebook engineers were using Signal to coordinate

December 2021 — AWS us-east-1: The Christmas Cascade

On December 7, 2021, an automated process triggered an unexpected surge in internal network activity that overloaded networking devices and caused failures across EC2, EBS, Lambda, RDS, and dozens of other services.

The blast radius was enormous: Disney+, Tinder, Alexa, DoorDash, Venmo, Twitch, all degraded or down. Many teams assumed they were resilient. They discovered they weren't. The AWS status page initially showed only minor issues.

What went wrong:

  • Concentration risk: us-east-1 hosts a disproportionate share of internet infrastructure
  • Hidden dependencies: Services that believed they had redundancy had hidden dependencies on us-east-1 shared services
  • Status page lag: Disclosure lagged actual impact by 2–3 hours

The Hard Truth: AWS's status page is not a monitoring solution. It reflects what AWS has confirmed and chosen to disclose, not what your users are experiencing right now. Engineering teams that relied on it during the December 2021 outage were flying blind for hours.

June 2022 — Cloudflare's Global BGP Routing Failure

On June 21, 2022, Cloudflare pushed a change to its network backbone intended to increase resilience. The change took 19 data centers offline simultaneously; those locations represent only about 4% of Cloudflare's network, yet they handle roughly 50% of its global traffic. Duration: approximately 1 hour. Impact: enormous. Websites using Cloudflare-managed DNS couldn't resolve, and applications behind Cloudflare's proxy returned errors.

What went wrong:

  • Ironic failure: The change that caused the failure was intended to improve resilience
  • Incomplete validation: Change validation didn't catch the fault domain overlap
  • No independent monitoring: Downstream teams had no independent monitoring of Cloudflare's health status

January 2023 — Azure Global Outage

A planned update to a WAN router IP address caused the router to send messages to all other routers in the WAN, triggering a recomputation of adjacency and forwarding tables across the network. Azure AD (now Entra ID), Outlook, Teams, and hundreds of dependent services went offline. Duration: ~5 hours.

This was particularly damaging because Azure AD is the authentication layer for Microsoft 365. When it goes down, people can't log in to email or Teams. Enterprises that had moved everything to Microsoft's cloud discovered they had created a single throat to choke.

What went wrong:

  • No auth fallback: Authentication infrastructure had no independent fallback for Microsoft's own SaaS products
  • Blast radius: The WAN change lacked adequate blast radius controls
  • Third-party cascade: Teams using Azure AD for third-party app authentication had cascading failures outside the Microsoft ecosystem

July 2024 — CrowdStrike: The Largest IT Outage in History

On July 19, 2024, CrowdStrike pushed a sensor configuration update to Windows hosts running the Falcon agent. The update contained a logic error that caused Windows to crash with a BSOD on boot. Affected machines: 8.5 million. Airlines cancelled thousands of flights. Hospitals reverted to paper. Banks couldn't process transactions.

Recovery required a manual process on millions of individual machines: booting into safe mode and deleting a specific file. Some organizations took days to fully recover.

What went wrong:

  • QA bypass: A content update bypassed standard QA processes designed for full software releases
  • Civilizational single point of failure: Auto-update with kernel-level access across millions of endpoints created unprecedented concentration risk
  • No canary deployment: No staged rollout. The update went to every endpoint simultaneously
  • Manual-only recovery: No automated remediation path existed

Pro-Tip: Connect your incident alerting to PagerDuty via IsDown so that when a vendor service degrades, your on-call team is paged before the vendor has even updated its status page. Based on IsDown data across 6,000+ monitored services, the gap between 'outage starts' and 'vendor acknowledges' averages 45–90 minutes. That's time you don't have.
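As a minimal sketch of what that wiring can look like, here is a small service that receives a vendor-incident webhook and forwards it to PagerDuty's Events API v2. The incoming payload fields (service, status, title) are a hypothetical webhook shape rather than IsDown's documented format, and PD_ROUTING_KEY is a placeholder for your own Events API v2 integration key.

```python
# Sketch: forward a vendor-incident webhook to PagerDuty's Events API v2.
# Assumption: the incoming JSON fields (service, status, title) are a
# hypothetical shape, not IsDown's documented webhook format.
import os

import requests
from flask import Flask, request, jsonify

app = Flask(__name__)
PD_ROUTING_KEY = os.environ["PD_ROUTING_KEY"]  # your Events API v2 integration key
PD_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


@app.post("/vendor-incident")
def vendor_incident():
    incident = request.get_json(force=True)
    event = {
        "routing_key": PD_ROUTING_KEY,
        # Trigger a new alert when the vendor degrades, resolve when it recovers.
        "event_action": "resolve" if incident.get("status") == "resolved" else "trigger",
        "dedup_key": f"vendor-{incident.get('service', 'unknown')}",
        "payload": {
            "summary": incident.get("title", "Vendor incident detected"),
            "source": incident.get("service", "unknown-vendor"),
            "severity": "critical",
        },
    }
    resp = requests.post(PD_EVENTS_URL, json=event, timeout=10)
    resp.raise_for_status()
    return jsonify({"forwarded": True}), 202


if __name__ == "__main__":
    app.run(port=8080)
```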

The Patterns: What These Outages Have in Common

Incident | Year | Duration | Primary Pattern
AWS Kinesis Cascade | 2020 | ~17 hours | Single point of failure in observability stack
Facebook BGP Withdrawal | 2021 | ~6.5 hours | Global config change, no staged rollout
AWS us-east-1 Cascade | 2021 | ~6 hours | Hidden dependencies + status page lag
Cloudflare BGP Routing | 2022 | ~1 hour | Resilience change creates fragility
Azure Global WAN | 2023 | ~5 hours | Authentication as single point of failure
CrowdStrike Falcon | 2024 | Days (recovery) | Supply chain + no canary deployment

Pattern 1: Single Points of Failure That Don't Look Like Single Points of Failure

None of these organizations set out to build fragile systems. The single points of failure emerged from complexity over time: each incremental decision reasonable in isolation, catastrophic in combination. Your architecture review process needs to specifically ask "what happens if this dependency is unavailable?" for every external service in your stack.
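One lightweight way to make that question unavoidable is to keep the answers in a reviewable artifact rather than in people's heads. The sketch below assumes a hypothetical dependency register; the dependency names, failure modes, and fields are illustrative, not a prescribed taxonomy.

```python
# Sketch: encode "what happens if this dependency is unavailable?" as a
# reviewable artifact. Names and failure modes below are illustrative assumptions.
from dataclasses import dataclass

ALLOWED_FAILURE_MODES = {"degrade", "queue_and_retry", "serve_stale", "hard_down"}


@dataclass
class Dependency:
    name: str
    owner: str
    failure_mode: str   # what the product does while this dependency is unavailable
    last_tested: str    # date the failure mode was last exercised (e.g. a game day)


DEPENDENCIES = [
    Dependency("aws-kinesis", "platform-team", "queue_and_retry", "2026-03-01"),
    Dependency("cloudflare-dns", "infra-team", "hard_down", "2025-11-12"),
    Dependency("auth0", "identity-team", "serve_stale", "never"),
]


def review(deps: list[Dependency]) -> list[str]:
    """Return review findings; an empty list means the register is complete."""
    findings = []
    for dep in deps:
        if dep.failure_mode not in ALLOWED_FAILURE_MODES:
            findings.append(f"{dep.name}: unanswered failure mode '{dep.failure_mode}'")
        if dep.failure_mode == "hard_down":
            findings.append(f"{dep.name}: accepted single point of failure, document why")
        if dep.last_tested == "never":
            findings.append(f"{dep.name}: failure mode has never been exercised")
    return findings


if __name__ == "__main__":
    for finding in review(DEPENDENCIES):
        print(finding)
```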

Pattern 2: Cascade Effects Are the Real Killer

In every major outage above, the initial failure was not the full story. Cascade effects happen when failure propagates across system boundaries in ways that weren't modeled during design. The defense is not just redundancy. It's isolation: fault domains, circuit breakers, graceful degradation.
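A circuit breaker is the simplest of those isolation tools to illustrate. The sketch below is a minimal, illustrative implementation; the thresholds and the cached-fallback idea are assumptions you would tune for your own services.

```python
# Sketch: a minimal circuit breaker so a failing dependency degrades gracefully
# instead of cascading. Thresholds and the fallback are illustrative choices.
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, func, fallback):
        # While the circuit is open, skip the dependency entirely and degrade.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
        self.failures = 0
        return result


# Usage: wrap every call to an external service with a fallback path, e.g.
# breaker = CircuitBreaker()
# recommendations = breaker.call(fetch_from_vendor, fallback=lambda: CACHED_DEFAULTS)
```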

Pattern 3: Status Pages Lie — By Omission, Not Commission

Every major incident above featured a gap between actual user impact and official status page acknowledgment. This isn't bad faith. It's structural. Vendors investigate before they post. Legal reviews happen. PR considerations apply. The result: engineering teams watching status pages are always behind.

IsDown monitors 6,000+ vendor status pages and surfaces incidents as they happen, not when the vendor decides to acknowledge them. In the December 2021 AWS outage, teams using independent monitoring detected the issue up to 2 hours before AWS acknowledged it on the status page.

Pattern 4: The Tools You Use to Detect Problems Break First

The AWS 2020 Kinesis incident illustrated this perfectly. CloudWatch, the tool you'd use to investigate, was itself downstream of Kinesis. When Kinesis broke, CloudWatch broke. When CloudWatch broke, alarms stopped firing.

This is why external monitoring that doesn't run on the same infrastructure it monitors is not optional. If your monitoring depends on the thing it's monitoring, it will fail you in exactly the scenarios where you need it most.
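A minimal version of such a probe can be as small as the sketch below, assuming it runs somewhere independent of the systems it checks (a VM at a different provider, a scheduled job on a separate platform). The URLs and the Slack webhook are placeholders, not real endpoints.

```python
# Sketch: an external synthetic probe, meant to run on infrastructure that is
# independent of the systems it checks. URLs and the alert webhook are placeholders.
import json
import urllib.error
import urllib.request

CHECKS = {
    "api": "https://api.example.com/healthz",
    "auth": "https://login.example.com/healthz",
}
ALERT_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder


def probe(url: str, timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        return False


def alert(name: str, url: str) -> None:
    body = json.dumps({"text": f"Synthetic check failed: {name} ({url})"}).encode()
    req = urllib.request.Request(
        ALERT_WEBHOOK, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=5.0)


if __name__ == "__main__":
    for name, url in CHECKS.items():
        if not probe(url):
            alert(name, url)
```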

What Engineering Teams Keep Getting Wrong

Based on six years of major outages, three persistent mistakes stand out:

  • Trusting vendor status pages as the primary signal. They're a lagging indicator, not a real-time feed. By the time a vendor posts "we are investigating," you've already been affected for 30-90 minutes.
  • Assuming multi-region means resilient. Multiple regions help, until you discover that a shared service (like Kinesis, or AWS's internal networking) is itself a single point of failure across regions.
  • Treating third-party software updates as low-risk. CrowdStrike redefined what "software supply chain risk" means. Every piece of software with auto-update capability that runs with elevated privileges is a potential single point of failure.

Frequently Asked Questions

What was the worst cloud outage in history?

By scope of impact, the July 2024 CrowdStrike incident, which affected 8.5 million Windows machines across airlines, hospitals, banks, and emergency services, was the largest IT outage in recorded history. By duration and direct financial impact, the October 2021 Facebook BGP withdrawal (6+ hours, estimated $100M+ in losses to Facebook alone) and December 2021 AWS outage rank among the most damaging for the broader internet ecosystem.

How often do major cloud providers experience outages?

Major outages affecting multiple services or regions occur several times per year across the large cloud providers. AWS, Azure, and Google Cloud each experience multiple significant incidents annually. Minor service degradations (affecting individual services or specific regions) occur monthly. Historical data from IsDown across 6,000+ monitored services shows that most engineering teams experience at least one significant vendor-caused incident per quarter.

How do I detect cloud provider outages faster than their status pages?

Independent third-party monitoring that doesn't rely on the vendor's own infrastructure is the only reliable approach. Tools like IsDown aggregate signals from vendor status pages and synthetic checks to surface incidents minutes to hours before official acknowledgment. Pair this with alerts routed into your incident management workflow so your team has context before users start reporting problems.
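For a rough, do-it-yourself version of that gap detection, the sketch below probes the vendor endpoint you actually depend on and compares it with the vendor's machine-readable status page. The URLs are placeholders, and it assumes the vendor exposes a Statuspage-style /api/v2/status.json endpoint, which not every vendor provides.

```python
# Sketch: detect the gap between real impact and official acknowledgment.
# Assumptions: placeholder URLs, and a Statuspage-style status.json endpoint.
import json
import urllib.error
import urllib.request

DEPENDENCY_URL = "https://api.vendor.example.com/v1/ping"          # what you actually call
STATUS_JSON_URL = "https://status.vendor.example.com/api/v2/status.json"


def dependency_is_up(timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(DEPENDENCY_URL, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False


def status_page_says_ok(timeout: float = 5.0) -> bool:
    with urllib.request.urlopen(STATUS_JSON_URL, timeout=timeout) as resp:
        indicator = json.load(resp)["status"]["indicator"]
    return indicator == "none"  # "none" means no reported incident


if __name__ == "__main__":
    if not dependency_is_up() and status_page_says_ok():
        # This is the 45-90 minute window: you are impacted, the vendor hasn't posted yet.
        print("Vendor endpoint failing but status page still green: page on-call now.")
```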

Should I build multi-region architecture to protect against cloud outages?

Multi-region architecture helps but is not a complete solution. The December 2021 AWS outage affected teams that believed they had multi-region redundancy, because they had hidden dependencies on us-east-1 shared services. Before investing in multi-region, map all your dependencies (including third-party SaaS tools) and verify that each one can actually fail independently. Then test it.
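One way to make "then test it" concrete is to force the failure in a test and assert on the degraded behavior. The sketch below uses hypothetical names (checkout_total, a pricing client) purely to illustrate the shape of such a test.

```python
# Sketch: exercise "this dependency is unavailable" on purpose, so the failure
# mode is verified rather than assumed. All names here are hypothetical stand-ins.


class PricingUnavailable(Exception):
    pass


class StubPricingClient:
    """Stand-in for the real pricing service client, forced to fail."""
    def quote(self, sku: str) -> float:
        raise PricingUnavailable(sku)


def checkout_total(items, pricing_client, fallback_prices):
    """Example business logic: fall back to cached prices when pricing is down."""
    total = 0.0
    for sku in items:
        try:
            total += pricing_client.quote(sku)
        except PricingUnavailable:
            total += fallback_prices[sku]  # graceful degradation path
    return total


def test_checkout_survives_pricing_outage():
    total = checkout_total(
        ["sku-1", "sku-2"],
        StubPricingClient(),
        fallback_prices={"sku-1": 10.0, "sku-2": 5.0},
    )
    assert total == 15.0
```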

What's the most common cause of major cloud outages?

Configuration changes are the leading cause: BGP route changes, capacity adjustments, software updates, and firewall rule changes all appear repeatedly in major incident postmortems. The second most common cause is cascade effects from a single service failure propagating to unexpected dependencies. The CrowdStrike incident introduced a third category: third-party software supply chain failures that bypass normal change management processes.

Nuno Tomas, Founder of IsDown
