TL;DR: From AWS Kinesis cascading in 2020 to CrowdStrike taking down 8.5 million Windows machines in 2024, the same patterns repeat: single points of failure, status pages that lag reality by hours, and engineering teams finding out from users instead of monitoring. Here's the full timeline, what went wrong, and why independent monitoring is the only reliable fix.
Cloud infrastructure has never been more reliable in theory. In practice, the last six years of cloud outage history have delivered some of the most disruptive incidents on record. Not because cloud providers got worse, but because the systems built on top of them got larger, more interconnected, and more brittle in ways that don't show up until everything breaks at once.
On November 25, 2020, AWS added capacity to Kinesis Data Streams in us-east-1. The change triggered a memory pressure issue in the front-end fleet. What followed was a cascade: Kinesis is the backbone for AWS's own internal telemetry pipeline. Services that depended on Kinesis for their own health reporting couldn't report. Cognito went down. CloudWatch went down. The AWS Console became unreliable. The outage lasted approximately 17 hours. The AWS status page showed green for most of it.
On October 4, 2021, a configuration change during routine maintenance caused Facebook to withdraw its BGP routes from the global internet. Facebook effectively disappeared: DNS stopped resolving, and its IP addresses vanished from routing tables. The outage lasted roughly 6.5 hours.
Facebook's internal tools also went down, including the badge systems controlling physical access to data centers. Engineers had to be dispatched on-site to fix the issue, since remote access was impossible. Instagram, WhatsApp, and Oculus went down simultaneously.
On December 7, 2021, an automated process triggered an unexpected surge in activity on AWS's internal network, overloading networking devices and causing failures across EC2, EBS, Lambda, RDS, and dozens of other services.
The blast radius was enormous: Disney+, Tinder, Alexa, DoorDash, Venmo, Twitch, all degraded or down. Many teams assumed they were resilient. They discovered they weren't. The AWS status page initially showed only minor issues.
The Hard Truth: AWS's status page is not a monitoring solution. It reflects what AWS has confirmed and chosen to disclose, not what your users are experiencing right now. Engineering teams that relied on it during the December 2021 outage were flying blind for hours.
On June 21, 2022, Cloudflare pushed a change to its network backbone intended to increase resilience. The change took 19 data centers offline simultaneously. Those 19 locations represent only about 4% of Cloudflare's network, but they handle roughly 50% of its global traffic. Duration: approximately 1 hour. Impact: enormous. Websites using Cloudflare-managed DNS couldn't resolve. Applications behind Cloudflare's proxy returned errors.
On January 25, 2023, a planned update to a WAN router IP address caused the router to send messages to all other routers in Microsoft's global WAN, triggering a recomputation of adjacency and forwarding tables across the network. Azure AD (now Entra ID), Outlook, Teams, and hundreds of dependent services went offline. Duration: ~5 hours.
This was particularly damaging because Azure AD is the authentication layer for Microsoft 365. When it goes down, people can't log in to email or Teams. Enterprises that had moved everything to Microsoft's cloud discovered they had created a single throat to choke.
On July 19, 2024, CrowdStrike pushed a sensor configuration update to Windows hosts running the Falcon agent. The update contained a logic error that caused Windows to crash with a BSOD on boot. Affected machines: 8.5 million. Airlines cancelled thousands of flights. Hospitals reverted to paper. Banks couldn't process transactions.
Recovery required a manual process on millions of individual machines: booting into safe mode and deleting a specific file. Some organizations took days to fully recover.
Pro-Tip: Connect your incident alerting to PagerDuty via IsDown so that when a vendor service degrades, your on-call team is paged before the vendor has even updated its status page. Based on IsDown data across 6,000+ monitored services, the gap between "outage starts" and "vendor acknowledges" averages 45–90 minutes. That's time you don't have.
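The paging step above can be sketched in code. The following is a minimal example of triggering a page through PagerDuty's Events API v2 once a vendor incident is detected; the `routing_key` comes from your PagerDuty service integration, and `page_on_vendor_outage` is a hypothetical helper, not an official IsDown or PagerDuty integration.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_event(routing_key: str, vendor: str, summary: str) -> dict:
    """Build a PagerDuty Events API v2 trigger payload for a vendor outage."""
    return {
        "routing_key": routing_key,              # from your PagerDuty service integration
        "event_action": "trigger",
        "dedup_key": f"vendor-outage-{vendor}",  # collapses repeat alerts for one outage
        "payload": {
            "summary": summary,
            "source": vendor,
            "severity": "critical",
        },
    }

def page_on_vendor_outage(routing_key: str, vendor: str, summary: str) -> None:
    """POST the event; PagerDuty pages whoever is on call for the service."""
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(build_event(routing_key, vendor, summary)).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

The `dedup_key` matters: a multi-hour vendor outage generates many detection signals, and without deduplication each one would open a fresh incident.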
| Incident | Year | Duration | Primary Pattern |
|---|---|---|---|
| AWS Kinesis Cascade | 2020 | ~17 hours | Single point of failure in observability stack |
| Facebook BGP Withdrawal | 2021 | 6.5 hours | Global config change, no staged rollout |
| AWS us-east-1 Cascade | 2021 | ~6 hours | Hidden dependencies + status page lag |
| Cloudflare BGP Routing | 2022 | ~1 hour | Resilience change creates fragility |
| Azure Global WAN | 2023 | ~5 hours | Authentication as single point of failure |
| CrowdStrike Falcon | 2024 | Days (recovery) | Supply chain + no canary deployment |
None of these organizations set out to build fragile systems. The single points of failure emerged from complexity over time: each incremental decision reasonable in isolation, catastrophic in combination. Your architecture review process needs to specifically ask "what happens if this dependency is unavailable?" for every external service in your stack.
In every major outage above, the initial failure was not the full story. Cascade effects happen when failure propagates across system boundaries in ways that weren't modeled during design. The defense is not just redundancy. It's isolation: fault domains, circuit breakers, graceful degradation.
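The isolation patterns above can be made concrete. Below is a minimal circuit breaker sketch (illustrative only, not any specific library): after repeated failures it stops calling the dependency for a cooldown period and serves a fallback instead, so one failing service can't drag every caller down with it.

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency and degrade gracefully instead of cascading."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before retrying
        self.failures = 0
        self.opened_at = None             # timestamp when the breaker opened

    def call(self, fn, fallback):
        # While open, skip the dependency entirely until the cooldown passes.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()
            # Half-open: cooldown elapsed, allow one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # open the breaker
            return fallback()
        self.failures = 0  # success resets the failure count
        return result
```

A caller might wrap a recommendation-service lookup with `breaker.call(fetch_recs, lambda: [])`: during an outage users see an empty recommendations panel instead of an error page.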
Every major incident above featured a gap between actual user impact and official status page acknowledgment. This isn't bad faith. It's structural. Vendors investigate before they post. Legal reviews happen. PR considerations apply. The result: engineering teams watching status pages are always behind.
IsDown monitors 6,000+ vendor status pages and surfaces incidents as they happen, not when the vendor decides to acknowledge them. In the December 2021 AWS outage, teams using independent monitoring detected the issue up to 2 hours before AWS acknowledged it on the status page.
The AWS 2020 Kinesis incident illustrated this perfectly. CloudWatch, the tool you'd use to investigate, was itself downstream of Kinesis. When Kinesis broke, CloudWatch broke. When CloudWatch broke, alarms stopped firing.
This is why external monitoring that doesn't run on the same infrastructure it monitors is not optional. If your monitoring depends on the thing it's monitoring, it will fail you in exactly the scenarios where you need it most.
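A minimal external probe, meant to run from infrastructure entirely separate from what it monitors, could look like the sketch below (the URL and timeout are placeholders; a production check would also verify response content, not just status codes):

```python
import time
import urllib.request
import urllib.error

def probe(url: str, timeout: float = 5.0) -> dict:
    """Synthetic health check; run this OUTSIDE the infrastructure it monitors."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.URLError as exc:
        # Connection refused, DNS failure, timeout: the endpoint is unreachable.
        return {"url": url, "healthy": False, "error": str(exc.reason)}
    latency_ms = (time.monotonic() - start) * 1000
    return {
        "url": url,
        "healthy": 200 <= status < 400,
        "latency_ms": round(latency_ms, 1),
    }
```

The point is not the HTTP call, which is trivial, but where it runs: if this script executed on the same cloud region it was probing, a Kinesis-style cascade would take out the probe along with everything else.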
Based on six years of major outages, three persistent mistakes stand out: single points of failure that accumulate silently as systems grow, cascade effects that cross system boundaries nobody modeled, and blind trust in vendor status pages that lag reality by hours.
By scope of impact, the July 2024 CrowdStrike incident, which affected 8.5 million Windows machines across airlines, hospitals, banks, and emergency services, was the largest IT outage in recorded history. By duration and direct financial impact, the October 2021 Facebook BGP withdrawal (6+ hours, estimated $100M+ in losses to Facebook alone) and December 2021 AWS outage rank among the most damaging for the broader internet ecosystem.
Major outages affecting multiple services or regions occur several times per year across the large cloud providers. AWS, Azure, and Google Cloud each experience multiple significant incidents annually. Minor service degradations (affecting individual services or specific regions) occur monthly. Historical data from IsDown across 6,000+ monitored services shows that most engineering teams experience at least one significant vendor-caused incident per quarter.
Independent third-party monitoring that doesn't rely on the vendor's own infrastructure is the only reliable approach. Tools like IsDown aggregate signals from vendor status pages and synthetic checks to surface incidents minutes to hours before official acknowledgment. Pair this with alerts routed into your incident management workflow so your team has context before users start reporting problems.
Multi-region architecture helps but is not a complete solution. The December 2021 AWS outage affected teams that believed they had multi-region redundancy, because they had hidden dependencies on us-east-1 shared services. Before investing in multi-region, map all your dependencies (including third-party SaaS tools) and verify that each one can actually fail independently. Then test it.
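The "then test it" step can be a simple failure-injection drill. The sketch below uses hypothetical stand-ins for a real dependency call and its degraded fallback; the drill passes only if a simulated outage produces the degraded result rather than a crash.

```python
def fetch_recommendations():
    # Simulated outage: in a real drill you would block the dependency's
    # endpoint (firewall rule, DNS override) rather than hardcode a failure.
    raise TimeoutError("simulated: recommendation service unreachable")

def fetch_recommendations_fallback():
    # Degraded mode: an empty list instead of an error for the whole page.
    return []

def with_fallback(primary, fallback):
    """Run the primary call; on any failure, engage the fallback."""
    try:
        return primary()
    except Exception:
        return fallback()

# Drill: the simulated outage must yield the degraded result, not a crash.
result = with_fallback(fetch_recommendations, fetch_recommendations_fallback)
```

Running a drill like this per dependency, per quarter, is how teams discover the hidden us-east-1 couplings described above before an outage discovers them first.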
Configuration changes are the leading cause: BGP route changes, capacity adjustments, software updates, and firewall rule changes all appear repeatedly in major incident postmortems. The second most common cause is cascade effects from a single service failure propagating to unexpected dependencies. The CrowdStrike incident introduced a third category: third-party software supply chain failures that bypass normal change management processes.
Nuno Tomas
Founder of IsDown