TL;DR: The AWS Health Dashboard is slow, sometimes broken during major outages, and only tells you what AWS admits is broken. Real SREs layer three monitoring sources: AWS-native tools (CloudWatch, EventBridge), third-party aggregators (IsDown), and internal synthetic checks. Skip the vendor status page as your primary alert source.
The uncomfortable truth: AWS's Health Dashboard is a visibility tool for AWS, not for you. The proof came in December 2021: during the major us-east-1 incident, AWS was unable to update the Health Dashboard for hours, as documented in its own post-incident summary.
But even during "normal" outages, the lag is brutal.
Teams that rely solely on the Health Dashboard are playing defense. They wait for AWS to tell them something is wrong, then scramble to understand the blast radius. That's backwards.
The teams that win have detection before the status page updates. They know about the issue because their own monitoring caught it.
Effective AWS monitoring isn't one tool. It's three overlapping systems, each catching what the others miss.
Layer 1: AWS-Native Tools (CloudWatch & EventBridge)
Best for: Catching real-time degradation in your account, region-specific issues, and service quota exhaustion.
AWS gives you two native windows into the health of your infrastructure:
CloudWatch Metrics & Alarms
Set up alarms on the metrics that matter: API latency, error rates, throttling, and request counts. These are your signals, not AWS's interpretation. When S3 starts returning more errors than usual, you'll see it in your metrics before AWS updates the Health Dashboard.
Example: If you're running in us-east-1 and Route 53 starts failing, you'll see failed DNS lookups in your application logs and elevated error rates immediately. The Health Dashboard might not acknowledge it for 30 minutes.
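As a sketch, this is the kind of alarm that advice implies. The alarm name, namespace, metric, and threshold below are illustrative assumptions, not values from this article; the dict maps directly onto boto3's `put_metric_alarm` parameters.

```python
# Sketch only: alarm name, namespace, metric, and threshold are illustrative.
def error_rate_alarm(name: str, namespace: str, metric: str,
                     threshold: float, period: int = 60) -> dict:
    """Build kwargs for cloudwatch.put_metric_alarm(): fire when the
    metric's per-period Sum stays above `threshold` for 3 periods."""
    return {
        "AlarmName": name,
        "Namespace": namespace,            # e.g. "AWS/S3", "AWS/ApiGateway"
        "MetricName": metric,              # e.g. "5xxErrors"
        "Statistic": "Sum",
        "Period": period,                  # seconds per evaluation window
        "EvaluationPeriods": 3,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }

alarm = error_rate_alarm("s3-5xx-elevated", "AWS/S3", "5xxErrors", threshold=10)
# With credentials configured, create it for real:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Three evaluation periods keeps a single noisy minute from paging anyone; tune the threshold to your own baseline error volume.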
EventBridge Rules for AWS Health Events
AWS publishes personal health events through EventBridge. These events arrive faster than the status page updates — sometimes 10–15 minutes earlier — because they're automatically triggered by internal AWS systems, not by humans writing status updates.
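A minimal EventBridge event pattern that matches these events looks like the following; attach it to a rule whose target is whatever you already page from (an SNS topic, a Lambda function, a chat webhook):

```json
{
  "source": ["aws.health"],
  "detail-type": ["AWS Health Event"]
}
```

You can narrow the pattern further by adding `detail.service` or `detail.eventTypeCategory` filters once you know which services you care about.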
Anti-Pattern: Treating EventBridge health events as your only source. They're incomplete. A service can be degraded without triggering a health event.
Pro-Tip: Combine EventBridge health events with synthetic checks. When a health event fires, verify it by actually testing the service from your region.
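A sketch of that combination: parse the incoming event and only trigger a synthetic verification for genuine operational issues. The field names follow the published AWS Health event shape; the sample payload itself is trimmed and invented.

```python
import json

# Trimmed, invented sample; field names follow the AWS Health event shape.
SAMPLE = json.loads("""
{
  "source": "aws.health",
  "detail-type": "AWS Health Event",
  "region": "us-east-1",
  "detail": {
    "service": "EC2",
    "eventTypeCategory": "issue",
    "eventTypeCode": "AWS_EC2_OPERATIONAL_ISSUE"
  }
}
""")

def needs_verification(event: dict) -> bool:
    """Only operational issues warrant kicking off a synthetic check;
    scheduled changes and account notifications are not outages."""
    detail = event.get("detail", {})
    return (event.get("source") == "aws.health"
            and detail.get("eventTypeCategory") == "issue")
```

In practice the verification step would be your own canary request against the affected service from the affected region.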
Layer 2: Third-Party Status Aggregators
Best for: Catching outages across multiple regions, monitoring non-AWS services your infrastructure depends on, and getting a unified view without building it yourself.
Your infrastructure doesn't live in a vacuum. You depend on AWS services across multiple regions, third-party APIs (Stripe, Twilio, SendGrid, Auth0), CDN providers (CloudFront, Fastly), and DNS services (Route 53, Cloudflare).
Monitoring each vendor's status page separately is unsustainable at scale. You need aggregation.
IsDown aggregates 6,000+ service status pages, including all AWS services, regional endpoints, and third-party dependencies. When us-west-2 EC2 degrades, you know in minutes — before your customers start calling.
Layer 3: Internal Synthetic Checks
Best for: Proving that services actually work from your perspective, catching subtle degradation that status pages miss, and testing cross-region dependencies.
Status pages tell you what vendors claim is working. Synthetic monitoring tells you what actually works from your users' perspective.
Run lightweight synthetic checks from your infrastructure every 30–60 seconds: an S3 PUT/GET, a canary Lambda invocation, an HTTPS fetch of your own endpoint.
When any check fails, you know immediately — before customers report it, before the status page updates.
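A minimal sketch of such a check in Python, assuming a plain HTTPS endpoint. The consecutive-failure guard is our assumption, added so a single blip doesn't page anyone.

```python
import time
import urllib.request

def synthetic_check(url: str, timeout: float = 5.0) -> dict:
    """One lightweight check: fetch `url`, record status and latency.
    Any exception (DNS failure, timeout, TLS error) counts as a failure."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except Exception as exc:
        return {"ok": False, "error": type(exc).__name__,
                "latency_ms": (time.monotonic() - start) * 1000}
    return {"ok": 200 <= status < 400, "status": status,
            "latency_ms": (time.monotonic() - start) * 1000}

def is_degraded(results: list, max_failures: int = 2) -> bool:
    """Alert only after `max_failures` consecutive failed checks."""
    recent = results[-max_failures:]
    return len(recent) == max_failures and all(not r["ok"] for r in recent)
```

Run `synthetic_check` on a schedule (cron, a Lambda, or your scheduler of choice), keep a short history per endpoint, and alert when `is_degraded` flips to true.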
The Hard Truth: Most outage runbooks are written assuming a single-region architecture. When us-east-1 fails, they break immediately.
Best Practice: Tier Your Critical Services
Use this as a starting template — adapt tiers to match your architecture and the AWS regions you actually run in.
| Service | Region | Tier | Monitoring | Failover Strategy |
|---|---|---|---|---|
| EC2 (API servers) | us-east-1 + us-west-2 | Critical | CloudWatch + Canary | Active-active or instant failover |
| RDS (primary) | us-east-1 | Critical | EventBridge + CloudWatch | 60-second failover to read replica |
| S3 (config) | us-east-1 (replicated) | Critical | Canary PUT/GET | Read from replica on failure |
| DynamoDB | Global table | Critical | EventBridge | Automatic failover |
| Lambda | Multi-region | Medium | Canary invocation | Regional routing |
Anti-Pattern: "We'll just fail over to another region." Failover without prior testing is a guess. Run monthly failover drills to prove it works.
Not all AWS issues warrant a 3 AM wake-up call.
Page the on-call engineer when a critical-tier service degrades and your own canaries confirm the failure.
Send to Slack (don't page) for medium-tier degradation, scheduled-change notices, and unconfirmed single-check blips.
Week 1: AWS-Native Foundation
Week 2: Third-Party Aggregation
Week 3: Synthetic Monitoring
Week 4: Integration & Runbooks
If you'd rather skip building the aggregation layer yourself, IsDown monitors 6,000+ services — including all AWS regions and the third-party dependencies your infrastructure relies on. Slack, PagerDuty, and Datadog integrations included. Most teams are set up in under 10 minutes.
Does IsDown replace the AWS Health Dashboard?
No. IsDown augments the Health Dashboard. IsDown is faster and covers more services (AWS + third-party), but the Health Dashboard is still your authoritative source for AWS's official status. Use IsDown as your primary early-warning system, then cross-reference with the Health Dashboard to get official AWS communication.
Should every AWS issue page someone?
No. Alert fatigue is as dangerous as missing real alerts. Tier your services: Critical services (EC2, RDS, S3 if you depend on them) get instant notifications. Medium-tier services go to a Slack channel. Low-tier services get weekly reviews. Most teams over-monitor and tune out real alerts.
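That tiering can be sketched as a small routing function. The rule that an unconfirmed critical alert goes to Slack first, until a canary agrees, is our assumption, not the article's.

```python
def route_alert(tier: str, canary_confirmed: bool) -> str:
    """Route by service tier: critical pages, medium goes to Slack,
    everything else waits for the weekly review."""
    if tier == "critical":
        # Assumption: hold unconfirmed criticals in Slack until a canary agrees.
        return "pagerduty" if canary_confirmed else "slack"
    if tier == "medium":
        return "slack"
    return "weekly-review"
```

Keeping the routing in one pure function makes the escalation policy testable and easy to review alongside the tier table.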
What about monitoring CloudFront?
CloudFront outages are rare but brutal because they're often regional and difficult to diagnose. Monitor: CloudFront health events in EventBridge, origin error rate in CloudWatch, cache hit ratio for your distributions, and a synthetic request from a popular origin point. If your distribution goes dark, you need to know in seconds, not minutes.
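For the cache-hit-ratio signal, a sketch of the comparison logic; the 15-point tolerance is an assumed placeholder you would tune per distribution.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Hit ratio over a window; an empty window counts as healthy."""
    total = hits + misses
    return hits / total if total else 1.0

def hit_ratio_dropped(current: float, baseline: float,
                      tolerance: float = 0.15) -> bool:
    """Flag when the ratio falls more than `tolerance` below its baseline."""
    return current < baseline - tolerance
```

A sudden hit-ratio drop with a flat request volume usually means the edge is failing back to your origin, which shows up in origin error rate next.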
Can I rely on EventBridge Health events alone for detection?
No. Health events are sent when AWS formally acknowledges an issue. During the early stages of an outage, before AWS posts a status, your canaries will have already detected it. Use Health events as a confirmation signal, not as your primary detection mechanism.
How do I tell whether it's my app or AWS?
Run canaries from your infrastructure. If your canary in us-east-1 fails but IsDown shows EC2 us-east-1 operational, the problem is specific to your application or account, not AWS-wide. This distinction is critical for on-call decision-making.
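That cross-reference is easy to encode. A sketch, where `aggregator_ok` stands in for whatever IsDown (or any external view of the provider) reports:

```python
def diagnose(canary_ok: bool, aggregator_ok: bool) -> str:
    """Cross-reference your own canary with the external view of the provider."""
    if canary_ok and aggregator_ok:
        return "healthy"
    if not canary_ok and aggregator_ok:
        return "likely your app or account"   # provider looks fine from outside
    if not canary_ok and not aggregator_ok:
        return "likely provider-wide"
    return "provider degraded; your path unaffected so far"
```

The two "disagreement" branches are the ones that change on-call behavior: one sends you to your own deploys and quotas, the other to your failover runbook.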
Which metrics should I monitor first?
Start with these four: Error rate (API 5xx responses), latency (p99 and p95), request volume (sudden drops indicate cascading failures), and custom business metrics (failed transactions, unprocessed jobs). For AWS-specific: check API rate limit headroom, DynamoDB throttling, and RDS connection pool saturation.
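If you compute p95/p99 yourself from raw latency samples (rather than letting CloudWatch do it), a nearest-rank sketch:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample covering p% of the data."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]
```

Nearest-rank always returns an observed value, which is what you want for tail latency; interpolating percentiles can report a latency no request actually experienced.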
Nuno Tomas
Founder of IsDown