TL;DR: The AWS Health Dashboard is slow, sometimes broken during major outages, and only tells you what AWS admits is broken. Real SREs layer three monitoring sources: AWS-native tools (CloudWatch, EventBridge), third-party aggregators (IsDown), and internal synthetic checks. Skip the vendor status page as your primary alert source.
The uncomfortable truth: AWS's Health Dashboard is a visibility tool for AWS, not for you. The proof came in December 2021: during the major us-east-1 incident, AWS was unable to update the Health Dashboard for hours, as documented in its own post-incident summary.
But even during "normal" outages, the lag is brutal.
Teams that rely solely on the Health Dashboard are playing defense. They wait for AWS to tell them something is wrong, then scramble to understand the blast radius. That's backwards.
The teams that win have detection before the status page updates. They know about the issue because their own monitoring caught it.
Effective AWS monitoring isn't one tool. It's three overlapping systems, each catching what the others miss.
Layer 1: AWS-Native Tools (CloudWatch & EventBridge)
Best for: Catching real-time degradation in your account, region-specific issues, and service quota exhaustion.
AWS gives you two native windows into the health of your infrastructure:
CloudWatch Metrics & Alarms
Set up alarms on the metrics that matter: API latency, error rates, throttling, and request counts. These are your signals, not AWS's interpretation. When S3 starts returning more errors than usual, you'll see it in your metrics before AWS updates the Health Dashboard.
Example: If you're running in us-east-1 and Route 53 starts failing, you'll see failed DNS lookups in your application logs and elevated error rates immediately. The Health Dashboard might not acknowledge it for 30 minutes.
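As a sketch, this is the kind of alarm that advice implies. The alarm name, namespace, metric, and threshold below are illustrative assumptions, not values from this article; the dict maps directly onto boto3's `put_metric_alarm` parameters.

```python
# Sketch only: alarm name, namespace, metric, and threshold are illustrative.
def error_rate_alarm(name: str, namespace: str, metric: str,
                     threshold: float, period: int = 60) -> dict:
    """Build kwargs for cloudwatch.put_metric_alarm(): fire when the
    metric's per-period Sum stays above `threshold` for 3 periods."""
    return {
        "AlarmName": name,
        "Namespace": namespace,            # e.g. "AWS/S3", "AWS/ApiGateway"
        "MetricName": metric,              # e.g. "5xxErrors"
        "Statistic": "Sum",
        "Period": period,                  # seconds per evaluation window
        "EvaluationPeriods": 3,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }

alarm = error_rate_alarm("s3-5xx-elevated", "AWS/S3", "5xxErrors", threshold=10)
# With credentials configured, create it for real:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Three evaluation periods keeps a single noisy minute from paging anyone; tune the threshold to your own baseline error volume.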
EventBridge Rules for AWS Health Events
AWS publishes personal health events through EventBridge. These events arrive faster than the status page updates — sometimes 10–15 minutes earlier — because they're automatically triggered by internal AWS systems, not by humans writing status updates.
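A minimal EventBridge event pattern that matches these events looks like the following; attach it to a rule whose target is whatever you already page from (an SNS topic, a Lambda function, a chat webhook):

```json
{
  "source": ["aws.health"],
  "detail-type": ["AWS Health Event"]
}
```

You can narrow the pattern further by adding `detail.service` or `detail.eventTypeCategory` filters once you know which services you care about.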
Anti-Pattern: Treating EventBridge health events as your only source. They're incomplete. A service can be degraded without triggering a health event.
Pro-Tip: Combine EventBridge health events with synthetic checks. When a health event fires, verify it by actually testing the service from your region.
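A sketch of that combination: parse the incoming event and only trigger a synthetic verification for genuine operational issues. The field names follow the published AWS Health event shape; the sample payload itself is trimmed and invented.

```python
import json

# Trimmed, invented sample; field names follow the AWS Health event shape.
SAMPLE = json.loads("""
{
  "source": "aws.health",
  "detail-type": "AWS Health Event",
  "region": "us-east-1",
  "detail": {
    "service": "EC2",
    "eventTypeCategory": "issue",
    "eventTypeCode": "AWS_EC2_OPERATIONAL_ISSUE"
  }
}
""")

def needs_verification(event: dict) -> bool:
    """Only operational issues warrant kicking off a synthetic check;
    scheduled changes and account notifications are not outages."""
    detail = event.get("detail", {})
    return (event.get("source") == "aws.health"
            and detail.get("eventTypeCategory") == "issue")
```

In practice the verification step would be your own canary request against the affected service from the affected region.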
Layer 2: Third-Party Status Aggregators
Best for: Catching outages across multiple regions, monitoring non-AWS services your infrastructure depends on, and getting a unified view without building it yourself.
Your infrastructure doesn't live in a vacuum. You depend on AWS services across multiple regions, third-party APIs (Stripe, Twilio, SendGrid, Auth0), CDN providers (CloudFront, Fastly), and DNS services (Route 53, Cloudflare).
Monitoring each vendor's status page separately is unsustainable at scale. You need aggregation.
IsDown aggregates 6,000+ service status pages, including all AWS services, regional endpoints, and third-party dependencies. When us-west-2 EC2 degrades, you know in minutes — before your customers start calling.
Layer 3: Internal Synthetic Checks
Best for: Proving that services actually work from your perspective, catching subtle degradation that status pages miss, and testing cross-region dependencies.
Status pages tell you what vendors claim is working. Synthetic monitoring tells you what actually works from your users' perspective.
Run lightweight synthetic checks from your infrastructure every 30–60 seconds: an S3 PUT/GET, a canary Lambda invocation, an HTTPS fetch of your own endpoint.
When any check fails, you know immediately — before customers report it, before the status page updates.
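A minimal sketch of such a check in Python, assuming a plain HTTPS endpoint. The consecutive-failure guard is our assumption, added so a single blip doesn't page anyone.

```python
import time
import urllib.request

def synthetic_check(url: str, timeout: float = 5.0) -> dict:
    """One lightweight check: fetch `url`, record status and latency.
    Any exception (DNS failure, timeout, TLS error) counts as a failure."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except Exception as exc:
        return {"ok": False, "error": type(exc).__name__,
                "latency_ms": (time.monotonic() - start) * 1000}
    return {"ok": 200 <= status < 400, "status": status,
            "latency_ms": (time.monotonic() - start) * 1000}

def is_degraded(results: list, max_failures: int = 2) -> bool:
    """Alert only after `max_failures` consecutive failed checks."""
    recent = results[-max_failures:]
    return len(recent) == max_failures and all(not r["ok"] for r in recent)
```

Run `synthetic_check` on a schedule (cron, a Lambda, or your scheduler of choice), keep a short history per endpoint, and alert when `is_degraded` flips to true.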
The Hard Truth: Most outage runbooks are written assuming a single-region architecture. When us-east-1 fails, they break immediately.
Best Practice: Tier Your Critical Services
Use this as a starting template — adapt tiers to match your architecture and the AWS regions you actually run in.
| Service | Region | Tier | Monitoring | Failover Strategy |
|---|---|---|---|---|
| EC2 (API servers) | us-east-1 + us-west-2 | Critical | CloudWatch + Canary | Active-active or instant failover |
| RDS (primary) | us-east-1 | Critical | EventBridge + CloudWatch | 60-second failover to read replica |
| S3 (config) | us-east-1 (replicated) | Critical | Canary PUT/GET | Read from replica on failure |
| DynamoDB | Global table | Critical | EventBridge | Automatic failover |
| Lambda | Multi-region | Medium | Canary invocation | Regional routing |
Anti-Pattern: "We'll just fail over to another region." Failover without prior testing is a guess. Run monthly failover drills to prove it works.
Not all AWS issues warrant a 3 AM wake-up call.
Page the on-call engineer when a critical-tier service degrades and your own canaries confirm the failure.
Send to Slack (don't page) for medium-tier degradation, scheduled-change notices, and unconfirmed single-check blips.
Week 1: AWS-Native Foundation
Week 2: Third-Party Aggregation
Week 3: Synthetic Monitoring
Week 4: Integration & Runbooks
If you'd rather skip building the aggregation layer yourself, IsDown monitors 6,000+ services — including all AWS regions and the third-party dependencies your infrastructure relies on. Slack, PagerDuty, and Datadog integrations included. Most teams are set up in under 10 minutes.
Does IsDown replace the AWS Health Dashboard?
No. IsDown augments the Health Dashboard. IsDown is faster and covers more services (AWS + third-party), but the Health Dashboard is still your authoritative source for AWS's official status. Use IsDown as your primary early-warning system, then cross-reference with the Health Dashboard to get official AWS communication.
Should every AWS issue page someone?
No. Alert fatigue is as dangerous as missing real alerts. Tier your services: Critical services (EC2, RDS, S3 if you depend on them) get instant notifications. Medium-tier services go to a Slack channel. Low-tier services get weekly reviews. Most teams over-monitor and tune out real alerts.
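That tiering can be sketched as a small routing function. The rule that an unconfirmed critical alert goes to Slack first, until a canary agrees, is our assumption, not the article's.

```python
def route_alert(tier: str, canary_confirmed: bool) -> str:
    """Route by service tier: critical pages, medium goes to Slack,
    everything else waits for the weekly review."""
    if tier == "critical":
        # Assumption: hold unconfirmed criticals in Slack until a canary agrees.
        return "pagerduty" if canary_confirmed else "slack"
    if tier == "medium":
        return "slack"
    return "weekly-review"
```

Keeping the routing in one pure function makes the escalation policy testable and easy to review alongside the tier table.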
What about monitoring CloudFront?
CloudFront outages are rare but brutal because they're often regional and difficult to diagnose. Monitor: CloudFront health events in EventBridge, origin error rate in CloudWatch, cache hit ratio for your distributions, and a synthetic request from a popular origin point. If your distribution goes dark, you need to know in seconds, not minutes.
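For the cache-hit-ratio signal, a sketch of the comparison logic; the 15-point tolerance is an assumed placeholder you would tune per distribution.

```python
def cache_hit_ratio(hits: int, misses: int) -> float:
    """Hit ratio over a window; an empty window counts as healthy."""
    total = hits + misses
    return hits / total if total else 1.0

def hit_ratio_dropped(current: float, baseline: float,
                      tolerance: float = 0.15) -> bool:
    """Flag when the ratio falls more than `tolerance` below its baseline."""
    return current < baseline - tolerance
```

A sudden hit-ratio drop with a flat request volume usually means the edge is failing back to your origin, which shows up in origin error rate next.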
Can I rely on EventBridge Health events alone for detection?
No. Health events are sent when AWS formally acknowledges an issue. During the early stages of an outage, before AWS posts a status, your canaries will have already detected it. Use Health events as a confirmation signal, not as your primary detection mechanism.
How do I tell whether it's my app or AWS?
Run canaries from your infrastructure. If your canary in us-east-1 fails but IsDown shows EC2 us-east-1 operational, the problem is specific to your application or account, not AWS-wide. This distinction is critical for on-call decision-making.
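That cross-reference is easy to encode. A sketch, where `aggregator_ok` stands in for whatever IsDown (or any external view of the provider) reports:

```python
def diagnose(canary_ok: bool, aggregator_ok: bool) -> str:
    """Cross-reference your own canary with the external view of the provider."""
    if canary_ok and aggregator_ok:
        return "healthy"
    if not canary_ok and aggregator_ok:
        return "likely your app or account"   # provider looks fine from outside
    if not canary_ok and not aggregator_ok:
        return "likely provider-wide"
    return "provider degraded; your path unaffected so far"
```

The two "disagreement" branches are the ones that change on-call behavior: one sends you to your own deploys and quotas, the other to your failover runbook.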
Which metrics should I monitor first?
Start with these four: Error rate (API 5xx responses), latency (p99 and p95), request volume (sudden drops indicate cascading failures), and custom business metrics (failed transactions, unprocessed jobs). For AWS-specific: check API rate limit headroom, DynamoDB throttling, and RDS connection pool saturation.
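If you compute p95/p99 yourself from raw latency samples (rather than letting CloudWatch do it), a nearest-rank sketch:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample covering p% of the data."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]
```

Nearest-rank always returns an observed value, which is what you want for tail latency; interpolating percentiles can report a latency no request actually experienced.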
Nuno Tomas
Founder of IsDown