How to Monitor AWS Status: Don't Wait for the Health Dashboard

Published on March 12, 2026.

TL;DR: The AWS Health Dashboard is slow, sometimes broken during major outages, and only tells you what AWS admits is broken. Real SREs layer three monitoring sources: AWS-native tools (CloudWatch, EventBridge), third-party aggregators (IsDown), and internal synthetic checks. Skip the vendor status page as your primary alert source.

The uncomfortable truth: AWS's Health Dashboard is a visibility tool for AWS, not for you. The proof came in December 2021.

During the December 7 us-east-1 incident, AWS's ability to update the Health Dashboard was itself impaired for hours — as documented in AWS's own post-incident summary.

But even during "normal" outages, the lag is brutal.

  • Official status pages lag 30–60 minutes behind actual service degradation — a pattern consistently observed across public post-mortems and SRE incident reports. AWS engineers are fighting fires; updating the dashboard is not the priority.
  • Your customers will call you first. You'll know about the problem from alerts, error logs, and angry Slack messages before AWS publishes anything.
  • The Health Dashboard shows only what AWS has formally acknowledged. Partial failures, throttling, elevated error rates that don't breach AWS's threshold — these stay invisible on the status page.

Teams that rely solely on the Health Dashboard are playing defense. They wait for AWS to tell them something is wrong, then scramble to understand the blast radius. That's backwards.

The teams that win have detection before the status page updates. They know about the issue because their own monitoring caught it.


The Real Monitoring Stack: Three Layers

Effective AWS monitoring isn't one tool. It's three overlapping systems, each catching what the others miss.

Layer 1: AWS-Native Signals (CloudWatch + EventBridge)

Best for: Catching real-time degradation in your account, region-specific issues, and service quota exhaustion.

AWS gives you two native windows into the health of your infrastructure:

CloudWatch Metrics & Alarms

Set up alarms on the metrics that matter: API latency, error rates, throttling, and request counts. These are your signals, not AWS's interpretation. When S3 starts returning more errors than usual, you'll see it in your metrics before AWS updates the Health Dashboard.

Example: If you're running in us-east-1 and Route 53 starts failing, you'll see failed DNS lookups in your application logs and elevated error rates immediately. The Health Dashboard might not acknowledge it for 30 minutes.
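As a rough sketch, a CloudWatch alarm on elevated S3 error rates can be expressed as a parameter set like the one below. The alarm name, bucket, threshold, and SNS topic ARN are all illustrative placeholders; in production you would pass this dict to a boto3 CloudWatch client via `cloudwatch.put_metric_alarm(**alarm)` (assumes S3 request metrics are enabled on the bucket):

```python
# Sketch of a CloudWatch alarm definition for elevated S3 5xx errors.
# All names, values, and ARNs below are placeholders — tune the threshold
# to your own baseline, then create the alarm with:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
alarm = {
    "AlarmName": "s3-5xx-error-rate",  # placeholder name
    "Namespace": "AWS/S3",
    "MetricName": "5xxErrors",         # requires S3 request metrics enabled
    "Dimensions": [
        {"Name": "BucketName", "Value": "my-bucket"},     # placeholder bucket
        {"Name": "FilterId", "Value": "EntireBucket"},
    ],
    "Statistic": "Sum",
    "Period": 60,                # evaluate one-minute windows
    "EvaluationPeriods": 2,      # two consecutive breaches before alarming
    "Threshold": 5,              # >5 errors/minute is abnormal for this app
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "notBreaching",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall"],  # placeholder
}
```

Two evaluation periods of one minute each keeps the alarm fast without paging on a single noisy data point.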

EventBridge Rules for AWS Health Events

AWS publishes personal health events through EventBridge. These events arrive faster than the status page updates — sometimes 10–15 minutes earlier — because they're automatically triggered by internal AWS systems, not by humans writing status updates.

Anti-Pattern: Treating EventBridge health events as your only source. They're incomplete. A service can be degraded without triggering a health event.

Pro-Tip: Combine EventBridge health events with synthetic checks. When a health event fires, verify it by actually testing the service from your region.
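A minimal EventBridge event pattern for those health events looks like the dict below. `aws.health` is the documented event source; the simplified matcher is only for illustration (real EventBridge matching supports nested `detail` filtering), and in practice you would register the pattern with `events.put_rule` and attach an SNS or Lambda target with `put_targets`:

```python
# Event pattern that matches AWS Health events delivered via EventBridge.
# "aws.health" is the documented source for Personal Health Dashboard events.
health_pattern = {
    "source": ["aws.health"],
    "detail-type": ["AWS Health Event"],
}

def matches(pattern: dict, event: dict) -> bool:
    """Simplified top-level matcher: every pattern key must list the
    event's value. Real EventBridge matching is richer than this."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())

# Abbreviated event shape, based on the documented aws.health format:
event = {"source": "aws.health", "detail-type": "AWS Health Event"}
print(matches(health_pattern, event))  # prints True
```

The same pattern dict, JSON-encoded, is what you would pass as `EventPattern` when creating the rule.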


Layer 2: Third-Party Aggregators

Best for: Catching outages across multiple regions, monitoring non-AWS services your infrastructure depends on, and getting a unified view without building it yourself.

Your infrastructure doesn't live in a vacuum. You depend on AWS services across multiple regions, third-party APIs (Stripe, Twilio, SendGrid, Auth0), CDN providers (CloudFront, Fastly), and DNS services (Route 53, Cloudflare).

Monitoring each vendor's status page separately is unsustainable at scale. You need aggregation.

IsDown aggregates 6,000+ service status pages, including all AWS services, regional endpoints, and third-party dependencies. When us-west-2 EC2 degrades, you know in minutes — before your customers start calling.


Layer 3: Synthetic Monitoring (Your Own Canaries)

Best for: Proving that services actually work from your perspective, catching subtle degradation that status pages miss, and testing cross-region dependencies.

Status pages tell you what vendors claim is working. Synthetic monitoring tells you what actually works from your users' perspective.

Run lightweight synthetic checks from your infrastructure every 30–60 seconds:

  • EC2 instance launches (us-east-1, us-west-2, eu-west-1)
  • S3 PUT/GET to a test bucket
  • RDS connection and simple query
  • Lambda invocation and response time
  • CloudFront cache hit/miss ratio
  • Route 53 DNS resolution latency

When any check fails, you know immediately — before customers report it, before the status page updates.
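The S3 PUT/GET check from the list above can be sketched as a small canary function. The client is injected so the same code runs against a real boto3 S3 client or a test stub; the default bucket name is a placeholder:

```python
import time
import uuid

def s3_canary(s3, bucket: str = "canary-bucket") -> dict:
    """Write a small object and read it back, returning pass/fail plus
    latency. `s3` is any object exposing boto3-style put_object/get_object
    (in production: boto3.client("s3"))."""
    key = f"canary/{uuid.uuid4()}"      # unique key per run, no collisions
    payload = b"ping"
    start = time.monotonic()
    try:
        s3.put_object(Bucket=bucket, Key=key, Body=payload)
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        ok = body == payload            # round-trip must return same bytes
    except Exception:
        ok = False                      # any API failure marks the check red
    return {"ok": ok, "latency_s": round(time.monotonic() - start, 3)}
```

Run it on a 30–60 second schedule from each region you care about and alert when `ok` flips to `False` or latency drifts past your baseline.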


Multi-Region Monitoring: Why Single-Region Thinking Breaks

The Hard Truth: Most outage runbooks are written assuming a single-region architecture. They break immediately in practice.

When us-east-1 fails:

  • Your primary region is down
  • Your failover region (us-west-2) is suddenly getting 10x traffic
  • Cross-region dependencies (RDS read replicas, S3 replication) can cascade failures
  • Your backup infrastructure wasn't tested for this load

Best Practice: Tier Your Critical Services

Use this as a starting template — adapt tiers to match your architecture and the AWS regions you actually run in.

| Service | Region | Tier | Monitoring | Failover Strategy |
| --- | --- | --- | --- | --- |
| EC2 (API servers) | us-east-1 + us-west-2 | Critical | CloudWatch + Canary | Active-active or instant failover |
| RDS (primary) | us-east-1 | Critical | EventBridge + CloudWatch | 60-second failover to read replica |
| S3 (config) | us-east-1 (replicated) | Critical | Canary PUT/GET | Read from replica on failure |
| DynamoDB | Global table | Critical | EventBridge | Automatic failover |
| Lambda | Multi-region | Medium | Canary invocation | Regional routing |

Anti-Pattern: "We'll just fail over to another region." Failover without prior testing is a guess. Run monthly failover drills to prove it works.


Alerting: What to Actually Page On

Not all AWS issues warrant a 3 AM wake-up call.

Page the on-call engineer:

  • Tier-1 services are down (EC2, RDS, S3 in your primary region) — canary failed
  • Multiple regions affected — indicates AWS-wide incident
  • Personal Health Dashboard event + verified by canary
  • Error rate spike >10% for >2 minutes (your own metric, not AWS's judgment)

Send to Slack (don't page):

  • Single-region degradation in non-critical services
  • Status page updates for informational items
  • Elevated API latencies not yet causing errors
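The page-vs-Slack split above can be encoded as a small routing function. The tier names and the 10%-for-2-minutes threshold mirror the rules in the lists; everything else (signature, field names) is an illustrative sketch:

```python
def alert_route(tier: str, regions_affected: int, canary_failed: bool,
                error_rate: float, minutes_elevated: float) -> str:
    """Decide whether an AWS issue pages on-call or goes to Slack.
    Pages on: tier-1 canary failures, multi-region incidents, or a
    sustained error-rate spike (>10% for >2 minutes)."""
    if tier == "critical" and canary_failed:
        return "page"
    if regions_affected > 1:          # likely an AWS-wide incident
        return "page"
    if error_rate > 0.10 and minutes_elevated > 2:
        return "page"
    return "slack"                    # everything else is informational
```

Codifying the policy keeps 3 AM paging decisions out of individual judgment and makes the thresholds reviewable in a pull request.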

Setting Up AWS Status Monitoring: The Checklist

Week 1: AWS-Native Foundation

  • Enable CloudWatch alarms for critical metrics (error rate, latency, request count)
  • Set up EventBridge rule for aws.health events and route to SNS/email
  • Verify Personal Health Dashboard access for your AWS account
  • Document which AWS regions are critical for your business


Week 2: Third-Party Aggregation

  • Add critical AWS services to IsDown (EC2, RDS, S3, Lambda, Route 53, CloudFront)
  • Configure IsDown Slack integration or PagerDuty integration
  • Add non-AWS dependencies (Stripe, Twilio, SendGrid, Auth0, etc.)
  • Test Slack/PagerDuty routing with a manual alert


Week 3: Synthetic Monitoring

  • Build 3 lightweight canary scripts (EC2, S3, Lambda)
  • Deploy canaries to run every 60 seconds from primary + failover regions
  • Test alert routing when a canary fails (disable service temporarily)
  • Document runbook: "If canary X fails, do Y"


Week 4: Integration & Runbooks

  • Map all external dependencies (list every third-party service your app calls)
  • Build dependency tier chart (Critical, High, Medium, Low)
  • Document cross-region failover procedure
  • Run a mock incident: "us-east-1 EC2 is down, simulate failover"
  • Update on-call runbook with correct alert sources (IsDown + CloudWatch, not Health Dashboard)

If you'd rather skip building the aggregation layer yourself, IsDown monitors 6,000+ services — including all AWS regions and the third-party dependencies your infrastructure relies on. Slack, PagerDuty, and Datadog integrations included. Most teams are set up in under 10 minutes.


Frequently Asked Questions

Does IsDown replace the AWS Health Dashboard?

No. IsDown augments the Health Dashboard. IsDown is faster and covers more services (AWS + third-party), but the Health Dashboard is still your authoritative source for AWS's official status. Use IsDown as your primary early-warning system, then cross-reference with the Health Dashboard to get official AWS communication.

Should we set up alerts for every AWS service?

No. Alert fatigue is as dangerous as missing real alerts. Tier your services: Critical services (EC2, RDS, S3 if you depend on them) get instant notifications. Medium-tier services go to a Slack channel. Low-tier services get weekly reviews. Most teams over-monitor and tune out real alerts.

What's the best way to monitor CloudFront?

CloudFront outages are rare but brutal because they're often regional and difficult to diagnose. Monitor: CloudFront health events in EventBridge, origin error rate in CloudWatch, cache hit ratio for your distributions, and synthetic requests to your distribution from several geographic vantage points. If your distribution goes dark, you need to know in seconds, not minutes.

Can we rely on AWS Health events alone?

No. Health events are sent when AWS formally acknowledges an issue. During the early stages of an outage, before AWS posts a status, your canaries will have already detected it. Use Health events as a confirmation signal, not as your primary detection mechanism.

How do we know if an outage is hitting us specifically vs. the whole region?

Run canaries from your infrastructure. If your canary in us-east-1 fails but IsDown shows EC2 us-east-1 operational, the problem is specific to your application or account, not AWS-wide. This distinction is critical for on-call decision-making.
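That us-vs-them triage step is simple enough to automate. A sketch, assuming you can query both your own canary result and the aggregator's status for the same service/region (the function name and labels are illustrative):

```python
def diagnose(canary_ok: bool, vendor_status_ok: bool) -> str:
    """Combine your own canary with third-party status data to classify
    an incident: healthy, our problem, or a wider provider outage."""
    if canary_ok:
        return "healthy"
    # Canary failed: the vendor's reported status decides the blast radius.
    if vendor_status_ok:
        return "likely our issue"     # AWS looks fine; suspect our account/app
    return "likely provider-wide"     # both signals red; it's bigger than us
```

Feeding this label into the alert payload tells the on-call engineer where to start looking before they even open a dashboard.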

What metrics should we alarm on?

Start with these four: Error rate (API 5xx responses), latency (p99 and p95), request volume (sudden drops indicate cascading failures), and custom business metrics (failed transactions, unprocessed jobs). For AWS-specific: check API rate limit headroom, DynamoDB throttling, and RDS connection pool saturation.

Nuno Tomas, Founder of IsDown
