TL;DR: Most MTTR guides assume the problem is in your infra. For modern apps, it's often not — it's Stripe, AWS, Auth0, or another vendor. Vendor status pages lie by omission. The lag between impact and acknowledgment can stretch to an hour or more. You need two runbooks, proactive vendor monitoring, and graceful degradation baked in before the 3 AM page hits. This post shows you exactly how.
Your on-call phone goes off at 3:17 AM. Payments are failing. You ssh in, check your pods — all green. Database? Healthy. Load balancer? Fine. You spend 22 minutes chasing ghosts before someone checks Stripe's status page and sees the incident that started 34 minutes ago.
Those 22 minutes are pure waste, and they're exactly the kind of MTTR contribution you can eliminate without touching a single line of your own code. The fix isn't faster debugging. It's recognizing that the failure wasn't yours to debug.
The Dependency Trap: Every MTTR framework you'll find is built around a core assumption: the failure lives in infrastructure you control. That assumption made sense in 2010, when your stack was three servers in a rack. Today, it's a liability.
Modern organizations rely on 100+ SaaS tools, and each application you build typically integrates with dozens of them. Payment processors. Auth providers. CDNs. Email delivery. Observability tooling. Every single one of those is a potential outage source that doesn't show up in your dashboards, doesn't trigger your alerts, and doesn't respond to your kubectl commands.
The Hard Truth: When Stripe, AWS us-east-1, or Auth0 goes down, your standard incident runbook is actively counterproductive. You're spending MTTR budget looking for a fire that's not in your building. The teams with the best MTTR on vendor incidents aren't better debuggers. They're better at recognizing vendor failures before they start debugging.
This needs saying plainly: vendor status pages are PR tools, not monitoring tools.
The lag between when a vendor incident starts affecting customers and when it appears on their status page can range from 15 minutes to over an hour, and for major incidents, it's often closer to the latter.
Status Page Green-Washing: Vendors have a strong incentive to show green. Some vendors won't post an incident until they're confident they can post a resolution shortly after, meaning you'll see green right up until they post "Incident resolved" with no intermediate warning.
Pro-Tip: Never manually check vendor status pages during an incident. By the time you're checking, you're already behind. You need automated monitoring that detects vendor degradation from your perspective — not theirs.
This is exactly the problem IsDown solves. Rather than relying on vendors to self-report, IsDown monitors their endpoints directly and cross-references multiple data sources to detect outages before the vendor acknowledges them, which means you get alerted minutes before your customers notice anything is wrong.
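To build intuition for what outside-in monitoring does, here's a minimal DIY sketch in Python: a synthetic probe that hits a vendor endpoint from your own network and flags degradation by status code and latency. The URL and the `page_on_call` hook are hypothetical, and the thresholds are illustrative; a real setup runs checks like this on a schedule and routes failures into your on-call tooling.

```python
import time
from urllib import request

def probe(url: str, timeout: float = 5.0) -> dict:
    """Synthetic check: hit a vendor endpoint from *your* network and
    record status + latency, independent of their status page."""
    start = time.monotonic()
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except OSError:
        status = None  # DNS failure, refused connection, timeout: all count as down
    latency_ms = (time.monotonic() - start) * 1000
    # Degraded = non-2xx, no response, or slower than your alert threshold
    healthy = status is not None and 200 <= status < 300 and latency_ms < 2000
    return {"status": status, "latency_ms": round(latency_ms, 1), "healthy": healthy}

# Run on a schedule (cron, Lambda) against each critical vendor.
# Hypothetical endpoint and alert hook:
# result = probe("https://api.example-vendor.com/healthz")
# if not result["healthy"]:
#     page_on_call(result)  # hypothetical: route into PagerDuty, Slack, etc.
```

The point of the sketch is the vantage point: you measure the vendor from where your traffic originates, so you see degradation when your customers would, not when the vendor decides to announce it.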
You're running at a 99.9% uptime SLO. That gives you approximately 43.8 minutes of downtime per month.
Stripe has a 30-minute partial outage affecting payment processing. Your checkout is broken. You detect it in 8 minutes (optimistic), identify it as a Stripe issue in 5 more. Total customer-facing downtime: ~30 minutes.
That single vendor incident consumed 68% of your entire monthly error budget. And you had zero control over it.
The Hard Truth: Your SLO is a joint venture with every vendor in your stack. A 99.9% SLO with Stripe, Auth0, and SendGrid as dependencies isn't really 99.9%: it's capped by the product of your own availability and each of theirs.
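The arithmetic behind those numbers, as a quick back-of-envelope sketch (the per-dependency uptime figures are illustrative):

```python
MINUTES_PER_MONTH = 30.44 * 24 * 60  # average month ~= 43,834 minutes

def error_budget_minutes(slo: float) -> float:
    """Monthly downtime allowed by an availability SLO."""
    return MINUTES_PER_MONTH * (1 - slo)

budget = error_budget_minutes(0.999)
print(round(budget, 1))          # ~43.8 minutes/month

# One 30-minute Stripe incident against that budget:
print(round(30 / budget * 100))  # ~68% of the month's budget, gone

# Effective ceiling: you plus three vendors, each at 99.9% (illustrative).
effective = 0.999 ** 4
print(f"{effective:.2%}")        # 99.60%, not 99.9%
```

At four nines-of-.999 dependencies your effective ceiling is already below your stated SLO before you've shipped a single bug of your own.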
The 10-Minute Pivot Rule: If your infra looks healthy and you can't find the cause in 10 minutes, assume it's a vendor until proven otherwise. This single rule change can cut 15–20 minutes off your MTTR on vendor incidents.
Initial update: We are currently experiencing [service] degradation. Our investigation indicates this is caused by an ongoing incident at [Vendor]. We are actively monitoring their status and assessing workarounds. Next update in 15 minutes.
Ongoing update: [Vendor] incident is ongoing. [Affected feature] remains degraded. [Workaround if available.] We expect resolution [timeframe if known]. Next update in 15 minutes.
Resolution: [Vendor] has resolved their incident. [Service] is returning to normal. Total impact duration: [X minutes]. A post-incident summary will follow.
What reactive looks like: Customer reports an issue → engineer investigates → engineer checks Stripe status page → realizes it's a Stripe outage. Time lost: 15–30 minutes.
What proactive looks like: IsDown detects Stripe degradation → alert fires to PagerDuty → engineer picks up incident already knowing it's Stripe. Time lost: 2–3 minutes.
IsDown integrates directly with PagerDuty, so when a vendor you're monitoring goes degraded, it fires directly into your existing on-call workflow. No new tool to check. No manual status page polling.
For teams that run incident comms through Slack, IsDown's Slack integration posts vendor status updates automatically to your incident channels.
Pro-Tip: Graceful degradation isn't just a resilience pattern — it's an SLO strategy. If Stripe goes down and your checkout queues failed payments instead of erroring, you haven't had a customer-facing outage. That vendor incident never touches your error budget.
The Async-First Principle: Any operation that touches a third-party service should be async by default if the use case allows it. Async operations can be retried. Synchronous failures in the critical path become your outage.
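A minimal sketch of the queue-instead-of-error pattern. The gateway class, its `charge` method, and the in-memory queue are hypothetical stand-ins; in production you'd use your real vendor SDK and a durable queue (SQS, Kafka, a database table) drained by a retry worker.

```python
import queue

class PaymentGateway:
    """Hypothetical stand-in for a vendor client (e.g. a Stripe SDK wrapper)."""
    def charge(self, order_id: str, cents: int) -> None:
        raise ConnectionError("vendor degraded")  # simulate an outage

retry_queue: "queue.Queue[tuple[str, int]]" = queue.Queue()

def checkout(gateway: PaymentGateway, order_id: str, cents: int) -> str:
    """Try the synchronous path; on vendor failure, queue for async retry
    instead of surfacing an error to the customer."""
    try:
        gateway.charge(order_id, cents)
        return "paid"
    except ConnectionError:
        retry_queue.put((order_id, cents))  # durable queue in production
        return "pending"                    # customer sees success, not a 500

status = checkout(PaymentGateway(), "order-123", 4999)
print(status)               # pending
print(retry_queue.qsize())  # 1; a worker drains this when the vendor recovers
```

The design choice that matters: the customer-facing response no longer depends on the vendor being up, so the vendor's incident stops being your incident.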
Every incident record should include a failure source tag:
- Internal: Bug, deployment, configuration, capacity
- Vendor: Third-party service outage or degradation
- Infrastructure: Cloud provider, CDN, DNS, network
Over 6 months, this data tells you what's actually causing your downtime. Many teams discover, after tagging failure source for the first time, that a significant portion of their customer-facing incidents trace back to vendor issues. Not internal failures.
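Once incidents carry a source tag, the analysis is trivial. A sketch over hypothetical incident records (the IDs and mix are made up for illustration):

```python
from collections import Counter

# Hypothetical incident records, each tagged at close-out.
incidents = [
    {"id": "INC-101", "source": "vendor"},
    {"id": "INC-102", "source": "internal"},
    {"id": "INC-103", "source": "vendor"},
    {"id": "INC-104", "source": "infrastructure"},
    {"id": "INC-105", "source": "vendor"},
]

counts = Counter(i["source"] for i in incidents)
for source, n in counts.most_common():
    print(f"{source}: {n} ({n / len(incidents):.0%})")
# vendor: 3 (60%)
# internal: 1 (20%)
# infrastructure: 1 (20%)
```

Run this over six months of real records and the split tells you where your resilience investment should actually go.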
The Hard Truth: If you're not tagging failure source in your incident records, you're optimizing your on-call process based on incomplete data. You might be over-investing in internal resilience while vendor failures quietly eat your error budget.
With proactive monitoring and a pre-built vendor runbook, most teams can get detection under 5 minutes and time-to-acknowledge under 10. Total MTTR (including customer communication) should be under 20 minutes for a well-prepared team.
IsDown monitors vendor endpoints independently and detects degradation from the outside, before the vendor self-reports. In practice, that means you get alerted well before the status page turns red, often by 15 minutes or more. Over the course of a year with active vendor dependencies, that gap adds up to hours of unnecessary MTTR.
Yes, vendor incidents can run through your existing incident process, with a parallel track: same severity classification, same communication cadence. What changes is the investigation path: vendor incidents trigger containment and graceful degradation, not root cause analysis in your stack.
Use your incident history and error budget math. If Stripe went down twice last year for 20 minutes each, calculate the direct revenue impact. Add engineer time for incident response. Stack that against the engineering cost of a fallback processor or payment queue. The ROI is usually obvious within one or two incidents.
Start with your critical path: (1) payment processors, (2) auth/identity providers, (3) primary cloud provider key services, (4) CDN/DNS providers, (5) communication services in the critical path (e.g., OTP SMS). Monitor anything where a 10-minute outage would breach your SLO or trigger a customer escalation.
Nuno Tomas
Founder of IsDown
The Status Page Aggregator with Early Outage Detection
Unified vendor dashboard
Early Outage Detection
Stop the Support Flood
14-day free trial · No credit card required · No code required