TL;DR: Most MTTR guides assume the problem is in your infra. For modern apps, it's often not — it's Stripe, AWS, Auth0, or another vendor. Vendor status pages lie by omission. The lag between impact and acknowledgment can stretch to an hour or more. You need two runbooks, proactive vendor monitoring, and graceful degradation baked in before the 3 AM page hits. This post shows you exactly how.
Your on-call phone goes off at 3:17 AM. Payments are failing. You ssh in, check your pods — all green. Database? Healthy. Load balancer? Fine. You spend 22 minutes chasing ghosts before someone checks Stripe's status page and sees the incident that started 34 minutes ago.
Those 22 minutes are pure waste, and they're exactly the kind of MTTR contribution you can eliminate without touching a single line of your own code. The fix isn't faster debugging. It's recognizing that the failure wasn't yours to debug.
The Dependency Trap: Every MTTR framework you'll find is built around a core assumption: the failure lives in infrastructure you control. That assumption made sense in 2010, when your stack was three servers in a rack. Today, it's a liability.
Modern organizations rely on 100+ SaaS tools, and each application you build typically integrates with dozens of them. Payment processors. Auth providers. CDNs. Email delivery. Observability tooling. Every single one of those is a potential outage source that doesn't show up in your dashboards, doesn't trigger your alerts, and doesn't respond to your kubectl commands.
The Hard Truth: When Stripe, AWS us-east-1, or Auth0 goes down, your standard incident runbook is actively counterproductive. You're spending MTTR budget looking for a fire that's not in your building. The teams with the best MTTR on vendor incidents aren't better debuggers. They're better at recognizing vendor failures before they start debugging.
This needs saying plainly: vendor status pages are PR tools, not monitoring tools.
The lag between when a vendor incident starts affecting customers and when it appears on their status page can range from 15 minutes to over an hour, and for major incidents, it's often closer to the latter.
Status Page Green-Washing: Vendors have a strong incentive to show green. Some vendors won't post an incident until they're confident they can post a resolution shortly after, meaning you'll see green right up until they post "Incident resolved" with no intermediate warning.
Pro-Tip: Never manually check vendor status pages during an incident. By the time you're checking, you're already behind. You need automated monitoring that detects vendor degradation from your perspective — not theirs.
This is exactly the problem IsDown solves. Rather than relying on vendors to self-report, IsDown monitors their endpoints directly and cross-references multiple data sources to detect outages before the vendor acknowledges them, which means you get alerted minutes before your customers notice anything is wrong.
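To build intuition for what outside-in monitoring does, here's a minimal DIY sketch in Python: a synthetic probe that hits a vendor endpoint from your own network and flags degradation by status code and latency. The URL and the `page_on_call` hook are hypothetical, and the thresholds are illustrative; a real setup runs checks like this on a schedule and routes failures into your on-call tooling.

```python
import time
from urllib import request

def probe(url: str, timeout: float = 5.0) -> dict:
    """Synthetic check: hit a vendor endpoint from *your* network and
    record status + latency, independent of their status page."""
    start = time.monotonic()
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
    except OSError:
        status = None  # DNS failure, refused connection, timeout: all count as down
    latency_ms = (time.monotonic() - start) * 1000
    # Degraded = non-2xx, no response, or slower than your alert threshold
    healthy = status is not None and 200 <= status < 300 and latency_ms < 2000
    return {"status": status, "latency_ms": round(latency_ms, 1), "healthy": healthy}

# Run on a schedule (cron, Lambda) against each critical vendor.
# Hypothetical endpoint and alert hook:
# result = probe("https://api.example-vendor.com/healthz")
# if not result["healthy"]:
#     page_on_call(result)  # hypothetical: route into PagerDuty, Slack, etc.
```

The point of the sketch is the vantage point: you measure the vendor from where your traffic originates, so you see degradation when your customers would, not when the vendor decides to announce it.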
You're running at a 99.9% uptime SLO. That gives you approximately 43.8 minutes of downtime per month.
Stripe has a 30-minute partial outage affecting payment processing. Your checkout is broken. You detect it in 8 minutes (optimistic), identify it as a Stripe issue in 5 more. Total customer-facing downtime: ~30 minutes.
That single vendor incident consumed 68% of your entire monthly error budget. And you had zero control over it.
The Hard Truth: Your SLO is a joint venture with every vendor in your stack. A 99.9% SLO with Stripe, Auth0, and SendGrid as dependencies isn't really 99.9%: it's capped by the product of your own availability and each of theirs.
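The arithmetic behind those numbers, as a quick back-of-envelope sketch (the per-dependency uptime figures are illustrative):

```python
MINUTES_PER_MONTH = 30.44 * 24 * 60  # average month ~= 43,834 minutes

def error_budget_minutes(slo: float) -> float:
    """Monthly downtime allowed by an availability SLO."""
    return MINUTES_PER_MONTH * (1 - slo)

budget = error_budget_minutes(0.999)
print(round(budget, 1))          # ~43.8 minutes/month

# One 30-minute Stripe incident against that budget:
print(round(30 / budget * 100))  # ~68% of the month's budget, gone

# Effective ceiling: you plus three vendors, each at 99.9% (illustrative).
effective = 0.999 ** 4
print(f"{effective:.2%}")        # 99.60%, not 99.9%
```

At four nines-of-.999 dependencies your effective ceiling is already below your stated SLO before you've shipped a single bug of your own.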
The 10-Minute Pivot Rule: If your infra looks healthy and you can't find the cause in 10 minutes, assume it's a vendor until proven otherwise. This single rule change can cut 15–20 minutes off your MTTR on vendor incidents.
Initial update: We are currently experiencing [service] degradation. Our investigation indicates this is caused by an ongoing incident at [Vendor]. We are actively monitoring their status and assessing workarounds. Next update in 15 minutes.
Ongoing update: [Vendor] incident is ongoing. [Affected feature] remains degraded. [Workaround if available.] We expect resolution [timeframe if known]. Next update in 15 minutes.
Resolution: [Vendor] has resolved their incident. [Service] is returning to normal. Total impact duration: [X minutes]. A post-incident summary will follow.
What reactive looks like: Customer reports an issue → engineer investigates → engineer checks Stripe status page → realizes it's a Stripe outage. Time lost: 15–30 minutes.
What proactive looks like: IsDown detects Stripe degradation → alert fires to PagerDuty → engineer picks up incident already knowing it's Stripe. Time lost: 2–3 minutes.
IsDown integrates directly with PagerDuty, so when a vendor you're monitoring goes degraded, it fires directly into your existing on-call workflow. No new tool to check. No manual status page polling.
For teams that run incident comms through Slack, IsDown's Slack integration posts vendor status updates automatically to your incident channels.
Pro-Tip: Graceful degradation isn't just a resilience pattern — it's an SLO strategy. If Stripe goes down and your checkout queues failed payments instead of erroring, you haven't had a customer-facing outage. That vendor incident never touches your error budget.
The Async-First Principle: Any operation that touches a third-party service should be async by default if the use case allows it. Async operations can be retried. Synchronous failures in the critical path become your outage.
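A minimal sketch of the queue-instead-of-error pattern. The gateway class, its `charge` method, and the in-memory queue are hypothetical stand-ins; in production you'd use your real vendor SDK and a durable queue (SQS, Kafka, a database table) drained by a retry worker.

```python
import queue

class PaymentGateway:
    """Hypothetical stand-in for a vendor client (e.g. a Stripe SDK wrapper)."""
    def charge(self, order_id: str, cents: int) -> None:
        raise ConnectionError("vendor degraded")  # simulate an outage

retry_queue: "queue.Queue[tuple[str, int]]" = queue.Queue()

def checkout(gateway: PaymentGateway, order_id: str, cents: int) -> str:
    """Try the synchronous path; on vendor failure, queue for async retry
    instead of surfacing an error to the customer."""
    try:
        gateway.charge(order_id, cents)
        return "paid"
    except ConnectionError:
        retry_queue.put((order_id, cents))  # durable queue in production
        return "pending"                    # customer sees success, not a 500

status = checkout(PaymentGateway(), "order-123", 4999)
print(status)               # pending
print(retry_queue.qsize())  # 1; a worker drains this when the vendor recovers
```

The design choice that matters: the customer-facing response no longer depends on the vendor being up, so the vendor's incident stops being your incident.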
Every incident record should include a failure source tag:
- Internal: Bug, deployment, configuration, capacity
- Vendor: Third-party service outage or degradation
- Infrastructure: Cloud provider, CDN, DNS, network
Over 6 months, this data tells you what's actually causing your downtime. Many teams discover, after tagging failure source for the first time, that a significant portion of their customer-facing incidents trace back to vendor issues. Not internal failures.
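Once incidents carry a source tag, the analysis is trivial. A sketch over hypothetical incident records (the IDs and mix are made up for illustration):

```python
from collections import Counter

# Hypothetical incident records, each tagged at close-out.
incidents = [
    {"id": "INC-101", "source": "vendor"},
    {"id": "INC-102", "source": "internal"},
    {"id": "INC-103", "source": "vendor"},
    {"id": "INC-104", "source": "infrastructure"},
    {"id": "INC-105", "source": "vendor"},
]

counts = Counter(i["source"] for i in incidents)
for source, n in counts.most_common():
    print(f"{source}: {n} ({n / len(incidents):.0%})")
# vendor: 3 (60%)
# internal: 1 (20%)
# infrastructure: 1 (20%)
```

Run this over six months of real records and the split tells you where your resilience investment should actually go.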
The Hard Truth: If you're not tagging failure source in your incident records, you're optimizing your on-call process based on incomplete data. You might be over-investing in internal resilience while vendor failures quietly eat your error budget.
With proactive monitoring and a pre-built vendor runbook, most teams can get detection under 5 minutes and time-to-acknowledge under 10. Total MTTR (including customer communication) should be under 20 minutes for a well-prepared team.
IsDown monitors vendor endpoints independently and detects degradation from the outside, before the vendor self-reports. In practice, that means you get alerted well before the status page turns red, often by 15 minutes or more. Over the course of a year with active vendor dependencies, that gap adds up to hours of unnecessary MTTR.
Yes, vendor incidents can run through your existing incident process, with a parallel track: same severity classification, same communication cadence. What changes is the investigation path: vendor incidents trigger containment and graceful degradation, not root cause analysis in your stack.
Use your incident history and error budget math. If Stripe went down twice last year for 20 minutes each, calculate the direct revenue impact. Add engineer time for incident response. Stack that against the engineering cost of a fallback processor or payment queue. The ROI is usually obvious within one or two incidents.
Start with your critical path: (1) payment processors, (2) auth/identity providers, (3) primary cloud provider key services, (4) CDN/DNS providers, (5) communication services in the critical path (e.g., OTP SMS). Monitor anything where a 10-minute outage would breach your SLO or trigger a customer escalation.
Nuno Tomas
Founder of IsDown
The Status Page Aggregator with Early Outage Detection
Unified vendor dashboard
Early Outage Detection
Stop the Support Flood
14-day free trial · No credit card required · No code required