TL;DR: MTTR (Mean Time to Resolve) is one of the most impactful reliability metrics you can improve. The biggest wins come from reducing detection lag, tightening your alerting signal-to-noise ratio, maintaining living runbooks, and not relying on vendor status pages as your primary source of truth for third-party outages.
If you want to know how to reduce MTTR, you first need to understand what it actually measures. MTTR stands for Mean Time to Resolve (sometimes called Mean Time to Recovery or Mean Time to Repair, depending on context). It measures the average time from when an incident begins to when the system is fully restored.
The formula is simple: MTTR = Total Downtime / Number of Incidents
But that simplicity is deceptive. Most teams track MTTR from the moment someone files a ticket, not from the moment the incident actually started. That gap between incident start and detection is invisible in the metric, but it's often where hours go missing.
The Hard Truth: If you're only measuring MTTR from the moment an alert fires, you're measuring a best-case scenario. Real MTTR includes the time your systems were degraded before anyone knew about it, and that number is usually worse than you think.
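To see how much that hidden gap can move the number, here's a minimal sketch that computes MTTR both ways over the same incidents. The field names (`started_at`, `detected_at`, `resolved_at`) and timestamps are illustrative, not taken from any particular incident tracker.

```python
# Minimal sketch: MTTR computed two ways over the same incidents.
# Field names and times are illustrative.
from datetime import datetime

incidents = [
    {"started_at": datetime(2024, 5, 1, 9, 0),    # degradation actually began
     "detected_at": datetime(2024, 5, 1, 9, 40),  # first alert fired
     "resolved_at": datetime(2024, 5, 1, 10, 30)},
    {"started_at": datetime(2024, 5, 8, 14, 0),
     "detected_at": datetime(2024, 5, 8, 14, 10),
     "resolved_at": datetime(2024, 5, 8, 15, 0)},
]

def mean_minutes(deltas):
    deltas = list(deltas)
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

# "Best-case" MTTR: the clock starts when the alert fires.
mttr_from_detection = mean_minutes(
    i["resolved_at"] - i["detected_at"] for i in incidents)

# Real MTTR: the clock starts when the degradation actually began.
mttr_from_start = mean_minutes(
    i["resolved_at"] - i["started_at"] for i in incidents)

print(f"MTTR measured from detection: {mttr_from_detection:.0f} min")  # 50 min
print(f"MTTR measured from start:     {mttr_from_start:.0f} min")      # 75 min
```

Same incidents, same resolutions, and the "real" number is 50% worse purely because of detection lag.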
To reduce MTTR, you need to understand where time is actually being lost. Every incident moves through four stages:
| Phase | Description | Typical Time Lost |
|---|---|---|
| Detection | Time from incident start to first alert or notice | Minutes to hours |
| Diagnosis | Time to identify root cause and affected components | Minutes to hours |
| Resolution | Time to deploy a fix, rollback, or workaround | Minutes to days |
| Verification | Time to confirm systems are fully restored | Minutes to hours |
Most engineering effort goes into resolution, but the highest-leverage improvements usually sit in detection and diagnosis. A fix you can deploy in 10 minutes doesn't help much if it takes 3 hours to figure out which fix to deploy.
Detection lag is the time between when something breaks and when your team knows about it. For teams running complex stacks that depend on third-party services (payment processors, email providers, cloud infrastructure, CDNs), it's often the biggest problem.
Here's the failure mode that repeats itself constantly: a vendor goes down, your monitoring shows elevated errors, someone opens the vendor's status page, and it says "All Systems Operational." Engineers spend 45 minutes debugging their own code before someone checks Twitter and realizes the vendor has been down for an hour.
Vendor status pages are notoriously slow to update. Most providers post updates manually and conservatively: they confirm internally before publishing publicly. Updates routinely run 15–45 minutes behind actual degradation. For your MTTR, that's dead time.
The fix is to monitor vendor health independently. IsDown's PagerDuty integration lets you aggregate status signals from thousands of services and route them directly into your incident workflow, so you're not waiting on a vendor's comms team to update their status page before your team starts triaging.
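If you want to see what independent monitoring means in practice, here's a minimal sketch of a vendor probe that measures actual behaviour (latency, errors) instead of trusting the status page. The endpoint URL, latency budget, and timeout are placeholders; a production setup (or a dedicated service) would probe from multiple locations and look at trends rather than single requests.

```python
# Minimal sketch of an independent vendor health probe.
# The URL and thresholds are placeholders, not a real endpoint.
import time
import requests

VENDOR_HEALTH_URL = "https://api.example-vendor.com/v1/ping"  # hypothetical
LATENCY_BUDGET_S = 2.0

def probe(url: str, timeout: float = 5.0) -> dict:
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=timeout)
        latency = time.monotonic() - start
        healthy = resp.ok and latency <= LATENCY_BUDGET_S
        return {"healthy": healthy, "status": resp.status_code, "latency_s": latency}
    except requests.RequestException as exc:
        # Timeouts and connection errors are often the earliest signal that a
        # vendor is degrading, well before its status page updates.
        return {"healthy": False, "status": None, "error": str(exc)}

result = probe(VENDOR_HEALTH_URL)
if not result["healthy"]:
    # Route this into your paging/incident workflow (PagerDuty, Slack, etc.)
    print(f"Vendor degradation suspected: {result}")
```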
Alert fatigue is the state where engineers have been conditioned to ignore alerts because too many of them are noise. Every alert in your system should be:
If an alert doesn't meet all four criteria, it shouldn't wake anyone up.
Runbooks are the documentation that tells an on-call engineer what to do when a specific alert fires. When they're missing, incomplete, or out of date, diagnosis time explodes. A runbook needs to answer:
Runbooks rot fast. Treat them like code: review them after every incident, and make updating them part of your incident close process.
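One way to make "treat them like code" concrete is a staleness check you run in CI. The sketch below assumes runbooks are markdown files carrying a `last_reviewed:` line in their header; the directory layout, key name, and 90-day cutoff are all assumptions for illustration.

```python
# Minimal sketch of a "runbooks rot" check: flag any runbook without a
# recent review date. File layout and front-matter key are illustrative.
from datetime import date, timedelta
from pathlib import Path
import re

MAX_AGE = timedelta(days=90)
RUNBOOK_DIR = Path("runbooks")  # hypothetical location

stale = []
for path in RUNBOOK_DIR.glob("*.md"):
    text = path.read_text()
    # Expect a line like: "last_reviewed: 2024-04-02" in each runbook header.
    match = re.search(r"last_reviewed:\s*(\d{4}-\d{2}-\d{2})", text)
    if not match:
        stale.append((path.name, "no last_reviewed date"))
        continue
    reviewed = date.fromisoformat(match.group(1))
    if date.today() - reviewed > MAX_AGE:
        stale.append((path.name, f"last reviewed {reviewed}"))

for name, reason in stale:
    print(f"STALE RUNBOOK: {name} ({reason})")
```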
When the on-call engineer hits a wall, unclear escalation paths add significant time. Define and test escalation paths before incidents happen, not during them.
Pro-Tip: Run quarterly "escalation drills": simulate an incident and trace the escalation path from alert to resolution. You'll find broken links (wrong numbers, people who've changed roles, runbooks that reference deprecated systems) before they cost you in production.
If your team lives in Slack, your incident workflow should live there too. Every context switch during an incident costs time: an engineer who has to leave the debugging thread to check a vendor status page, open a browser tab, and come back is an engineer who just added unnecessary minutes to your MTTR.
IsDown's Slack integration pushes real-time status updates from monitored services directly into your incident channels, so your engineers have vendor status context without leaving the thread where they're already working the problem. When a vendor degrades, the signal arrives where the response is already happening.
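If you're wiring something like this up by hand instead of using the integration, a minimal sketch using a Slack incoming webhook looks like the following; the webhook URL, vendor name, and message format are placeholders.

```python
# Minimal sketch: push a vendor status change into an incident channel via a
# Slack incoming webhook. The webhook URL and message fields are placeholders.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def notify_incident_channel(vendor: str, status: str, details: str) -> None:
    payload = {
        "text": f":rotating_light: {vendor} status changed to *{status}*\n{details}"
    }
    resp = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=5)
    resp.raise_for_status()

notify_incident_channel(
    vendor="example-payments-provider",
    status="degraded",
    details="Error rate on checkout API up 8x over the last 10 minutes.",
)
```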
Post-mortems are the primary mechanism for reducing MTTR over time, but only if you run them correctly. A good post-mortem answers:
Best Practice: Run blameless post-mortems. The goal is to understand how your system failed, not to assign fault to an individual. Blame-focused retrospectives cause engineers to become defensive, surface less information, and stop reporting near-misses, which is exactly the opposite of what you need to improve reliability over time.
Anti-Pattern: Closing a post-mortem without action items that have a named owner and a deadline. A post-mortem that ends with "we should improve our alerting" is not a post-mortem. It's a meeting. Every action item needs a single owner, a due date, and a follow-up mechanism. If your incident tracker doesn't enforce this, your post-mortem process has a hole in it.
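One way to enforce that rule mechanically is a small check run when a post-mortem is closed. The sketch below assumes a simple list-of-dicts shape for action items; real incident trackers expose this data differently, so treat it as an illustration of the rule rather than an integration.

```python
# Minimal sketch of the "every action item has an owner and a deadline" rule.
# The data shape is illustrative, not tied to any particular tracker.
from datetime import date

action_items = [
    {"title": "Add synthetic check for vendor checkout API",
     "owner": "alice", "due": date(2024, 6, 15)},
    {"title": "We should improve our alerting",  # vague and unowned
     "owner": None, "due": None},
]

def incomplete(item: dict) -> list[str]:
    problems = []
    if not item.get("owner"):
        problems.append("no owner")
    if not item.get("due"):
        problems.append("no due date")
    return problems

for item in action_items:
    issues = incomplete(item)
    if issues:
        print(f"BLOCK CLOSE: '{item['title']}' ({', '.join(issues)})")
```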
A single blended MTTR number hides more than it reveals. Slice it to see where the time is actually going:
| Slice | What It Reveals |
|---|---|
| MTTR by service/component | Which parts of your stack consistently take longest to resolve |
| MTTR by incident category | Whether infra, app, or third-party incidents drive the average |
| MTTR by time of day | Whether off-hours incidents are significantly slower |
| MTTR by on-call engineer | Knowledge gaps or runbook gaps affecting specific team members |
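As a sketch of what this slicing looks like in practice, the snippet below groups a flat incident list by category and averages resolution time. The field names and numbers are illustrative, and the same grouping works for any of the slices above.

```python
# Minimal sketch: slice MTTR by incident category from a flat incident list.
from collections import defaultdict

incidents = [
    {"category": "third-party", "minutes_to_resolve": 95},
    {"category": "third-party", "minutes_to_resolve": 140},
    {"category": "application", "minutes_to_resolve": 35},
    {"category": "infrastructure", "minutes_to_resolve": 60},
]

durations = defaultdict(list)
for inc in incidents:
    durations[inc["category"]].append(inc["minutes_to_resolve"])

for category, values in sorted(durations.items()):
    print(f"{category:>15}: MTTR {sum(values) / len(values):.0f} min "
          f"over {len(values)} incident(s)")
```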
Before an incident happens, run through this checklist to audit your coverage of external service dependencies:
MTTD is Mean Time to Detect: the time from when an incident starts to when your team first becomes aware of it. MTTR is Mean Time to Resolve: the full span from incident start to resolution. MTTD is a component of MTTR. Reducing MTTD is often the fastest way to reduce overall MTTR, because it eliminates the silent period where systems are degraded, and no one is working on the problem yet.
Elite DevOps performers (per DORA research) achieve MTTR under one hour. High performers average under one day. Medium and low performers can take days or weeks. For P1 incidents specifically, most mature SRE teams target 30 minutes or less for MTTR, with detection happening within minutes of incident start.
When a vendor causes your outage, the resolution is out of your hands, but detection and communication aren't. The biggest lever is getting vendor status information faster. Most vendor status pages lag reality by 15–45 minutes; independent monitoring that watches vendor behaviour rather than waiting for self-reported status can cut detection to minutes. That alone makes a material difference to your MTTR, even when you can't control the resolution itself.
Not necessarily. The highest-impact MTTR improvements are often process changes: better runbooks, clearer escalation paths, structured post-mortems with action item follow-through. Tools help, but they're most effective when layered on solid fundamentals. Start with the process gaps before adding tooling.
MTTR directly affects your error budget consumption. Every minute of downtime counts against your SLO. Lower MTTR means less error budget burned per incident. If you're regularly exhausting your error budget, improving MTTR is often the most direct lever, especially for teams where individual incidents last hours rather than minutes.
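To make the arithmetic concrete, here's a small sketch of how quickly incidents of different lengths consume a monthly 99.9% availability budget; the SLO and MTTR values are illustrative.

```python
# Minimal sketch of error-budget burn per incident at different MTTRs.
SLO = 0.999
MINUTES_PER_MONTH = 30 * 24 * 60                   # 43,200 minutes
error_budget_min = (1 - SLO) * MINUTES_PER_MONTH   # 43.2 minutes of allowed downtime

for mttr_min in (10, 30, 120):
    burn = mttr_min / error_budget_min
    print(f"One incident at MTTR {mttr_min:>3} min burns {burn:.0%} of the monthly budget")
```

At a 99.9% SLO, a single two-hour incident burns well over the entire month's budget, which is why shaving MTTR from hours to minutes matters more than almost any other lever.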
The most reliable signal is a divergence between what you're observing and what the status page reports. If your error rates are elevated, your synthetic checks are failing, or your users are reporting issues while the vendor's status page still shows "All Systems Operational", trust your own data first.
Practical indicators of green-washing: the status page hasn't been updated in over 30 minutes during an active degradation window; the vendor acknowledges an "investigation" but hasn't updated the affected components; or social media (particularly X/Twitter) shows widespread user reports that predate any official acknowledgement.
The structural fix is to stop using vendor status pages as your primary detection mechanism. Monitor vendor behaviour independently: watch for response time increases, timeout spikes, and error rate changes that indicate a problem before it's officially acknowledged. By the time a status page reflects reality, you've already lost the detection window.
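A minimal sketch of that divergence rule, assuming you can read your own error rate and the vendor's self-reported status, might look like this; the 3x threshold and status strings are placeholders.

```python
# Minimal sketch of the "trust your own data first" rule: flag a likely vendor
# incident when your telemetry diverges from the vendor's reported status.
def vendor_incident_suspected(own_error_rate: float,
                              baseline_error_rate: float,
                              vendor_status: str) -> bool:
    elevated = own_error_rate > 3 * baseline_error_rate  # threshold is a placeholder
    vendor_claims_ok = vendor_status.lower() in ("operational",
                                                 "all systems operational")
    return elevated and vendor_claims_ok

if vendor_incident_suspected(own_error_rate=0.12,
                             baseline_error_rate=0.01,
                             vendor_status="All Systems Operational"):
    print("Telemetry diverges from vendor status page: treat as a vendor incident.")
```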
Nuno Tomas
Founder of IsDown