TL;DR: MTTD (Mean Time to Detect) and MTTR (Mean Time to Repair) are both critical reliability metrics, but most engineering teams pour all their energy into MTTR and ignore MTTD. The problem? Every minute you spend not knowing about an incident is a minute of unrecoverable downtime. Fix detection first, then optimize resolution.
Every post-mortem ends the same way: "We need to improve our incident response process." Teams buy tooling, invest in runbooks, practice chaos engineering, and shave minutes off resolution time. Meanwhile, the incident sat undetected for 45 minutes before anyone even opened a ticket.
MTTD vs MTTR isn't a question of which one matters. Both do. But the resource allocation is wildly lopsided, and that imbalance is costing teams real SLA points.
MTTD measures the average time between when an incident begins and when your team first becomes aware of it.
MTTD = Total Detection Time / Number of Incidents
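The formula is straightforward to compute from incident records. A minimal sketch, using illustrative timestamps rather than real incident data:

```python
from datetime import datetime

def mean_time_to_detect(incidents):
    """Average minutes between incident start and first team awareness."""
    total_minutes = sum(
        (detected - started).total_seconds() / 60
        for started, detected in incidents
    )
    return total_minutes / len(incidents)

# (started_at, detected_at) pairs — illustrative values only
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 45)),  # 45 min
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 15)),  # 15 min
]
print(mean_time_to_detect(incidents))  # → 30.0
```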
Detection can come from multiple sources:

- Automated monitoring and alerting on your own infrastructure
- Customer reports and support tickets
- An engineer noticing something is off
- Vendor status pages for third-party dependencies

The weakest link in that list? Vendor status pages, which consistently lag real outages by 15–45 minutes.
MTTR measures the average time from when an incident is detected to when the system is fully restored.
MTTR = Total Repair Time / Number of Incidents
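The calculation mirrors MTTD, with the clock starting at detection instead of incident start. A sketch with illustrative timestamps:

```python
from datetime import datetime

def mean_time_to_repair(incidents):
    """Average minutes from detection to full restoration."""
    total_minutes = sum(
        (resolved - detected).total_seconds() / 60
        for detected, resolved in incidents
    )
    return total_minutes / len(incidents)

# (detected_at, resolved_at) pairs — illustrative values only
incidents = [
    (datetime(2024, 1, 3, 10, 45), datetime(2024, 1, 3, 11, 5)),   # 20 min
    (datetime(2024, 1, 9, 14, 15), datetime(2024, 1, 9, 14, 45)),  # 30 min
]
print(mean_time_to_repair(incidents))  # → 25.0
```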
MTTR covers the full resolution lifecycle:

- Acknowledging and triaging the alert
- Diagnosing the root cause
- Applying a fix or mitigation (rollback, failover, patch)
- Verifying the system is fully restored
| Phase | Metric | What It Covers | Typical Ownership |
|---|---|---|---|
| Detection | MTTD | Incident start → team awareness | Observability, monitoring |
| Resolution | MTTR | Awareness → system restored | Incident response, engineering |
Total incident duration = MTTD + MTTR. You cannot optimize total downtime without addressing both.
If your MTTD is 40 minutes and your MTTR is 20 minutes, your total incident time is 60 minutes. Even if you repaired every incident instantly, users would still see 40 minutes of downtime: the floor on your total downtime is set entirely by detection speed.
The Hard Truth: You can have a world-class incident response process (sub-5-minute MTTR, perfect runbooks, automated rollbacks) and still blow through your SLA budget because detection is slow. MTTD sets the ceiling on how good your reliability can actually get.
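Plugging in the numbers from the example above makes the floor explicit (illustrative values only):

```python
mttd, mttr = 40, 20  # minutes, from the example above

total_downtime = mttd + mttr  # what users actually experience
repair_floor = mttd           # downtime that remains even with instant repair

print(total_downtime, repair_floor)  # → 60 40
```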
Modern applications depend on dozens of third-party services. When one vendor has an incident, your application breaks, but you're not the one who detects it first, and you're not the one who fixes it.
Vendor status pages are notoriously slow to update.
If your only signal for third-party incidents is "someone notices something's wrong and checks the vendor's status page," your MTTD for those incidents is however long it takes to connect the symptom to the cause. That's often 30–60 minutes.
To illustrate the impact, consider a team experiencing 10 incidents per month with a 45-minute MTTD and 20-minute MTTR:
| Scenario | MTTD | MTTR | Total Per Incident | Monthly Downtime (10 incidents) |
|---|---|---|---|---|
| Current state | 45 min | 20 min | 65 min | 650 min |
| MTTR optimized only | 45 min | 10 min | 55 min | 550 min |
| MTTD optimized only | 5 min | 20 min | 25 min | 250 min |
| Both optimized | 5 min | 10 min | 15 min | 150 min |
Cutting MTTR in half saves 100 minutes per month. Cutting MTTD from 45 to 5 minutes saves 400 minutes.
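The table's arithmetic can be verified directly. A quick sketch using the scenario numbers above:

```python
def monthly_downtime(mttd_min, mttr_min, incidents_per_month=10):
    """Total monthly downtime: per-incident duration times incident count."""
    return (mttd_min + mttr_min) * incidents_per_month

baseline  = monthly_downtime(45, 20)  # current state: 650 min
mttr_only = monthly_downtime(45, 10)  # MTTR halved: 550 min
mttd_only = monthly_downtime(5, 20)   # detection fixed: 250 min
both      = monthly_downtime(5, 10)   # both optimized: 150 min

print(baseline - mttr_only)  # → 100 minutes saved by halving MTTR
print(baseline - mttd_only)  # → 400 minutes saved by fixing detection
```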
For your own infrastructure:

- Tighten alert thresholds and add synthetic checks on user-facing paths
- Alert on symptoms (error rates, latency) rather than only on underlying causes

For third-party dependencies:

- Monitor vendor status pages automatically instead of checking them by hand
- Run health checks against the vendor endpoints your application actually depends on
Pro-Tip: Track MTTD separately for internal vs. third-party incidents. You'll almost certainly find that third-party detection time is 3–5x worse. That's where the fastest wins are, and connecting IsDown to your Slack workspace makes a measurable difference with almost zero setup time.
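A minimal sketch of that split, assuming each incident is already tagged with a source label and a detection time in minutes (hypothetical data):

```python
from collections import defaultdict

def mttd_by_source(incidents):
    """Average detection time per incident source."""
    buckets = defaultdict(list)
    for source, detect_minutes in incidents:
        buckets[source].append(detect_minutes)
    return {source: sum(v) / len(v) for source, v in buckets.items()}

# (source, minutes_to_detect) — hypothetical values
incidents = [
    ("internal", 8), ("internal", 12),
    ("third_party", 40), ("third_party", 50),
]
print(mttd_by_source(incidents))  # → {'internal': 10.0, 'third_party': 45.0}
```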
MTTA (Mean Time to Acknowledge) measures how quickly someone on your team acknowledges an alert after it fires. MTTD measures the gap before the alert fires at all. You can have an excellent MTTA of 2 minutes and a terrible MTTD of 40 minutes if your alerting doesn't fire until well into the incident.
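The distinction is easiest to see on a single incident timeline (illustrative timestamps):

```python
from datetime import datetime

incident_start = datetime(2024, 1, 3, 10, 0)
alert_fired    = datetime(2024, 1, 3, 10, 40)  # slow detection
acknowledged   = datetime(2024, 1, 3, 10, 42)  # fast acknowledgement

# MTTD covers the gap before the alert exists; MTTA starts only once it fires
mttd = (alert_fired - incident_start).total_seconds() / 60
mtta = (acknowledged - alert_fired).total_seconds() / 60

print(mttd, mtta)  # → 40.0 2.0
```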
Most teams, when they first measure MTTD honestly, find their actual numbers sit between 20 and 60 minutes. This is especially true for vendor outages, where detection depends entirely on someone manually checking a status page.
Measure both before deciding. If your MTTR is already under 15 minutes, further MTTR investment has diminishing returns. If you've never systematically tracked MTTD for third-party incidents, that's almost always the fastest win with the highest impact on total downtime.
Vendors typically know about their own incidents before they post publicly, but that internal-to-public lag is where your MTTD hides. Tools that continuously poll vendor status pages and detect changes within seconds cut that lag dramatically. Combined with automated alerting into your incident workflow, you eliminate the manual "check all the status pages" step that adds 20–40 minutes to every third-party incident.
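The core of such a poller is just change detection on fetched page content. A minimal sketch, with the fetch step simulated rather than hitting a real vendor URL:

```python
import hashlib

def status_changed(previous_hash, page_body):
    """Compare a hash of the current status page body against the last poll."""
    current = hashlib.sha256(page_body.encode()).hexdigest()
    return current != previous_hash, current

# Simulated polls — a real poller would fetch each page on a short interval
baseline = hashlib.sha256(b"All systems operational").hexdigest()

changed, _ = status_changed(baseline, "All systems operational")
print(changed)  # → False (no update since last poll)

changed, new_hash = status_changed(baseline, "Investigating elevated errors")
print(changed)  # → True (fire an alert into the incident workflow)
```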
It depends on how you define MTTR. Some teams define MTTR as the full time from incident start to resolution, in which case MTTD is a component of MTTR. Others define MTTR as the time from detection to resolution. Either convention works, but it's worth being explicit. If your MTTR clock starts at detection, you need to track MTTD separately to understand your full incident exposure.
Nuno Tomas
Founder of IsDown