
MTTD vs MTTR: Why Detection Time Kills Your SLA First

Published Sep 2, 2025.

TL;DR: MTTD (Mean Time to Detect) and MTTR (Mean Time to Repair) are both critical reliability metrics, but most engineering teams pour all their energy into MTTR and ignore MTTD. The problem? Every minute you spend not knowing about an incident is a minute of unrecoverable downtime. Fix detection first, then optimize resolution.

The Metric Everyone Obsesses Over - and the One That Actually Costs More

Every post-mortem ends the same way: "We need to improve our incident response process." Teams buy tooling, invest in runbooks, practice chaos engineering, and shave minutes off resolution time. Meanwhile, the incident sat undetected for 45 minutes before anyone even opened a ticket.

MTTD vs MTTR isn't a question of which one matters. Both do. But the resource allocation is wildly lopsided, and that imbalance is costing teams real SLA points.

What Is MTTD (Mean Time to Detect)?

MTTD measures the average time between when an incident begins and when your team first becomes aware of it.

MTTD = Total Detection Time / Number of Incidents

Detection can come from multiple sources:

  • Internal monitoring and alerting systems
  • Customer support tickets and user complaints
  • Social media and community reports
  • Vendor status page updates
  • Colleagues noticing something feels off

The weakest link in that list? Vendor status pages, which consistently lag real outages by 15–45 minutes.
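The formula above is simple enough to sketch in a few lines of Python. This assumes each incident record carries a start timestamp and a detection timestamp (the timestamps here are illustrative, not from any real incident log):

```python
from datetime import datetime

def mean_time_to_detect(incidents):
    """MTTD = total detection time / number of incidents, in minutes.

    Each incident is a (started_at, detected_at) pair of datetimes.
    """
    if not incidents:
        return 0.0
    total_seconds = sum(
        (detected - started).total_seconds()
        for started, detected in incidents
    )
    return total_seconds / len(incidents) / 60

# Example: two incidents, detected 45 and 15 minutes after they began
incidents = [
    (datetime(2025, 9, 1, 10, 0), datetime(2025, 9, 1, 10, 45)),
    (datetime(2025, 9, 1, 14, 0), datetime(2025, 9, 1, 14, 15)),
]
print(mean_time_to_detect(incidents))  # → 30.0
```

The hard part in practice is not the arithmetic but the `started_at` value: as discussed below, pinning down when an incident actually began usually means digging through logs after the fact.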

What Is MTTR (Mean Time to Repair)?

MTTR measures the average time from when an incident is detected to when the system is fully restored.

MTTR = Total Repair Time / Number of Incidents

MTTR covers the full resolution lifecycle:

  • Incident triage and escalation
  • Root cause identification
  • Implementing a fix or workaround
  • Verifying restoration and monitoring for recurrence
  • Communicating resolution to stakeholders
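The same pattern applies to MTTR, with the clock starting at detection and stopping at verified restoration. A minimal sketch, with illustrative timestamps:

```python
from datetime import datetime

def mean_time_to_repair(incidents):
    """MTTR = total repair time / number of incidents, in minutes.

    Each incident is a (detected_at, restored_at) pair: the clock starts
    when the team becomes aware and stops when the system is restored.
    """
    if not incidents:
        return 0.0
    total_seconds = sum(
        (restored - detected).total_seconds()
        for detected, restored in incidents
    )
    return total_seconds / len(incidents) / 60

# Example: two incidents, repaired in 20 and 10 minutes after detection
incidents = [
    (datetime(2025, 9, 1, 10, 45), datetime(2025, 9, 1, 11, 5)),
    (datetime(2025, 9, 1, 14, 15), datetime(2025, 9, 1, 14, 25)),
]
print(mean_time_to_repair(incidents))  # → 15.0
```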

MTTD vs MTTR: How They Work Together

Phase        Metric   What It Covers                    Typical Ownership
Detection    MTTD     Incident start → team awareness   Observability, monitoring
Resolution   MTTR     Awareness → system restored       Incident response, engineering

Total incident duration = MTTD + MTTR. You cannot optimize total downtime without addressing both.

If your MTTD is 40 minutes and your MTTR is 20 minutes, your total incident time is 60 minutes. Even if you cut MTTR to zero, total downtime can never drop below 40 minutes: the floor on your total downtime is set entirely by detection speed.

The Hard Truth: You can have a world-class incident response process (sub-5-minute MTTR, perfect runbooks, automated rollbacks) and still blow through your SLA budget because detection is slow. MTTD sets the ceiling on how good your reliability can actually get.

Why MTTD Gets Ignored

  • Harder to measure. MTTR shows up cleanly in your incident tracking system. MTTD requires knowing when the incident actually started, which means correlating logs and making educated guesses from post-mortem timelines.
  • Feels less controllable. MTTR is a process problem you can train for. MTTD can feel like luck.
  • Tooling bias. The incident management industry has built incredible tools for response coordination. Detection tooling, especially for third-party dependencies, is far less mature.
  • Post-mortems start from awareness, not incident start. The pre-detection gap often goes entirely unanalyzed.

The Hidden MTTD Killer: Vendor Status Page Lag

Modern applications depend on dozens of third-party services. When one vendor has an incident, your application breaks, but you're not the one who detects it first, and you're not the one who fixes it.

Vendor status pages are notoriously slow to update:

  • Vendors investigate internally before posting anything, often 15–45 minutes of silence
  • Initial posts are frequently vague: "We are investigating reports of elevated error rates"
  • Some vendors under-report severity while impact is still being assessed
  • Status page updates are often driven by comms teams, adding further lag

If your only signal for third-party incidents is "someone notices something's wrong and checks the vendor's status page," your MTTD for those incidents is however long it takes to connect the symptom to the cause. That's often 30–60 minutes.

Putting Numbers to It

To illustrate the impact, consider a team experiencing 10 incidents per month with a 45-minute MTTD and 20-minute MTTR:

Scenario              MTTD     MTTR     Total Per Incident   Monthly Downtime (10 incidents)
Current state         45 min   20 min   65 min               650 min
MTTR optimized only   45 min   10 min   55 min               550 min
MTTD optimized only   5 min    20 min   25 min               250 min
Both optimized        5 min    10 min   15 min               150 min

Cutting MTTR in half saves 100 minutes per month. Cutting MTTD from 45 to 5 minutes saves 400 minutes.
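The arithmetic behind that comparison is easy to reproduce. A quick sketch using the example figures above (10 incidents per month):

```python
def monthly_downtime(mttd_min, mttr_min, incidents_per_month=10):
    """Total monthly downtime in minutes: (MTTD + MTTR) × incident count."""
    return (mttd_min + mttr_min) * incidents_per_month

current   = monthly_downtime(45, 20)  # 650 min
mttr_only = monthly_downtime(45, 10)  # 550 min
mttd_only = monthly_downtime(5, 20)   # 250 min
both      = monthly_downtime(5, 10)   # 150 min

print(current - mttr_only)  # → 100  (halving MTTR)
print(current - mttd_only)  # → 400  (MTTD from 45 to 5 minutes)
```

Same effort framing, four times the payoff, because the detection gap was four times larger to begin with.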

How to Actually Improve MTTD

For your own infrastructure:

  • Synthetic monitoring: Sub-minute polling intervals on critical endpoints catch failures before users do.
  • Anomaly detection: Error rate and latency alerts that fire before thresholds become obvious give you a head start on triage.
  • Actionable alerting: Pages people directly. Don't rely on dashboards nobody watches during an incident.
  • Post-mortem discipline: Track incident start time explicitly so you can measure MTTD over time and spot patterns.
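The synthetic monitoring and alerting items above boil down to two decisions per probe: is this check healthy, and have we failed often enough to page a human? A minimal sketch of that logic, with hypothetical thresholds (non-2xx status or latency above 2 seconds counts as failing, and three consecutive failures trigger a page to avoid flapping):

```python
def evaluate_check(status_code, latency_s, max_latency_s=2.0):
    """Classify one synthetic probe: healthy only if 2xx and fast enough."""
    return 200 <= status_code < 300 and latency_s <= max_latency_s

def should_page(recent_results, threshold=3):
    """Page only after `threshold` consecutive failing probes."""
    if len(recent_results) < threshold:
        return False
    return not any(recent_results[-threshold:])

# A probe loop would append evaluate_check(...) results every 30s or so
history = [True, True, False, False, False]
print(should_page(history))  # → True
```

Tuning the consecutive-failure threshold is the usual trade-off: lower means faster detection (better MTTD), higher means fewer false pages.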

For third-party dependencies:

  • Automate status page monitoring: Stop relying on manual checks. The moment a vendor posts an update, you should already know.
  • Use dedicated tooling: A tool that continuously watches vendor status pages and fires alerts the moment a vendor reports an issue eliminates the manual lookup entirely.
  • Unified alert routing: Route vendor alerts through the same channels as your own infrastructure alerts so nothing falls through the cracks.
  • Separate your tracking: Measure MTTD for internal vs. third-party incidents independently. The gap will surprise you.
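Many hosted status pages (Atlassian Statuspage among them) expose a machine-readable summary at a path like `/api/v2/status.json`. A minimal sketch of automated polling and change detection, assuming that convention; the vendor URL and polling cadence are placeholders, and a real setup would route the alert string into Slack or your paging tool:

```python
import json
import urllib.request

def fetch_indicator(status_url):
    """Fetch a Statuspage-style status.json and return its indicator
    ("none", "minor", "major", or "critical")."""
    with urllib.request.urlopen(status_url, timeout=10) as resp:
        payload = json.load(resp)
    return payload["status"]["indicator"]

def detect_change(previous, current):
    """Return an alert message when the indicator transitions, else None."""
    if previous == current:
        return None
    if current == "none":
        return f"Vendor recovered (was {previous})"
    return f"Vendor incident: indicator changed {previous} -> {current}"

# A poller would call fetch_indicator(...) on an interval and compare:
print(detect_change("none", "major"))  # → Vendor incident: indicator changed none -> major
```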

How to Actually Improve MTTR

  • Runbooks: Write them for your most common incident types and keep them current: a stale runbook is worse than none.
  • Automation: Automate rollbacks and common remediation steps wherever the blast radius is understood.
  • Practice: Run tabletop exercises for incident response. Muscle memory matters under pressure.
  • Escalation paths: Reduce friction so the right person is reachable in under two minutes.
  • Clear definition of "resolved": Define it upfront so you don't spend 20 minutes debating when to close the incident.

Pro-Tip: Track MTTD separately for internal vs. third-party incidents. You'll almost certainly find that third-party detection time is 3–5x worse. That's where the fastest wins are, and connecting IsDown to your Slack workspace makes a measurable difference with almost zero setup time.

Frequently Asked Questions

What's the difference between MTTD and MTTA?

MTTA (Mean Time to Acknowledge) measures how quickly someone on your team acknowledges an alert after it fires. MTTD measures the gap before the alert fires at all. You can have an excellent MTTA of 2 minutes and a terrible MTTD of 40 minutes if your alerting doesn't fire until well into the incident.

What's a good benchmark for MTTD?

Most teams, when they first measure MTTD honestly, find their actual numbers sit between 20 and 60 minutes. This is especially true for vendor outages, where detection depends entirely on someone manually checking a status page.

Should I prioritize improving MTTD or MTTR first?

Measure both before deciding. If your MTTR is already under 15 minutes, further MTTR investment has diminishing returns. If you've never systematically tracked MTTD for third-party incidents, that's almost always the fastest win with the highest impact on total downtime.

How does monitoring vendor status pages reduce MTTD?

Vendors typically know about their own incidents before they post publicly, but that internal-to-public lag is where your MTTD hides. Tools that continuously poll vendor status pages and detect changes within seconds cut that lag dramatically. Combined with automated alerting into your incident workflow, you eliminate the manual "check all the status pages" step that adds 20–40 minutes to every third-party incident.

Is MTTD part of MTTR?

It depends on how you define MTTR. Some teams define MTTR as the full time from incident start to resolution, in which case MTTD is a component of MTTR. Others define MTTR as the time from detection to resolution. Either convention works, but it's worth being explicit. If your MTTR clock starts at detection, you need to track MTTD separately to understand your full incident exposure.

Nuno Tomas, Founder of IsDown
