Error Budget in SRE: The Complete Guide (2026)

Published at May 20, 2026.

TL;DR: Error budgets are the operationalized form of your SLO: they quantify exactly how much unreliability you can afford in a given window. A 99.9% availability SLO gives you 43.2 minutes of downtime per month — that's your budget. Burn it fast, freeze feature work. Burn it slow, ship faster. The system only works if you measure burn rate in real time, hold vendors accountable (their downtime counts against your budget too), and enforce the policy without exceptions.

What an Error Budget Actually Is

An error budget is the acceptable amount of unreliability permitted by your SLO over a defined time window. It is not a target. It is not a stretch goal. It is a hard ceiling that, when breached, should trigger a pre-agreed organizational response — feature freezes, postmortems, or infrastructure investment.

The formula is blunt:

Error Budget = 1 - SLO Target
Error Budget (time) = (1 - SLO Target) × Window Duration

For a 30-day window:

  • 99.9% SLO → 0.1% budget → 43.2 minutes of allowable downtime
  • 99.95% SLO → 0.05% budget → 21.6 minutes
  • 99.99% SLO → 0.01% budget → 4.32 minutes
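
The figures above can be reproduced with a few lines of Python. This is a minimal sketch rather than a reference implementation; the function name `error_budget_minutes` and its parameters are illustrative choices, and the window is assumed to be a fixed number of whole days.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowable downtime in minutes for an availability SLO.

    slo_target is a fraction (0.999 for 99.9%), not a percentage.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    # Error Budget (time) = (1 - SLO Target) x Window Duration
    return (1.0 - slo_target) * window_minutes


for target in (0.999, 0.9995, 0.9999):
    print(f"{target:.2%} SLO -> {error_budget_minutes(target):.2f} minutes")
```

Passing a different `window_days` shows how a 28-day or 31-day window shifts the same SLO to a different absolute budget.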

That last number should make you uncomfortable. Four minutes across an entire month. A single bad deploy, a flapping dependency, one vendor going dark — and you're over budget before your on-call engineer has finished reading the alert.

How Error Budget Calculation Works in Practice

The SLO error budget calculation requires three inputs:

  1. The SLO target: what percentage of requests/interactions must succeed
  2. The measurement window: rolling 30 days is a common choice; calendar months vary in length (28 to 31 days), so the absolute budget changes from month to month and resets abruptly on the 1st
  3. The SLI: the specific metric being measured (availability, latency, correctness)

For a request-based SLI:

Error Budget (requests) = Total Requests × (1 - SLO Target)
Budget Remaining = Error Budget - Bad Requests Observed

Example: 10 million requests in a 30-day window with a 99.9% SLO:

  • Total error budget: 10,000 bad requests
  • If 3,200 bad requests have occurred, budget remaining: 6,800 requests (68%)
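
The same request-based example, written out as code. This is a sketch under the assumption that good and bad requests are already counted elsewhere; `request_budget` and the dictionary keys are hypothetical names, not part of any particular SLO tool.

```python
def request_budget(total_requests: int, slo_target: float, bad_requests: int) -> dict:
    """Request-based error budget for one window.

    Returns the total budget, what is left of it, and the remaining share.
    """
    budget = total_requests * (1.0 - slo_target)
    remaining = budget - bad_requests
    return {
        "budget_requests": budget,
        "remaining_requests": remaining,
        "remaining_fraction": remaining / budget,
    }


# The example from the text: 10 million requests, 99.9% SLO, 3,200 failures.
status = request_budget(10_000_000, 0.999, 3_200)
print(round(status["budget_requests"]),
      round(status["remaining_requests"]),
      f"{status['remaining_fraction']:.0%}")
```

A negative `remaining_requests` means the budget is overspent for the window.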

For time-based SLIs (availability):

Budget Minutes = (1 - SLO Target) × Total Minutes in Window
Error Budget Consumed % = Downtime Minutes / Budget Minutes
Budget Remaining % = 1 - (Downtime Minutes / Budget Minutes)
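
A minimal sketch of the time-based calculation follows. It takes budget consumed as downtime divided by budget minutes, which is the complement of the Budget Remaining % formula, and reports observed availability separately. The function name and the 10.8-minute figure are illustrative assumptions.

```python
def availability_budget_status(downtime_minutes: float,
                               slo_target: float,
                               window_days: int = 30) -> dict:
    """Time-based error budget status for an availability SLI."""
    window_minutes = window_days * 24 * 60
    budget_minutes = (1.0 - slo_target) * window_minutes
    consumed = downtime_minutes / budget_minutes
    return {
        "budget_minutes": budget_minutes,
        "consumed_fraction": consumed,
        "remaining_fraction": 1.0 - consumed,
        # Share of the whole window that the service was up.
        "observed_availability": 1.0 - downtime_minutes / window_minutes,
    }


# Hypothetical month: 10.8 minutes of downtime against a 99.9% SLO.
report = availability_budget_status(10.8, 0.999)
print(f"consumed {report['consumed_fraction']:.0%}, "
      f"remaining {report['remaining_fraction']:.0%}")
```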

Burn Rate: The Metric That Actually Matters

Raw budget remaining is a lagging indicator. Burn rate is the signal. Burn rate measures how fast you're consuming your error budget relative to how fast you're earning it.

Burn Rate = Error Rate / (1 - SLO Target)

Where Error Rate = bad requests / total requests over the window being evaluated (for example, the last hour or the last six hours).

At a burn rate of 1.0, you're consuming your budget at exactly the rate it accrues — you'll hit 0% remaining at the end of the window with nothing to spare. That's already bad.

At a burn rate of 14.4 over a 1-hour window, you're consuming 2% of your monthly error budget per hour — enough to warrant an immediate page.
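
The arithmetic behind those two statements can be sketched as follows, assuming a 30-day (720-hour) SLO period. The helper names are hypothetical, and the request counts in the example are made up to produce a 14.4x burn.

```python
PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


def burn_rate(bad_requests: int, total_requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_requests == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_rate = bad_requests / total_requests
    return error_rate / (1.0 - slo_target)


def budget_fraction_consumed(rate: float, window_hours: float) -> float:
    """Share of the period's budget used if `rate` holds for `window_hours`."""
    return rate * window_hours / PERIOD_HOURS


def hours_to_exhaustion(rate: float) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    return float("inf") if rate <= 0 else PERIOD_HOURS / rate


# Hypothetical hour: 1,440 failures out of 100,000 requests, 99.9% SLO.
rate = burn_rate(1_440, 100_000, 0.999)
print(f"burn rate {rate:.1f}")
print(f"budget used in 1h: {budget_fraction_consumed(rate, 1):.1%}")
print(f"hours to exhaustion: {hours_to_exhaustion(rate):.0f}")
```

At a burn rate of 1.0 the same helper returns 720 hours, which is exactly the end of the window.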

Burn Rate Alert Thresholds (Best Practice)

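The thresholds below follow the multiple burn rate table in Google's SRE Workbook. The workbook itself pages on the first two tiers and files a ticket for the third; the routing shown here is a softer variant of that scheme: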

  • Page immediately when burn rate > 14.4 over a 1-hour window (2% of monthly budget consumed in 1 hour)
  • Ticket/warning when burn rate > 6 over a 6-hour window (5% of monthly budget consumed in 6 hours)
  • Weekly review when burn rate > 1 over a 3-day window (10% of monthly budget consumed in 3 days)
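
Each threshold above is just the budget share divided by the share of the period the window covers. The sketch below derives the three numbers and checks observed burn rates against them. It assumes a 30-day period; `BurnTier`, `TIERS`, and `triggered_tiers` are illustrative names, and the observed values are invented.

```python
from dataclasses import dataclass

PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


@dataclass(frozen=True)
class BurnTier:
    name: str
    window_hours: int
    budget_fraction: float  # share of the period's budget spent within the window

    @property
    def threshold(self) -> float:
        # Burn rate that spends `budget_fraction` of the budget in `window_hours`.
        return self.budget_fraction * PERIOD_HOURS / self.window_hours


TIERS = (
    BurnTier("page", 1, 0.02),
    BurnTier("ticket", 6, 0.05),
    BurnTier("weekly-review", 72, 0.10),
)


def triggered_tiers(observed: dict) -> list:
    """`observed` maps window length in hours to the burn rate measured over it."""
    return [t.name for t in TIERS
            if observed.get(t.window_hours, 0.0) > t.threshold]


for tier in TIERS:
    print(f"{tier.name}: burn rate > {tier.threshold:.1f} over {tier.window_hours}h")

# Hypothetical measurements: a sharp spike in the last hour on top of a slow leak.
print(triggered_tiers({1: 20.0, 6: 3.0, 72: 1.5}))
```

Keeping the budget share and window as the inputs, rather than hard-coding 14.4 and 6, makes it easier to adapt the tiers to a different SLO period.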

The Third-Party Problem Nobody Budgets For

Here's what the textbooks don't tell you: your error budget burns whether the outage is your fault or not.

When Stripe goes down, your payment flow fails. When your cloud provider's managed database flaps, your API returns 503s. When your CDN has a regional incident, your users experience timeouts. Every one of those bad requests counts against your SLO. Your SLO doesn't have a "vendor caused it" exemption clause — and neither do your users.

The failure mode looks like this:

  1. Third-party vendor silently degrades at 3 AM
  2. Their status page shows green ("All Systems Operational")
  3. Your alerts fire 8-12 minutes later when your own monitoring catches the tail latency spike
  4. You've already burned 15-20% of your monthly budget while your on-call engineer was asleep
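
To see how quickly detection delay eats the budget, the sketch below converts a delay and a failure fraction into a share of the monthly budget. It assumes evenly distributed traffic and a 99.9% SLO over 30 days; the function name and the example inputs are illustrative, not measurements.

```python
def budget_burned_before_detection(delay_minutes: float,
                                   failure_fraction: float,
                                   slo_target: float = 0.999,
                                   window_days: int = 30) -> float:
    """Share of the window's error budget spent before anyone is alerted.

    failure_fraction is the share of requests failing during the delay
    (1.0 for a hard outage, lower for a partial degradation).
    Assumes traffic is spread evenly across the window.
    """
    budget_minutes = (1.0 - slo_target) * window_days * 24 * 60
    equivalent_downtime = delay_minutes * failure_fraction
    return equivalent_downtime / budget_minutes


# Hypothetical: 10 minutes undetected with 75% of payment calls failing.
print(f"{budget_burned_before_detection(10, 0.75):.1%}")
# Hypothetical: 8 minutes of a hard outage.
print(f"{budget_burned_before_detection(8, 1.0):.1%}")
```

Tightening the SLO to 99.99% multiplies both results by ten, which is why detection time dominates at aggressive targets.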

The mitigation is external, independent monitoring — not trusting the vendor to tell you they're broken. Tools like IsDown's status page aggregator (https://isdown.app/status-page-aggregator-as-a-service) monitor hundreds of third-party services in real time and alert you the moment a dependency degrades, before your own systems surface the downstream impact. This isn't a nice-to-have for teams with aggressive SLOs — it's table stakes.

The Error Budget Policy: Where Theory Meets Reality

An error budget without a policy is just a number on a dashboard that engineers ignore when deadlines are tight. The policy is the mechanism that forces the tradeoff between reliability and velocity into the open.

A minimal, enforceable policy has three components:

1. The Trigger Conditions

  • Budget > 50% remaining: Normal operations. Ship features, take calculated risks, run experiments.
  • Budget 25-50% remaining: Caution mode. No high-risk deploys. Reliability work gets prioritized in sprint planning.
  • Budget < 25% remaining: Feature freeze. Engineering time redirects to reliability work until budget recovers or the window resets.
  • Budget exhausted: Incident review required before any new feature work resumes.
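
The trigger conditions above translate directly into a small lookup. This is a sketch only: `PolicyState` and `policy_state` are hypothetical names, and treating exactly 25% and exactly 50% as caution mode is an assumption, since the list does not say which side the boundaries fall on.

```python
from enum import Enum


class PolicyState(Enum):
    NORMAL = "normal operations"
    CAUTION = "caution mode"
    FREEZE = "feature freeze"
    EXHAUSTED = "incident review required"


def policy_state(remaining_fraction: float) -> PolicyState:
    """Map the remaining budget share (0.68 for 68%) to a policy state."""
    if remaining_fraction <= 0.0:
        return PolicyState.EXHAUSTED
    if remaining_fraction < 0.25:
        return PolicyState.FREEZE
    if remaining_fraction <= 0.50:
        # Assumption: the 25% and 50% boundaries both count as caution mode.
        return PolicyState.CAUTION
    return PolicyState.NORMAL


for remaining in (0.68, 0.40, 0.10, -0.05):
    print(f"{remaining:+.0%} remaining -> {policy_state(remaining).value}")
```

Encoding the policy this way keeps the thresholds in one reviewable place instead of scattered across dashboards.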

2. The Decision Authority

Who can override a feature freeze? In practice, the answer should be: almost nobody. If the VP of Engineering can wave away the policy every time a release is under pressure, the policy has zero value. Define the override process — and make it expensive enough that it's used rarely.

3. The Review Cadence

Budget reviews should happen weekly, not monthly. A monthly review means you discover you've been over budget for three weeks. A weekly review lets you course-correct before the situation becomes a postmortem.

Structuring Error Budgets for Multiple Services

In a microservices architecture, you'll have dozens of services, each with their own SLOs. The practical question: should each service have an independent error budget, or do you roll up to a user-facing composite SLO?

Best Practice: Both. Maintain per-service SLOs for engineering accountability, and maintain a user-facing composite SLO (often called a "journey SLO" or "customer SLO") for executive reporting and policy enforcement.

The user-facing composite SLO is what drives the error budget policy. Individual service SLOs drive team-level prioritization.

Integrating Error Budget Monitoring Into Your Alerting Stack

Burn rate alerts are only useful if they reach the right people at the right time. The operational layer matters as much as the math.

For teams using PagerDuty, burn rate alerts should map to your existing escalation policies: fast burn (>14.4x) goes to the on-call engineer immediately; slow burn (>6x over 6 hours) can route to a lower-urgency queue. Connecting your monitoring to PagerDuty (https://isdown.app/integrations/pagerduty) means burn rate spikes — including those caused by third-party dependency failures — trigger your existing workflows without requiring a separate tool stack.

The architecture recommendation:

  1. SLI measurement — your observability platform (Datadog, Prometheus, etc.) calculates error rate
  2. Burn rate calculation — alerting rules compute burn rate from error rate and SLO target
  3. Budget tracking — a dedicated SLO tool or dashboard tracks budget remaining across the window
  4. Dependency monitoring — independent external monitoring (not your own infra) watches third-party services
  5. Escalation — unified alerting routes all signals to the appropriate on-call channel

The dependency monitoring layer is the one most teams skip. It's also the one that catches the incidents that blindside you in the middle of the night, when a payment processor goes dark and its status page still shows green.

Error Budget SRE Maturity: Where Most Teams Actually Are

Be honest about where your organization sits:

Level 0 — No SLOs: Reliability is vibes-based. Incidents are declared when someone important complains.

Level 1 — SLOs exist, no policy: You have a dashboard that shows error budget. Nobody changes behavior based on it. Feature freezes have never happened.

Level 2 — Policy exists, rarely enforced: The policy is in the runbook. It gets overridden when sprint pressure is high. The error budget review is a 10-minute agenda item that gets cut when the meeting runs long.

Level 3 — Policy is enforced, third-party risk is unaccounted for: Feature freezes happen. But vendor downtime still blindsides you, and the root cause analysis of half your budget overruns is "dependency X was degraded."

Level 4 — Full stack: Policy enforced, vendor dependencies monitored independently, burn rate alerts are multi-window, and the SLO target was set based on actual dependency reliability math — not wishful thinking.

In our experience, most teams are at Level 1 or 2. Level 3 is achievable in a quarter. Level 4 requires organizational will and the right tooling.
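
One simple form of the dependency reliability math mentioned at Level 4 is to multiply the availabilities of everything on the critical path. The sketch below assumes the dependencies fail independently and that each one is a hard dependency (its failure fails the request); the function name and the availability figures are hypothetical.

```python
from math import prod


def serial_availability(availabilities: list) -> float:
    """Best-case availability of a request path that needs every component.

    Assumes independent failures and no fallback for any component.
    """
    return prod(availabilities)


# Hypothetical critical path: payment API, managed database, CDN.
dependencies = [0.9995, 0.9999, 0.999]
ceiling = serial_availability(dependencies)
print(f"dependency ceiling: {ceiling:.4%}")
print("99.9% SLO achievable from dependencies alone:", ceiling >= 0.999)
```

With these inputs the ceiling lands below 99.9%, so a 99.9% target would be over budget before the team's own code contributes a single failure.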

Frequently Asked Questions

What's the difference between an error budget and an SLA?

An SLA (Service Level Agreement) is a contractual commitment to customers, usually with financial penalties for violations. An error budget is an internal operational tool derived from your SLO — it quantifies how much unreliability you can absorb in a window. SLAs are typically set at lower thresholds than your SLOs (e.g., 99.5% SLA, 99.9% SLO) to give you a buffer. Breaching an error budget is an internal trigger for engineering action. Breaching an SLA is a customer-facing contractual failure.

Can we exclude vendor-caused downtime from our error budget?

Contractually, with customers, sometimes. Operationally, you shouldn't. Your users experienced the failure whether Stripe caused it or you caused it. Excluding vendor downtime from your internal error budget creates a false sense of reliability and leaves you unable to prioritize vendor monitoring as a reliability investment. Track it separately for attribution purposes, but count it against the budget.

How do we handle planned maintenance windows?

Planned maintenance should be accounted for explicitly. Some organizations exclude maintenance windows from SLO calculations entirely; others count them but budget for them intentionally. The standard approach: schedule maintenance when error budget is healthy (>50% remaining), avoid scheduling during high-traffic periods, and always communicate windows via your status page in advance. If maintenance causes more downtime than budgeted, it eats into your error budget like any other incident.

What if our SLO target is unrealistic?

Reset it. An SLO you consistently miss is worse than no SLO — it trains your team to ignore reliability metrics. The right SLO is the highest target you can realistically maintain while shipping features at your current velocity. Start with your historical reliability data, subtract vendor dependency risk, and set a target you can defend. You can always tighten it later.

How do we get buy-in from product and engineering leadership for error budget policies?

Quantify the cost of reliability failures in terms leadership cares about: customer churn, SLA credits issued, engineering time spent on incidents. Then present the error budget policy as the mechanism that prevents those costs — not as a constraint on velocity, but as the signal that tells you when to invest in reliability vs. when it's safe to ship fast. The framing matters: error budgets are not a speed limiter. They're a speed enabler when they're healthy.

Nuno Tomas, Founder of IsDown
