Error Budget in SRE: The Complete Guide (2026)

Published on May 20, 2026.

TL;DR: Error budgets are the operationalized form of your SLO: they quantify exactly how much unreliability you can afford in a given window. A 99.9% availability SLO gives you 43.2 minutes of downtime per month — that's your budget. Burn it fast, freeze feature work. Burn it slow, ship faster. The system only works if you measure burn rate in real time, hold vendors accountable (their downtime counts against your budget too), and enforce the policy without exceptions.

What an Error Budget Actually Is

An error budget is the acceptable amount of unreliability permitted by your SLO over a defined time window. It is not a target. It is not a stretch goal. It is a hard ceiling that, when breached, should trigger a pre-agreed organizational response — feature freezes, postmortems, or infrastructure investment.

The formula is blunt:

Error Budget = 1 - SLO Target
Error Budget (time) = (1 - SLO Target) × Window Duration

For a 30-day window:

99.9% SLO → 0.1% budget → 43.2 minutes of allowable downtime
99.95% SLO → 0.05% budget → 21.6 minutes
99.99% SLO → 0.01% budget → 4.32 minutes

That last number should make you uncomfortable. Four minutes across an entire month. A single bad deploy, a flapping dependency, one vendor going dark — and you're over budget before your on-call engineer has finished reading the alert.
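The arithmetic behind those figures can be checked with a short sketch. The function name `budget_minutes` and its parameters are illustrative and not taken from any library; it assumes a time-based availability SLO expressed as a fraction and a 30-day window by default.

```python
def budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60  # 43,200 minutes for 30 days
    return (1.0 - slo_target) * window_minutes


# Reproduce the three rows listed above.
for target in (0.999, 0.9995, 0.9999):
    print(f"{target:.2%} SLO -> {budget_minutes(target):.2f} minutes")
```

Changing `window_days` to 28 or 31 shows how much the budget moves with the window length, which is one reason to pin the window explicitly.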

How Error Budget Calculation Works in Practice

The SLO error budget calculation requires three inputs:

The SLO target — what percentage of requests/interactions must succeed
The measurement window — rolling 30 days is the SRE standard; calendar month introduces edge cases
The SLI — the specific metric being measured (availability, latency, correctness)

For a request-based SLI:

Error Budget (requests) = Total Requests × (1 - SLO Target)
Budget Remaining = Error Budget - Bad Requests Observed

Example: 10 million requests in a 30-day window with a 99.9% SLO:

Total error budget: 10,000 bad requests
If 3,200 bad requests have occurred, budget remaining: 6,800 requests (68%)
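The same example as a minimal sketch. `request_budget` is a hypothetical helper, not an API from any SLO tool; it assumes you already have total and bad request counts for the window.

```python
def request_budget(total_requests: int, bad_requests: int, slo_target: float):
    """Return (budget, remaining, remaining_fraction) for a request-based SLI."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    budget = total_requests * (1.0 - slo_target)  # bad requests the SLO allows
    remaining = budget - bad_requests             # negative means overspent
    return budget, remaining, remaining / budget


budget, remaining, fraction = request_budget(10_000_000, 3_200, 0.999)
print(f"budget={budget:.0f} remaining={remaining:.0f} ({fraction:.0%})")
```

A negative `remaining` is left as-is on purpose: how far over budget you are is usually worth reporting.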

For time-based SLIs (availability):

Budget Minutes = (1 - SLO Target) × Total Minutes in Window
Error Budget Consumed % = Downtime Minutes / Budget Minutes
Budget Remaining % = 1 - (Downtime Minutes / Budget Minutes)
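For the time-based case, one way to compute the remaining budget is to sum the outage durations, divide by the budget expressed as a duration, and subtract from 1. The sketch below assumes outages are recorded as (start, end) pairs that do not overlap and fall inside the window; the function name and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def budget_remaining_fraction(outages, slo_target, window=timedelta(days=30)):
    """Fraction of a time-based error budget left after the given outages.

    `outages` is a list of (start, end) datetime pairs, assumed to be
    non-overlapping and entirely inside the window.
    """
    downtime = sum((end - start for start, end in outages), timedelta())
    budget = window * (1.0 - slo_target)  # allowed downtime as a timedelta
    return 1.0 - downtime / budget        # timedelta / timedelta is a float


# Hypothetical incidents: one 10-minute and one 5-minute outage.
outages = [
    (datetime(2026, 5, 3, 3, 0), datetime(2026, 5, 3, 3, 10)),
    (datetime(2026, 5, 11, 14, 30), datetime(2026, 5, 11, 14, 35)),
]
print(f"{budget_remaining_fraction(outages, 0.999):.1%} of budget remaining")
```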

Burn Rate: The Metric That Actually Matters

Raw budget remaining is a lagging indicator. Burn rate is the signal. Burn rate measures how fast you're consuming your error budget relative to how fast you're earning it.

Burn Rate = Error Rate / (1 - SLO Target)
where Error Rate = bad requests / total requests over the lookback window you are alerting on (for example, the last hour).

At a burn rate of 1.0, you're consuming your budget at exactly the rate it accrues — you'll hit 0% remaining at the end of the window with nothing to spare. That's already bad.

At a burn rate of 14.4 over a 1-hour window, you're consuming 2% of your monthly error budget per hour — enough to warrant an immediate page.
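A small sketch makes the two statements above concrete. The helper names are illustrative, and the 720-hour default assumes the 30-day window used throughout this guide.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0  # no traffic in the lookback, nothing burned
    return (bad / total) / (1.0 - slo_target)


def budget_consumed(rate: float, lookback_hours: float,
                    window_hours: float = 720.0) -> float:
    """Fraction of the window's budget burned if `rate` holds for the lookback."""
    return rate * lookback_hours / window_hours


def hours_to_exhaustion(rate: float, remaining_fraction: float,
                        window_hours: float = 720.0) -> float:
    """Hours until the budget hits zero if the current burn rate continues."""
    if rate <= 0:
        return float("inf")
    return remaining_fraction * window_hours / rate


rate = burn_rate(bad=144, total=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}")
print(f"budget burned in 1 hour: {budget_consumed(rate, 1):.1%}")
print(f"hours until a full budget is gone: {hours_to_exhaustion(rate, 1.0):.0f}")
```

At a burn rate of 1.0 with a full budget, `hours_to_exhaustion` returns 720, which is the "nothing to spare" case described above.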

Burn Rate Alert Thresholds (Best Practice)

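The thresholds below come from the multiwindow, multi-burn-rate alerting approach in Google's SRE Workbook. The workbook pages on both of the first two tiers and opens a ticket on the third; the routing shown here is a softer adaptation of it: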

Page immediately when burn rate > 14.4 over a 1-hour window (2% of monthly budget consumed in 1 hour)
Ticket/warning when burn rate > 6 over a 6-hour window (5% of monthly budget consumed in 6 hours)
Weekly review when burn rate > 1 over a 3-day window (10% of monthly budget consumed in 3 days)
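The tiers above can be sketched as a simple lookup. This is an illustration of the routing logic only, not a drop-in alerting rule: the `TIERS` table and `alert_action` are hypothetical names, and the sketch assumes something else has already measured the burn rate over each lookback.

```python
from typing import Dict, Optional

# (lookback hours, burn-rate threshold, action), most urgent first.
TIERS = [
    (1, 14.4, "page"),
    (6, 6.0, "ticket"),
    (72, 1.0, "weekly-review"),
]


def alert_action(burn_rates: Dict[int, float]) -> Optional[str]:
    """Return the most urgent action whose threshold is exceeded, or None.

    `burn_rates` maps lookback hours to the burn rate measured over that lookback.
    """
    for lookback, threshold, action in TIERS:
        if burn_rates.get(lookback, 0.0) > threshold:
            return action
    return None


print(alert_action({1: 20.0, 6: 7.0, 72: 2.0}))
print(alert_action({1: 3.0, 6: 7.0, 72: 2.0}))
print(alert_action({1: 0.5, 6: 0.8, 72: 1.2}))
```

Ordering the tiers from most to least urgent means a fast burn never gets downgraded to a ticket just because the slower windows are also over threshold.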

The Third-Party Problem Nobody Budgets For

Here's what the textbooks don't tell you: your error budget burns whether the outage is your fault or not.

When Stripe goes down, your payment flow fails. When your cloud provider's managed database flaps, your API returns 503s. When your CDN has a regional incident, your users experience timeouts. Every one of those bad requests counts against your SLO. Your SLO doesn't have a "vendor caused it" exemption clause — and neither do your users.

The failure mode looks like this:

Third-party vendor silently degrades at 3 AM
Their status page shows green ("All Systems Operational")
Your alerts fire 8-12 minutes later when your own monitoring catches the tail latency spike
At a 99.9% SLO, you may already have burned 15-20% or more of your monthly budget while your on-call engineer was asleep

The mitigation is external, independent monitoring — not trusting the vendor to tell you they're broken. Tools like IsDown's status page aggregator (https://isdown.app/status-page-aggregator-as-a-service) monitor hundreds of third-party services in real time and alert you the moment a dependency degrades, before your own systems surface the downstream impact. This isn't a nice-to-have for teams with aggressive SLOs — it's table stakes.

The Error Budget Policy: Where Theory Meets Reality

An error budget without a policy is just a number on a dashboard that engineers ignore when deadlines are tight. The policy is the mechanism that forces the tradeoff between reliability and velocity into the open.

A minimal, enforceable policy has three components:

1. The Trigger Conditions

Budget > 50% remaining: Normal operations. Ship features, take calculated risks, run experiments.
Budget 25-50% remaining: Caution mode. No high-risk deploys. Reliability work gets prioritized in sprint planning.
Budget < 25% remaining: Feature freeze. Engineering time redirects to reliability work until budget recovers or the window resets.
Budget exhausted: Incident review required before any new feature work resumes.
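Writing the trigger conditions down as code removes any ambiguity about the boundaries. The sketch below is one possible encoding; the state names are invented for illustration, and treating exactly 25% and exactly 50% as caution mode is an assumption, since the list above does not say which side the boundaries fall on.

```python
def policy_state(remaining_fraction: float) -> str:
    """Map the fraction of error budget remaining to a policy state."""
    if remaining_fraction <= 0.0:
        return "exhausted"       # incident review before new feature work
    if remaining_fraction < 0.25:
        return "feature-freeze"  # engineering time goes to reliability
    if remaining_fraction <= 0.50:
        return "caution"         # no high-risk deploys
    return "normal"              # ship features, run experiments


for remaining in (0.68, 0.40, 0.10, -0.05):
    print(f"{remaining:+.0%} remaining -> {policy_state(remaining)}")
```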

2. The Decision Authority

Who can override a feature freeze? In practice, the answer should be: almost nobody. If the VP of Engineering can wave away the policy every time a release is under pressure, the policy has zero value. Define the override process — and make it expensive enough that it's used rarely.

3. The Review Cadence

Budget reviews should happen weekly, not monthly. A monthly review means you discover you've been over budget for three weeks. A weekly review lets you course-correct before the situation becomes a postmortem.

Structuring Error Budgets for Multiple Services

In a microservices architecture, you'll have dozens of services, each with its own SLOs. The practical question: should each service have an independent error budget, or do you roll up to a user-facing composite SLO?

Best Practice: Both. Maintain per-service SLOs for engineering accountability, and maintain a user-facing composite SLO (often called a "journey SLO" or "customer SLO") for executive reporting and policy enforcement.

The user-facing composite SLO is what drives the error budget policy. Individual service SLOs drive team-level prioritization.

Integrating Error Budget Monitoring Into Your Alerting Stack

Burn rate alerts are only useful if they reach the right people at the right time. The operational layer matters as much as the math.

For teams using PagerDuty, burn rate alerts should map to your existing escalation policies: fast burn (>14.4x) goes to the on-call engineer immediately; slow burn (>6x over 6 hours) can route to a lower-urgency queue. Connecting your monitoring to PagerDuty (https://isdown.app/integrations/pagerduty) means burn rate spikes — including those caused by third-party dependency failures — trigger your existing workflows without requiring a separate tool stack.

The architecture recommendation:

SLI measurement — your observability platform (Datadog, Prometheus, etc.) calculates error rate
Burn rate calculation — alerting rules compute burn rate from error rate and SLO target
Budget tracking — a dedicated SLO tool or dashboard tracks budget remaining across the window
Dependency monitoring — independent external monitoring (not your own infra) watches third-party services
Escalation — unified alerting routes all signals to the appropriate on-call channel

The dependency monitoring layer is the one most teams skip. It's also the one that catches the incidents that blindside you in the middle of the night, when a payment processor goes dark and its status page still shows green.

Error Budget SRE Maturity: Where Most Teams Actually Are

Be honest about where your organization sits:

Level 0 — No SLOs: Reliability is vibes-based. Incidents are declared when someone important complains.

Level 1 — SLOs exist, no policy: You have a dashboard that shows error budget. Nobody changes behavior based on it. Feature freezes have never happened.

Level 2 — Policy exists, rarely enforced: The policy is in the runbook. It gets overridden when sprint pressure is high. The error budget review is a 10-minute agenda item that gets cut when the meeting runs long.

Level 3 — Policy is enforced, third-party risk is unaccounted for: Feature freezes happen. But vendor downtime still blindsides you, and the root cause analysis of half your budget overruns is "dependency X was degraded."

Level 4 — Full stack: Policy enforced, vendor dependencies monitored independently, burn rate alerts are multi-window, and the SLO target was set based on actual dependency reliability math — not wishful thinking.

In our experience, most teams are at Level 1 or 2. Level 3 is achievable in a quarter. Level 4 requires organizational will and the right tooling.

Frequently Asked Questions

What's the difference between an error budget and an SLA?

An SLA (Service Level Agreement) is a contractual commitment to customers, usually with financial penalties for violations. An error budget is an internal operational tool derived from your SLO — it quantifies how much unreliability you can absorb in a window. SLAs are typically set at lower thresholds than your SLOs (e.g., 99.5% SLA, 99.9% SLO) to give you a buffer. Breaching an error budget is an internal trigger for engineering action. Breaching an SLA is a customer-facing contractual failure.

Can we exclude vendor-caused downtime from our error budget?

Contractually, with customers, sometimes. Operationally, you shouldn't. Your users experienced the failure whether Stripe caused it or you caused it. Excluding vendor downtime from your internal error budget creates a false sense of reliability and leaves you unable to prioritize vendor monitoring as a reliability investment. Track it separately for attribution purposes, but count it against the budget.

How do we handle planned maintenance windows?

Planned maintenance should be accounted for explicitly. Some organizations exclude maintenance windows from SLO calculations entirely; others count them but budget for them intentionally. The standard approach: schedule maintenance when error budget is healthy (>50% remaining), avoid scheduling during high-traffic periods, and always communicate windows via your status page in advance. If maintenance causes more downtime than budgeted, it eats into your error budget like any other incident.

What if our SLO target is unrealistic?

Reset it. An SLO you consistently miss is worse than no SLO — it trains your team to ignore reliability metrics. The right SLO is the highest target you can realistically maintain while shipping features at your current velocity. Start with your historical reliability data, subtract vendor dependency risk, and set a target you can defend. You can always tighten it later.
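One rough way to account for vendor dependency risk is to multiply the availability of everything a request must pass through. The sketch below assumes hard serial dependencies and independent failures, which real outages often violate, so treat the result as an optimistic estimate. All figures are hypothetical, not measurements.

```python
from math import prod


def serial_availability(availabilities):
    """Estimated availability of a path that needs every component up.

    Assumes failures are independent and every component is a hard dependency.
    """
    return prod(availabilities)


# Hypothetical figures: your own service, a payment vendor, a managed database.
path = {"own service": 0.9995, "payment vendor": 0.999, "managed database": 0.9995}
estimate = serial_availability(path.values())
print(f"estimated path availability: {estimate:.3%}")
print(f"99.9% target defensible: {estimate >= 0.999}")
```

With these inputs the estimate lands near 99.8%, so a 99.9% target would be missed even if your own code never failed.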

How do we get buy-in from product and engineering leadership for error budget policies?

Quantify the cost of reliability failures in terms leadership cares about: customer churn, SLA credits issued, engineering time spent on incidents. Then present the error budget policy as the mechanism that prevents those costs — not as a constraint on velocity, but as the signal that tells you when to invest in reliability vs. when it's safe to ship fast. The framing matters: error budgets are not a speed limiter. They're a speed enabler when they're healthy.

Nuno Tomas, Founder of IsDown
