
Error Budget in SRE: Stop Treating It as a Number and Start Using It as a Tool

Published Sep 4, 2025.

TL;DR: An error budget is the allowed amount of unreliability in your system over a given period. Most teams calculate it, put it on a dashboard, and ignore it until something breaks. That's the wrong approach. Your error budget should drive release decisions, prioritize reliability work, and determine how fast you move. If you're not using it as a policy instrument, you're just doing math for fun.

What Is an Error Budget in SRE?

An error budget is what you get when you flip your SLO upside down. If your availability SLO is 99.9%, you have 0.1% of requests (or time) per month where your service is allowed to fail. That 0.1% is your error budget.

| SLO Target | Monthly Budget (minutes) | Weekly Budget (minutes) | Daily Budget (minutes) |
|------------|--------------------------|-------------------------|------------------------|
| 99.0%      | 438                      | 100.8                   | 14.4                   |
| 99.5%      | 219                      | 50.4                    | 7.2                    |
| 99.9%      | 43.8                     | 10.1                    | 1.44                   |
| 99.95%     | 21.9                     | 5.0                     | 0.72                   |
| 99.99%     | 4.38                     | 1.0                     | 0.14                   |
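The arithmetic is worth being able to reproduce, because you'll recompute it every time an SLO changes. A minimal sketch in Python, assuming an average-length month (365.25 / 12 days), which is the convention behind the familiar 43.8-minute figure:

```python
# Error budget in minutes for a given SLO target and window length.
# Uses an average month of 365.25 / 12 ≈ 30.44 days, which is why
# 99.9% works out to 43.8 minutes rather than 43.2 for a strict 30 days.

MINUTES_PER_DAY = 24 * 60

def budget_minutes(slo: float, window_days: float) -> float:
    """Allowed downtime in minutes over the window."""
    return (1 - slo) * window_days * MINUTES_PER_DAY

for slo in (0.99, 0.995, 0.999, 0.9995, 0.9999):
    monthly = budget_minutes(slo, 365.25 / 12)
    weekly = budget_minutes(slo, 7)
    daily = budget_minutes(slo, 1)
    print(f"{slo:.2%}  monthly={monthly:6.1f}  weekly={weekly:6.2f}  daily={daily:5.2f}")
```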

This is the math everyone knows. The problem is what happens next, or rather, what doesn't happen next.

The Real Point: Budget as a Decision Framework

Google's SRE book introduced the error budget concept with a clear purpose: create a shared currency between product and engineering. When you have budget remaining, you can take more risk. When your budget is exhausted, you slow down and focus on reliability.

The error budget isn't a KPI. It's a policy instrument. If you don't have a written policy attached to it, you just have a number.

The Hard Truth: Most engineering teams have SLOs defined somewhere. Almost none of them have an error budget policy: a documented set of rules that defines what actually changes when the budget is healthy, stressed, or exhausted. They track the number. They don't act on it. That's not reliability engineering. That's reliability theater.

What an Error Budget Policy Actually Looks Like

An error budget policy turns budget state into concrete operational decisions. It answers three questions:

  • What changes when we have budget remaining? Can we ship features faster? Run experiments? Skip some rollout steps?
  • What changes when we're approaching the limit? Do we slow releases? Add extra review gates? Pause non-critical deploys?
  • What changes when we've exhausted the budget? Do we freeze feature work? Escalate to leadership? Trigger an incident review?
| Budget State | Threshold | Engineering Response | Product Response |
|--------------|-----------|----------------------|------------------|
| Healthy | >50% remaining | Normal release cadence, experiments allowed | Feature velocity prioritized |
| Stressed | 20–50% remaining | Increased review gates, slower deploys | Reliability work gets a seat at the table |
| Critical | <20% remaining | Feature freeze, reliability sprints | No new features until budget recovers |
| Exhausted | 0% remaining | Full stop on feature work, incident review required | Leadership notified, SLO renegotiation triggered |
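To keep those thresholds out of a forgotten wiki page, you can encode them where tooling (a deploy pipeline, a Slack bot) can read them. A minimal sketch; the `BudgetState` enum and `ACTIONS` mapping are illustrative names, not any particular product's API:

```python
from enum import Enum

class BudgetState(Enum):
    HEALTHY = "healthy"      # >50% remaining
    STRESSED = "stressed"    # 20-50% remaining
    CRITICAL = "critical"    # <20% remaining
    EXHAUSTED = "exhausted"  # 0% remaining

def budget_state(remaining_fraction: float) -> BudgetState:
    """Map remaining error budget (0.0-1.0) to a policy state."""
    if remaining_fraction <= 0:
        return BudgetState.EXHAUSTED
    if remaining_fraction < 0.20:
        return BudgetState.CRITICAL
    if remaining_fraction <= 0.50:
        return BudgetState.STRESSED
    return BudgetState.HEALTHY

# Hypothetical policy actions, mirroring the table above.
ACTIONS = {
    BudgetState.HEALTHY: "normal release cadence, experiments allowed",
    BudgetState.STRESSED: "increased review gates, slower deploys",
    BudgetState.CRITICAL: "feature freeze, reliability sprint",
    BudgetState.EXHAUSTED: "full stop on feature work, incident review, notify leadership",
}

state = budget_state(0.35)
print(state, "->", ACTIONS[state])  # BudgetState.STRESSED -> increased review gates, slower deploys
```

A deploy gate that refuses to ship while the state is CRITICAL or EXHAUSTED is the simplest way to give the policy teeth.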

Getting the Policy off the Page

A policy table is only useful if people know it exists and agree to follow it. The most common failure mode isn't a badly designed policy. It's a policy that lives in a Confluence page nobody reads.

A few things that make policies stick:

  • Make it part of onboarding. Every engineer who joins the team should read the error budget policy in their first week, alongside the runbooks and the on-call guide. If it's not in onboarding, it's not a policy. It's a document.
  • Review it quarterly. SLOs change. Team size changes. Risk tolerance changes. A policy written six months ago may no longer reflect how the team actually operates. Quarterly reviews keep it honest.
  • Get explicit sign-off from product. The policy only works if product management agrees to its consequences. If the response to a budget exhaustion event is a feature freeze, product needs to have agreed to that outcome in advance, not be surprised by it during a tense incident review.
  • Make the current budget state visible by default. If engineers have to go looking for the error budget number, most won't. It should be on the main engineering dashboard, visible in sprint planning, and surfaced automatically in incident channels when it drops below a threshold.

The Hard Truth about Policy Exceptions: Every team eventually faces the situation where the budget is exhausted but the business reason to ship is strong. The pressure to make an exception is real. The policy needs to define in advance what constitutes a legitimate exception (a vendor-caused outage, a security patch, a regulatory requirement) and what doesn't. If exceptions aren't defined, every exception becomes a negotiation, and the policy loses its teeth.

The Third-Party Vendor Problem Nobody Talks About

Here's the error budget scenario that breaks most teams: your service goes down because AWS, Stripe, PagerDuty, or some other vendor had an outage. Your users see degraded service. Your SLO takes a hit. Your error budget burns. You didn't write a single line of bad code.

Third-party vendors are a budget risk you carry but don't control. What you should be doing:

  • Track third-party availability separately. Know which portion of your budget was burned by external dependencies versus internal failures; a tagging sketch follows this list.
  • Monitor vendor status proactively. If you find out about a vendor outage from a user report, you're already losing. Tools like IsDown aggregate status pages from 6,000+ vendors in one place. Connecting IsDown's PagerDuty integration means your team gets vendor context the moment something upstream breaks, so you can act before users start filing tickets.
  • Build policy exceptions for external budget burns. If your policy says "feature freeze when budget drops below 20%," does that still apply when the budget was burned by a vendor you can't control? That needs a written answer.
  • Use budget burn rate to identify fragile dependencies. If the same vendor accounts for 40% of your SLO violations over six months, that's a build-vs-buy signal.
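As referenced above, separating external from internal burn doesn't require special tooling to start: tag each budget-burning incident with its cause and aggregate. A minimal sketch with a hypothetical incident log:

```python
from collections import defaultdict

# Hypothetical incident log: (vendor name or "internal", budget minutes burned)
incidents = [
    ("stripe", 45.0),
    ("internal", 12.0),
    ("aws", 8.5),
    ("internal", 3.0),
    ("stripe", 20.0),
]

burn_by_cause: dict[str, float] = defaultdict(float)
for cause, minutes in incidents:
    burn_by_cause[cause] += minutes

total = sum(burn_by_cause.values())
for cause, minutes in sorted(burn_by_cause.items(), key=lambda kv: -kv[1]):
    print(f"{cause:10s} {minutes:6.1f} min  ({minutes / total:.0%} of total burn)")
```

If one vendor dominates this output month after month, that's the build-vs-buy signal the last bullet describes, backed by numbers.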

Pro-Tip: When a third-party vendor outage burns your error budget, open a vendor reliability tracking ticket with the date, duration, and budget impact. After six months, review these. Patterns emerge fast, and they give you hard data to bring to vendor negotiations or architectural redesign conversations.

What a Vendor Outage Actually Costs You

The impact of a third-party outage on your error budget depends on three variables: how long the outage lasts, how much of your traffic depends on that vendor, and how quickly you detect and respond.

Consider a realistic scenario: your payment provider goes down for 45 minutes during peak hours. Your checkout flow fails completely for that window. If your SLO is 99.9% on a 30-day rolling window, your total monthly budget is 43.8 minutes. A single 45-minute vendor outage just blew your entire budget for the month, with nothing left for any internal failures, deployments, or planned maintenance.
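The arithmetic, as a quick sanity check (here using a strict 30-day window, which gives 43.2 minutes; the 43.8 figure assumes an average-length month):

```python
# A 45-minute payment-provider outage against a 99.9% SLO.
slo = 0.999
window_minutes = 30 * 24 * 60          # 43,200 minutes in a 30-day window
budget = (1 - slo) * window_minutes    # ~43.2 minutes of allowed downtime

outage = 45.0                          # minutes of full checkout failure
print(f"budget: {budget:.1f} min, outage: {outage:.1f} min")
print(f"budget remaining: {budget - outage:.1f} min")  # negative: SLO already violated
```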

This is why detection speed matters as much as the outage itself. Every minute between the vendor going down and your team knowing about it is a minute of budget burning with no mitigation in place. Teams that detect vendor outages from user complaints typically lose 15 to 30 minutes before anyone starts investigating. Teams monitoring vendor status pages directly cut that window to under five minutes.

That difference, compounded over a year, is the difference between consistently meeting your SLO and consistently missing it.

When to Escalate vs. When to Wait

Not every vendor degradation warrants an all-hands response. A useful framework, sketched in code after the list:

  • Monitor and log, no immediate action: vendor reports degraded performance but your own error rate hasn't moved. Track it, open a ticket, keep watching.
  • Activate fallbacks, notify stakeholders: your error rate is rising and correlates with a vendor incident. Switch to fallback behavior where available, communicate proactively to affected users, and start the budget impact clock.
  • Full incident response: vendor outage is causing significant user impact and budget is burning fast. Treat it like an internal incident. Assign an incident commander, open a channel, and document the timeline for the post-incident review.
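A minimal sketch of that triage as code; the three boolean signals are illustrative stand-ins for whatever your monitoring actually exposes:

```python
def vendor_outage_response(vendor_degraded: bool,
                           own_error_rate_elevated: bool,
                           user_impact_significant: bool) -> str:
    """Triage a vendor incident into one of the three response levels above.

    All inputs are hypothetical signals; wire them to your own monitoring.
    """
    if user_impact_significant:
        return "full incident response: assign IC, open channel, document timeline"
    if own_error_rate_elevated and vendor_degraded:
        return "activate fallbacks, notify stakeholders, start budget-impact clock"
    if vendor_degraded:
        return "monitor and log: open a ticket, keep watching"
    return "no action"

print(vendor_outage_response(vendor_degraded=True,
                             own_error_rate_elevated=True,
                             user_impact_significant=False))
```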

The key is having this decision tree written down before an outage happens. When something breaks at 2 AM, nobody should be figuring out the escalation path from scratch.

Error Budget Burn Rate: The Leading Indicator You're Ignoring

Budget remaining is a lagging metric. Burn rate is where the real signal lives.

Burn rate tells you how quickly you're consuming your budget relative to the rate needed to last through the measurement window. A burn rate of 1.0 means you'll exhaust your budget exactly at the end of the period. A burn rate of 2.0 means you'll be out halfway through. A burn rate of 0.5 means you're consuming at half the expected pace and have headroom to move faster.

The practical value of burn rate is that it turns a slow-moving number into an actionable signal. Budget remaining tells you where you are. Burn rate tells you where you're going.
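In its simplest form, burn rate is your observed error rate divided by the error rate your SLO allows. A minimal sketch:

```python
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How fast the budget is burning relative to the sustainable pace.

    1.0 = budget lasts exactly the full window
    2.0 = budget gone halfway through the window
    0.5 = headroom to move faster
    """
    allowed_error_rate = 1 - slo
    return observed_error_rate / allowed_error_rate

# 0.2% of requests failing against a 99.9% SLO burns budget at ~2x.
print(f"{burn_rate(observed_error_rate=0.002, slo=0.999):.1f}")  # 2.0
```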

Burn Rate Alerting: Two Windows Are Better Than One

A single burn rate alert is noisy. The Google SRE Workbook recommends combining a short window and a longer window to reduce false positives, firing only when burn rate is elevated in both simultaneously.

A common implementation:

  • Fast burn alert: burn rate > 14 over the last hour. This catches active incidents consuming roughly 2% of your monthly budget in 60 minutes.
  • Slow burn alert: burn rate > 5 over the last 6 hours. This catches gradual degradations that won't trigger a pager immediately but will exhaust your budget before the window closes.

Together, these two conditions cover the spectrum from acute incidents to silent reliability drift. If you only alert on SLO compliance, you'll find out you have a problem after the damage is done. Burn rate alerts give you time to act.
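Expressed in code, the pattern is an AND of a long and a short window per alert. A minimal sketch, where `burn_rate_over` is a hypothetical stand-in for a query against your metrics backend:

```python
from typing import Callable

# hours -> burn rate over that trailing window; in practice this would be
# a query against your metrics backend.
BurnRateQuery = Callable[[float], float]

def should_page(burn_rate_over: BurnRateQuery) -> bool:
    """Two-window burn rate alert, per the Google SRE Workbook pattern.

    Each condition requires both a long and a short window to be hot,
    so a brief spike that has already recovered doesn't page anyone.
    """
    fast_burn = burn_rate_over(1.0) > 14 and burn_rate_over(5 / 60) > 14
    slow_burn = burn_rate_over(6.0) > 5 and burn_rate_over(0.5) > 5
    return fast_burn or slow_burn

# Example: a flat burn rate of 25 in every window trips the fast-burn condition.
print(should_page(lambda hours: 25.0))  # True
```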

What Burn Rate Spikes Tell You

Not all burn rate spikes mean the same thing. How you respond depends on the pattern:

  • Sharp spike, short duration: likely a single bad deploy or a brief vendor outage. Investigate the timeline, roll back if needed, and check whether the budget impact requires a policy response.
  • Sustained elevated rate: a systemic issue. Could be a flaky dependency, a slow memory leak, or a change in traffic patterns that your infrastructure isn't handling well. This warrants a reliability sprint, not just a rollback.
  • Gradual creep over days: often missed entirely without burn rate monitoring. By the time budget is visibly low, the window is half over. This pattern usually points to cumulative technical debt or a dependency that's quietly degrading.

Spiking burn rate is often your first signal of a serious incident in progress, sometimes before alert thresholds fire. This is why SLO monitoring should include burn rate alerts, not just SLO compliance alerts.

For a deeper dive on how error budgets relate to SLAs and SLIs, see our breakdown of SLA vs SLI vs SLO.

Making Error Budgets Work Across Teams

  • Budget reviews in sprint planning. Start every sprint by reviewing current budget state. It should be as normal as reviewing the backlog.
  • Joint ownership between SRE and product. The error budget policy should be signed off by both sides, not handed down from engineering.
  • Retrospectives include budget analysis. When budget is exhausted, don't just do an incident review. Do a budget review: what burned it? Was it a single incident or cumulative drift?
  • Make the data visible everywhere. Error budget health should be on dashboards in engineering and product standups.

Common Mistakes That Undermine Error Budgets

  • Setting SLOs too high. A 99.99% SLO on a service that realistically achieves 99.9% means you'll be perpetually over budget.
  • No policy = no point. An error budget without a policy is a vanity metric.
  • Resetting budget windows after incidents. This destroys the integrity of the system. The policy should be consistent, including the consequences.
  • Ignoring maintenance windows. Planned maintenance burns budget too. Your policy should account for it.

Frequently Asked Questions

What is an error budget in SRE?

An error budget is the maximum amount of unreliability your service is allowed to have within a given time window, typically a rolling 30-day period. It's derived directly from your SLO: if your availability SLO is 99.9%, your error budget is 0.1% of total requests or time.

What's the difference between an error budget and an SLO?

Your SLO is the target reliability level you're committed to. Your error budget is the inverse: the amount of unreliability that's allowed before you've violated that commitment. SLO = 99.9% means error budget = 0.1%. They're two sides of the same contract, but they serve different purposes: the SLO is the promise, the error budget is the operational tool for managing how you spend your allowed imperfection.

What should happen when we exhaust our error budget?

Your error budget policy should specify this in advance. Typically: freeze new feature deployments, dedicate engineering capacity to reliability improvements, conduct a budget burn review to identify root causes, and potentially renegotiate your SLO if it turns out to be unrealistically aggressive.

How do third-party vendor outages affect my error budget?

They count against it, even though you didn't cause them. When AWS or Stripe has an outage that degrades your service, your users experience unreliability and your SLI takes a hit. The best mitigation is early detection (so you can communicate proactively and implement fallbacks faster) and separate tracking of external vs. internal budget burns, so you can make informed architectural and vendor decisions over time.

How often should I review my error budget and SLOs?

At minimum quarterly, and always after a budget exhaustion event. SLOs should be living documents. If you're consistently at 50% of budget, your SLO might be too lenient. If you're constantly exhausted, it might be too aggressive, or you have a reliability problem that needs fixing before any SLO renegotiation.

Nuno Tomas, Founder of IsDown
