TL;DR: Error budgets are the operationalized form of your SLO: they quantify exactly how much unreliability you can afford in a given window. A 99.9% availability SLO gives you 43.2 minutes of downtime per month — that's your budget. Burn it fast, freeze feature work. Burn it slow, ship faster. The system only works if you measure burn rate in real time, hold vendors accountable (their downtime counts against your budget too), and enforce the policy without exceptions.
An error budget is the acceptable amount of unreliability permitted by your SLO over a defined time window. It is not a target. It is not a stretch goal. It is a hard ceiling that, when breached, should trigger a pre-agreed organizational response — feature freezes, postmortems, or infrastructure investment.
The formula is blunt:
Error Budget = 1 - SLO Target
Error Budget (time) = (1 - SLO Target) × Window Duration
For a 30-day window (43,200 minutes), a 99.9% SLO leaves 43.2 minutes of budget, and a 99.99% SLO leaves 4.32 minutes.
That last number should make you uncomfortable. Four minutes across an entire month. A single bad deploy, a flapping dependency, one vendor going dark — and you're over budget before your on-call engineer has finished reading the alert.
The SLO error budget calculation requires three inputs:
For a request-based SLI:
Error Budget (requests) = Total Requests × (1 - SLO Target)
Budget Remaining = Error Budget - Bad Requests Observed
Example: 10 million requests in a 30-day window with a 99.9% SLO gives an error budget of 10,000 bad requests.
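The request-based formulas can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names are hypothetical, and the count of 2,500 observed bad requests is an invented figure used only to show the subtraction.

```python
def request_budget(total_requests: int, slo_target: float) -> int:
    """Number of bad requests the SLO tolerates over the window."""
    return round(total_requests * (1 - slo_target))


def budget_remaining(total_requests: int, slo_target: float, bad_requests: int) -> int:
    """Bad requests still available; a negative value means the budget is overspent."""
    return request_budget(total_requests, slo_target) - bad_requests


budget = request_budget(10_000_000, 0.999)
# 2,500 observed bad requests is a hypothetical figure for illustration.
remaining = budget_remaining(10_000_000, 0.999, bad_requests=2_500)
print(f"budget={budget} remaining={remaining}")
```

Rounding to a whole number of requests is a design choice here; a real pipeline would more likely keep the ratio and compare it against the SLI directly.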
For time-based SLIs (availability):
Budget Minutes = (1 - SLO Target) × Total Minutes in Window
Error Budget Consumed = Downtime Minutes / Budget Minutes
Budget Remaining % = (1 - (Downtime Minutes / Budget Minutes)) × 100
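For the time-based case, a small sketch makes the relationship between downtime and budget explicit. It assumes a 30-day window by default and measures consumption against the budget minutes rather than against the whole window. The function name and the 10.8 minutes of downtime are hypothetical.

```python
def time_budget_status(downtime_minutes: float, slo_target: float, window_days: int = 30):
    """Return (budget_minutes, fraction_consumed, fraction_remaining)."""
    window_minutes = window_days * 24 * 60
    budget_minutes = (1 - slo_target) * window_minutes
    consumed = downtime_minutes / budget_minutes
    # Remaining can go negative once downtime exceeds the budget.
    return budget_minutes, consumed, 1 - consumed


# Hypothetical month: 10.8 minutes of downtime against a 99.9% SLO.
budget, consumed, remaining = time_budget_status(10.8, 0.999)
print(f"budget={budget:.1f} min consumed={consumed:.0%} remaining={remaining:.0%}")
```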
Raw budget remaining is a lagging indicator. Burn rate is the signal. Burn rate measures how fast you're consuming your error budget relative to how fast you're earning it.
Burn Rate = Error Rate / (1 - SLO Target)
Where Error Rate = bad requests / total requests over the measurement window.
At a burn rate of 1.0, you're consuming your budget at exactly the rate it accrues — you'll hit 0% remaining at the end of the window with nothing to spare. That's already bad.
At a burn rate of 14.4 over a 1-hour window, you're consuming 2% of your monthly error budget per hour — enough to warrant an immediate page.
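The burn rate arithmetic can be checked with a short sketch. It assumes a 30-day (720-hour) window. The function names and the sample counts (144 bad requests out of 10,000 in one hour) are illustrative, chosen so the result lands on the 14.4 figure.

```python
def burn_rate(bad_requests: int, total_requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_requests == 0:
        return 0.0  # no traffic, nothing burned
    error_rate = bad_requests / total_requests
    return error_rate / (1 - slo_target)


def budget_fraction_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Share of the whole window's budget spent by burning at `rate` for `hours`."""
    return rate * hours / (window_days * 24)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Time until a full budget is gone if the burn rate stays constant."""
    return float("inf") if rate <= 0 else (window_days * 24) / rate


rate = burn_rate(bad_requests=144, total_requests=10_000, slo_target=0.999)
print(f"burn rate={rate:.1f}")
print(f"budget used in 1h={budget_fraction_consumed(rate, 1):.1%}")
print(f"hours to exhaustion={hours_to_exhaustion(rate):.0f}")
```

At a sustained 14.4x burn, a full 30-day budget lasts about 50 hours, which is why this tier is treated as page-worthy.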
Google's SRE workbook recommends a multi-tier alerting strategy:
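A multi-window evaluation along these lines might look like the sketch below. The tier values (14.4x over 1 hour, 6x over 6 hours, 1x over 3 days, each paired with a shorter confirmation window) follow the example given in the SRE workbook for a 99.9% SLO as commonly cited; treat them as assumptions to verify against your own targets. The `Tier` type, the window keys, and the `evaluate` function are hypothetical names, and whether a tier pages or opens a ticket is a policy choice.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Tier:
    name: str
    long_window: str   # window that establishes a significant burn
    short_window: str  # window that confirms the burn is still happening
    threshold: float   # burn-rate multiple that must be exceeded in both
    action: str


# Assumed tiers, ordered from most to least urgent.
TIERS = [
    Tier("fast", "1h", "5m", 14.4, "page"),
    Tier("medium", "6h", "30m", 6.0, "page"),
    Tier("slow", "3d", "6h", 1.0, "ticket"),
]


def evaluate(burn_rates: Dict[str, float]) -> Optional[Tier]:
    """Return the most urgent tier whose long AND short windows both exceed the threshold."""
    for tier in TIERS:
        long_rate = burn_rates.get(tier.long_window, 0.0)
        short_rate = burn_rates.get(tier.short_window, 0.0)
        if long_rate > tier.threshold and short_rate > tier.threshold:
            return tier
    return None


# Hypothetical readings keyed by window length.
firing = evaluate({"1h": 20.0, "5m": 18.0, "6h": 4.0, "30m": 19.0, "3d": 0.8})
print(firing.name if firing else "no alert")
```

Requiring the short window to agree keeps an alert from firing long after the incident has already ended.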
Here's what the textbooks don't tell you: your error budget burns whether the outage is your fault or not.
When Stripe goes down, your payment flow fails. When your cloud provider's managed database flaps, your API returns 503s. When your CDN has a regional incident, your users experience timeouts. Every one of those bad requests counts against your SLO. Your SLO doesn't have a "vendor caused it" exemption clause — and neither do your users.
The failure mode looks like this:
The mitigation is external, independent monitoring — not trusting the vendor to tell you they're broken. Tools like IsDown's status page aggregator (https://isdown.app/status-page-aggregator-as-a-service) monitor hundreds of third-party services in real time and alert you the moment a dependency degrades, before your own systems surface the downstream impact. This isn't a nice-to-have for teams with aggressive SLOs — it's table stakes.
An error budget without a policy is just a number on a dashboard that engineers ignore when deadlines are tight. The policy is the mechanism that forces the tradeoff between reliability and velocity into the open.
A minimal, enforceable policy has three components:
Who can override a feature freeze? In practice, the answer should be: almost nobody. If the VP of Engineering can wave away the policy every time a release is under pressure, the policy has zero value. Define the override process — and make it expensive enough that it's used rarely.
Budget reviews should happen weekly, not monthly. A monthly review means you discover you've been over budget for three weeks. A weekly review lets you course-correct before the situation becomes a postmortem.
In a microservices architecture, you'll have dozens of services, each with their own SLOs. The practical question: should each service have an independent error budget, or do you roll up to a user-facing composite SLO?
Best Practice: Both. Maintain per-service SLOs for engineering accountability, and maintain a user-facing composite SLO (often called a "journey SLO" or "customer SLO") for executive reporting and policy enforcement.
The user-facing composite SLO is what drives the error budget policy. Individual service SLOs drive team-level prioritization.
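One way to sanity-check a composite target is to multiply the availabilities of everything on the request path. The sketch below assumes every component is a hard, serial dependency and that failures are independent; both are simplifications, and the service names and figures are hypothetical.

```python
from math import prod
from typing import Dict


def composite_availability(components: Dict[str, float]) -> float:
    """Upper bound on journey availability when every component must succeed."""
    return prod(components.values())


# Hypothetical checkout journey: three internal services plus one vendor.
checkout = {
    "api-gateway": 0.999,
    "cart-service": 0.999,
    "order-service": 0.999,
    "payment-vendor": 0.9995,
}

ceiling = composite_availability(checkout)
print(f"composite ceiling: {ceiling:.4%}")
```

Under these assumptions, four components that each look healthy on their own cannot support a 99.9% journey SLO, which is the kind of gap the composite view is meant to expose.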
Burn rate alerts are only useful if they reach the right people at the right time. The operational layer matters as much as the math.
For teams using PagerDuty, burn rate alerts should map to your existing escalation policies: fast burn (>14.4x) goes to the on-call engineer immediately; slow burn (>6x over 6 hours) can route to a lower-urgency queue. Connecting your monitoring to PagerDuty (https://isdown.app/integrations/pagerduty) means burn rate spikes — including those caused by third-party dependency failures — trigger your existing workflows without requiring a separate tool stack.
The architecture recommendation:
The dependency monitoring layer is the one most teams skip. It's also the one that catches the incidents that blindside you in the middle of the night when a payment processor goes dark and your status page still shows green.
Be honest about where your organization sits:
Level 0 — No SLOs: Reliability is vibes-based. Incidents are declared when someone important complains.
Level 1 — SLOs exist, no policy: You have a dashboard that shows error budget. Nobody changes behavior based on it. Feature freezes have never happened.
Level 2 — Policy exists, rarely enforced: The policy is in the runbook. It gets overridden when sprint pressure is high. The error budget review is a 10-minute agenda item that gets cut when the meeting runs long.
Level 3 — Policy is enforced, third-party risk is unaccounted for: Feature freezes happen. But vendor downtime still blindsides you, and the root cause analysis of half your budget overruns is "dependency X was degraded."
Level 4 — Full stack: Policy enforced, vendor dependencies monitored independently, burn rate alerts are multi-window, and the SLO target was set based on actual dependency reliability math — not wishful thinking.
In our experience, most teams are at Level 1 or 2. Level 3 is achievable in a quarter. Level 4 requires organizational will and the right tooling.
An SLA (Service Level Agreement) is a contractual commitment to customers, usually with financial penalties for violations. An error budget is an internal operational tool derived from your SLO — it quantifies how much unreliability you can absorb in a window. SLAs are typically set at lower thresholds than your SLOs (e.g., 99.5% SLA, 99.9% SLO) to give you a buffer. Breaching an error budget is an internal trigger for engineering action. Breaching an SLA is a customer-facing contractual failure.
Can vendor-caused downtime be excluded from the error budget? Contractually, with customers, sometimes. Operationally, you shouldn't. Your users experienced the failure whether Stripe caused it or you caused it. Excluding vendor downtime from your internal error budget creates a false sense of reliability and leaves you unable to prioritize vendor monitoring as a reliability investment. Track it separately for attribution purposes, but count it against the budget.
Planned maintenance should be accounted for explicitly. Some organizations exclude maintenance windows from SLO calculations entirely; others count them but budget for them intentionally. The standard approach: schedule maintenance when error budget is healthy (>50% remaining), avoid scheduling during high-traffic periods, and always communicate windows via your status page in advance. If maintenance causes more downtime than budgeted, it eats into your error budget like any other incident.
If your SLO target is one you keep missing, reset it. An SLO you consistently miss is worse than no SLO, because it trains your team to ignore reliability metrics. The right SLO is the highest target you can realistically maintain while shipping features at your current velocity. Start with your historical reliability data, subtract vendor dependency risk, and set a target you can defend. You can always tighten it later.
To get leadership buy-in for an error budget policy, quantify the cost of reliability failures in terms leadership cares about: customer churn, SLA credits issued, engineering time spent on incidents. Then present the policy as the mechanism that prevents those costs: not a constraint on velocity, but the signal that tells you when to invest in reliability and when it's safe to ship fast. The framing matters. Error budgets are not a speed limiter; when they're healthy, they're a speed enabler.
Nuno Tomas
Founder of IsDown