
SLA vs SLI vs SLO: What Your Team Actually Needs to Track

Published Sep 3, 2025.

TL;DR: SLIs are the raw measurements (latency, error rate, availability). SLOs are the internal targets your team commits to. SLAs are the contractual consequences when you miss them. Most teams get the order backwards — they copy an SLA percentage from a vendor contract and call it a reliability target. That's not a reliability strategy. It's a liability document. Here's how to actually use these metrics to run reliable services.

The Three Metrics — Defined Without Jargon

Before getting into what teams get wrong, let's lock in the definitions. These are not interchangeable terms.

SLI — Service Level Indicator

An SLI is a quantitative measurement of a specific behavior of your service. It's the raw signal. It tells you what's actually happening right now.

Examples of real SLIs:

  • Availability: Percentage of HTTP requests that return a non-5xx response
  • Latency: Percentage of requests served in under 300ms (p99)
  • Error rate: Ratio of failed API calls to total API calls over a 5-minute window
  • Throughput: Number of successful job completions per minute for a batch pipeline
  • Durability: Percentage of stored objects retrievable on demand (critical for storage systems)

The key word: measurable. If you can't measure it from your systems, it's not an SLI — it's a wish.
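To make "measurable" concrete, here is a minimal sketch of computing the first two SLIs above from raw counts. The function names and the empty-traffic convention are illustrative, not from any particular monitoring SDK:

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Availability SLI: fraction of requests that did not return a 5xx."""
    if total_requests == 0:
        return 1.0  # convention: no traffic counts as fully available
    return (total_requests - server_errors) / total_requests

def latency_sli(durations_ms: list[float], threshold_ms: float = 300.0) -> float:
    """Latency SLI: fraction of requests served under the threshold."""
    if not durations_ms:
        return 1.0
    fast = sum(1 for d in durations_ms if d < threshold_ms)
    return fast / len(durations_ms)

# 10,000 requests, 12 of them 5xx:
print(f"{availability_sli(10_000, 12):.4%}")  # 99.8800%
```

The point is not the code — it's that each SLI reduces to a ratio you can compute mechanically from telemetry you already have.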

SLO — Service Level Objective

An SLO is the internal reliability target you set for a given SLI. It's the line you're trying not to cross. It should be ambitious enough to matter but realistic enough that your team isn't perpetually in incident response.

Format: [SLI] will be [threshold] over [time window]

Examples:

  • Availability SLO: 99.5% of requests will succeed over a rolling 30-day window
  • Latency SLO: 99% of API calls will complete in under 500ms over a rolling 7-day window
  • Error budget: We can afford 3.6 hours of downtime per month before we're in SLO violation (at 99.5% availability)

SLOs are owned by your engineering team. They're internal. Nobody outside the company has to see them (and they probably shouldn't — more on that below).

SLA — Service Level Agreement

An SLA is a contractual commitment to customers with defined consequences for failure. It's a legal document. Penalties are typically financial (credits, refunds), but can include termination rights or escalation procedures.

SLAs are almost always more lenient than SLOs — by design. Your SLO should be your internal safety margin. The SLA is the floor you fall through before a lawyer gets involved.

| Term | Owner | Audience | Consequence of Breach | Typical Tightness |
|------|-------|----------|-----------------------|-------------------|
| SLI | Engineering | Internal | Informs SLO status | N/A (raw data) |
| SLO | Engineering / SRE | Internal + leadership | Error budget burn, feature freeze | Tighter than SLA |
| SLA | Legal / Product | Customers / Contracts | Credits, refunds, churn | Looser than SLO |

The Error Budget: The Concept That Actually Changes Behavior

If SLOs are your targets, the error budget is what makes them operational (Google SRE Book).

Error budget = 100% - SLO target

If your availability SLO is 99.9% over 30 days, you have 43.2 minutes of allowable downtime per month. That's your budget. Spend it on risky deployments, infrastructure migrations, or chaos experiments — but once it's gone, you stop shipping features and fix reliability.
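The arithmetic is simple enough to sanity-check in a few lines (a sketch; the helper name is made up):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowable downtime in minutes for an availability SLO over a window."""
    return (1 - slo) * window_days * 24 * 60

print(round(error_budget_minutes(0.999), 1))  # 43.2 minutes per month
print(round(error_budget_minutes(0.995), 1))  # 216.0 minutes, i.e. 3.6 hours
```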

This is the mechanism that forces the conversation between product and engineering. Without a defined error budget, reliability is always someone else's problem until production is on fire.

The Hard Truth: Most teams have SLOs on a dashboard somewhere. Almost none of them have a written policy that says "when error budget drops below 20%, we stop deploying new features." Without that policy, your SLO is just a number. It doesn't change anyone's behavior.

Where Teams Go Wrong

This is the section most blog posts skip. Here are the anti-patterns that kill SLO programs before they start.

Anti-Pattern #1: Setting SLOs by Copying Vendor SLAs

You sign a contract with AWS. AWS promises 99.99% uptime for EC2 deployed across multiple Availability Zones (99.5% for single-instance deployments) (AWS SLA). So you set your SLO at 99.99%.

The problem: Your SLO now has zero margin. If AWS uses its full ~4.3 minutes of contractually allowable monthly downtime, you've already breached your own SLO, and that's before your own code contributes a single error.

Best Practice: Set your SLO at least 0.5–1 percentage point tighter than any upstream SLA you depend on. Your SLO must account for your code, your dependencies, and your deployment risk — not just infrastructure uptime.

Anti-Pattern #2: Measuring Availability as "Is the Server Up?"

A server that returns HTTP 200 with a blank page is technically "up." A load balancer that times out after 60 seconds is technically "responsive." A database replica that's 48 hours behind is technically "running."

None of these are "available" from a user's perspective.

Best Practice: Define SLIs from the user's point of view. The right question is: "Can a user successfully complete the action they're trying to take?" Synthetic monitoring, real user monitoring, and canary checks get you closer to truth than ping tests.

Anti-Pattern #3: SLOs Without Time Windows

"99.9% availability" is not an SLO. It's a number without context. 99.9% over a rolling 30 days is a specific, measurable, actionable target.

Best Practice: Every SLO needs:

  • A specific SLI (what you're measuring)
  • A threshold (the target value)
  • A time window (rolling 7-day, rolling 30-day, calendar quarter)
  • A burn rate alert (when are you on track to exhaust the error budget?)
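A minimal burn-rate sketch to illustrate the last bullet. The 14.4x paging threshold follows the common multiwindow alerting convention (it exhausts a 30-day budget in roughly two days); the function names are illustrative:

```python
def burn_rate(window_error_ratio: float, slo: float) -> float:
    """How fast a measurement window is consuming error budget.
    1.0 means burning exactly on pace; higher means the budget
    will run out before the SLO window ends."""
    budget_ratio = 1 - slo  # e.g. 0.001 for a 99.9% SLO
    return window_error_ratio / budget_ratio

def should_page(window_error_ratio: float, slo: float, threshold: float = 14.4) -> bool:
    """Page on-call when the short-window burn rate crosses the threshold."""
    return burn_rate(window_error_ratio, slo) >= threshold

print(round(burn_rate(0.0144, 0.999), 1))  # 14.4
print(should_page(0.02, 0.999))            # True
```

In production you would evaluate this over multiple windows (e.g. 1h and 6h) to balance detection speed against false pages.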

Anti-Pattern #4: Too Many SLOs

Some teams instrument everything and end up with 40 SLOs. Nobody looks at 40 SLOs. Nobody builds on-call runbooks for 40 SLOs.

Best Practice: Start with 3–5 SLOs per service. Pick the ones that directly reflect user experience. You can add more later. SLO programs that start simple and expand tend to survive. Programs that launch with comprehensive coverage tend to get ignored within 90 days.

Anti-Pattern #5: Publishing SLOs Directly as SLAs

Your product team wants to put "99.95% uptime" on the pricing page. That's not an SLA — it's a marketing claim with legal teeth you're not ready to commit to.

The Hard Truth: If your SLA says 99.95% and you've never run a postmortem on what that actually means operationally, you're writing a check your on-call rotation can't cash.

Best Practice: Keep SLOs internal until you've validated them for at least one full quarter. Only externalize them as SLAs when legal, product, and engineering have agreed on: (a) what counts as a breach, (b) what the compensation model is, and (c) how you'll detect and communicate breaches in real time.

The Third-Party Dependency Problem

Here's the part nobody talks about in SLI/SLO/SLA content: your SLOs are only as good as your assumptions about upstream reliability.

Let's say you depend on Stripe for payments, AWS for compute, Twilio for notifications, and Salesforce for CRM. None of those are inside your blast radius. You can't deploy a fix when Stripe has a payment processing incident. You can't roll back when AWS us-east-1 has elevated error rates.

But they will absolutely show up in your SLIs.

When a third-party vendor goes down:

  • Your error rate spikes — but the errors aren't in your code
  • Your latency degrades — but there's nothing in your deployment history to blame
  • Your error budget burns — on an incident you didn't cause and couldn't prevent
  • Your on-call engineer gets paged — and spends 20 minutes confirming it's not you before checking the vendor's status page

This is a reliability program design problem. If your SLOs don't account for third-party failure modes, you'll burn error budget on incidents you have zero control over.

Pro-Tip: Separate your error budget attribution into "own-caused" and "dependency-caused" categories. If a third-party outage drives an SLO breach, that's still a breach — but the corrective action is completely different. You don't fix it with better code. You fix it with better dependency monitoring, circuit breakers, and graceful degradation.
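The attribution split can live in your postmortem tooling as something this simple (a sketch; the `Incident` shape and cause tags are hypothetical, not from any incident-management API):

```python
from dataclasses import dataclass

@dataclass
class Incident:
    budget_minutes_burned: float
    cause: str  # "own" or "dependency" — illustrative tagging scheme

def attribute_budget(incidents: list[Incident]) -> dict[str, float]:
    """Split total error-budget burn by cause for postmortem review."""
    totals = {"own": 0.0, "dependency": 0.0}
    for inc in incidents:
        totals[inc.cause] += inc.budget_minutes_burned
    return totals

quarter = [Incident(12.0, "own"), Incident(25.0, "dependency"), Incident(4.0, "own")]
print(attribute_budget(quarter))  # {'own': 16.0, 'dependency': 25.0}
```

Two numbers instead of one changes the conversation: 16 minutes of own-caused burn calls for engineering work; 25 minutes of dependency-caused burn calls for monitoring, circuit breakers, and vendor escalation.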

How to Stop Being Surprised by Vendor Outages

The practical fix is early detection. The faster you know a vendor is having issues, the faster you can:

  • Stop alerting on-call engineers on symptoms that aren't their fault
  • Communicate proactively to customers before they open support tickets
  • Trigger fallback behavior in your services (circuit breakers, degraded mode UX)
  • Accurately attribute SLO misses in postmortems

IsDown monitors 6,000+ official status pages (AWS, Stripe, GitHub, Datadog, Twilio, and more) and surfaces incidents before they hit your own monitoring.

You can route those alerts directly into your existing workflows — whether that's a Slack channel your on-call team watches or PagerDuty to correlate with your own incident alerts. This context cuts MTTD and keeps your error budget attribution honest.

Stop spending 20 minutes per incident confirming it's not you. See which vendors IsDown monitors →

Building a Realistic SLO Program: The Sequence That Works

If you're starting from scratch or restarting after a failed first attempt, here's the sequence:

  • Step 1 — Pick two user journeys that actually matter. Not infrastructure metrics. User-facing outcomes: "checkout completes successfully," "dashboard loads in under 2 seconds."
  • Step 2 — Define the SLIs that measure those journeys. Make them specific and automatable.
  • Step 3 — Set SLO targets based on current performance minus a realistic improvement curve. Don't set 99.99% if you're currently at 99.1%.
  • Step 4 — Calculate the error budget and put it in front of the whole team, including product.
  • Step 5 — Write the policy: what happens when error budget drops below 50%? Below 20%? Below 0%? Without written policy, the SLO is decoration.
  • Step 6 — Add third-party dependency tracking: identify your critical upstream vendors and monitor them independently so their incidents don't silently corrupt your attribution data.
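The written policy in Step 5 can be as small as a lookup from remaining budget to an agreed action. A sketch with illustrative thresholds matching the 50%/20%/0% breakpoints; the actions are examples, not prescriptions:

```python
def budget_policy(budget_remaining_pct: float) -> str:
    """Map remaining error budget (as a percentage) to an agreed team action."""
    if budget_remaining_pct <= 0:
        return "feature freeze: reliability work only"
    if budget_remaining_pct < 20:
        return "deploys require SRE sign-off"
    if budget_remaining_pct < 50:
        return "pause risky changes; review burn daily"
    return "ship normally"

print(budget_policy(65))  # ship normally
print(budget_policy(15))  # deploys require SRE sign-off
```

What matters is that the mapping is written down and agreed to by product before the budget runs low, not negotiated mid-incident.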

Run this for one quarter. Then add SLOs for your next two most important journeys. Build gradually — reliability programs that try to measure everything immediately measure nothing effectively.

Frequently Asked Questions

What's the difference between an SLO and an SLA?

An SLO is your internal reliability target — it belongs to engineering and exists to guide operational decisions. An SLA is a contractual commitment to customers with financial or legal consequences for breach. SLOs should be tighter than SLAs; the gap between them is your safety buffer. If your SLO and SLA are the same number, you have no margin for error.

How many SLOs should a service have?

Start with 3–5 per service, focused entirely on user-facing outcomes. Most teams that start with more than that end up ignoring most of them. SLOs you don't look at don't improve your reliability — they just create noise. Expand once your team is actually making operational decisions based on error budget burn.

What counts as downtime for SLA calculation purposes?

This is almost always negotiated and documented in the SLA itself. Common exclusions include scheduled maintenance windows, incidents caused by the customer, force majeure events, and third-party service failures outside the vendor's control. The ambiguity in "third-party failures" is exactly why you need your own dependency monitoring — you need to be able to prove attribution during a dispute.

My SLO keeps getting violated by vendor outages I can't control. What do I do?

First, separate your incident attribution: tag each SLO miss as "own-caused" or "dependency-caused" in your postmortems. This gives you accurate data for the reliability conversation with leadership. Second, invest in early detection — knowing a vendor is having issues before your users report it is the difference between proactive communication and reactive firefighting. Third, audit whether your SLO target is realistic given your dependency stack. If you depend on five external services that each have 99.9% SLAs, your theoretical maximum availability is already down to roughly 99.5%.
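That dependency math is worth checking explicitly: when every dependency is on the critical path, their availabilities multiply. A sketch (`serial_availability` is a made-up helper name):

```python
import math

def serial_availability(dependency_slas: list[float]) -> float:
    """Upper bound on your availability when every dependency
    is on the critical path: failures compound multiplicatively."""
    return math.prod(dependency_slas)

# Five hard dependencies, each promising 99.9%:
print(f"{serial_availability([0.999] * 5):.4%}")  # ≈ 99.5010%
```

And that's a ceiling: it assumes your own code never fails and dependency outages never overlap with your own incidents.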

Should I publish my SLOs publicly?

Only when they're externalized as SLAs with explicit consequences for breach. Publishing SLO targets without a defined compensation model creates customer expectations you may not be able to meet, with no contractual framework to handle the fallout. Keep SLOs internal until they're stable and you've agreed on what breach actually means operationally and legally.

How do I set realistic SLO targets for a new service?

Use your current observed performance as the baseline. Set your initial SLO target at roughly current p30 performance — meaning 70% of the time you're already meeting it, with room to improve. Don't set aspirational targets for new services; set measurable ones that reflect reality. You can tighten the target as reliability improves. Starting too aggressive means you burn error budget immediately and the team loses faith in the program.
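One way to sketch that baseline, assuming you have daily success-rate samples (the index arithmetic and 70% default are illustrative):

```python
def baseline_slo(daily_success_rates: list[float], already_meeting: float = 0.70) -> float:
    """Pick an initial SLO you already meet ~70% of the time:
    roughly the 30th percentile of observed daily success rates."""
    ordered = sorted(daily_success_rates)
    # Index chosen so `already_meeting` of the samples are at or above the target
    idx = int(len(ordered) * (1 - already_meeting))
    return ordered[idx]

samples = [0.991, 0.994, 0.989, 0.996, 0.992, 0.995, 0.990, 0.993, 0.997, 0.988]
print(f"{baseline_slo(samples):.3f}")  # 0.991
```

With this sample set, 7 of the 10 observed days already meet a 99.1% target — achievable from day one, with obvious headroom to tighten each quarter.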

Nuno Tomas, Founder of IsDown

