TL;DR: SLIs are the raw measurements (latency, error rate, availability). SLOs are the internal targets your team commits to. SLAs are the contractual consequences when you miss them. Most teams get the order backwards — they copy an SLA percentage from a vendor contract and call it a reliability target. That's not a reliability strategy. It's a liability document. Here's how to actually use these metrics to run reliable services.
Before getting into what teams get wrong, let's lock in the definitions. These are not interchangeable terms.
An SLI is a quantitative measurement of a specific behavior of your service. It's the raw signal. It tells you what's actually happening right now.
Examples of real SLIs:
- The proportion of HTTP requests that return a non-5xx response
- The proportion of requests served in under 300 ms
- The fraction of minutes in which a synthetic health-check probe succeeds
- Replication lag, in seconds, on read replicas
The key word: measurable. If you can't measure it from your systems, it's not an SLI — it's a wish.
An SLO is the internal reliability target you set for a given SLI. It's the line you're trying not to cross. It should be ambitious enough to matter but realistic enough that your team isn't perpetually in incident response.
Format: [SLI] will be [threshold] over [time window]
Examples:
- 99.9% of HTTP requests will return a non-5xx response over a rolling 30-day window
- 95% of checkout requests will complete in under 300 ms over a rolling 28-day window
SLOs are owned by your engineering team. They're internal. Nobody outside the company has to see them (and they probably shouldn't — more on that below).
An SLA is a contractual commitment to customers with defined consequences for failure. It's a legal document. Penalties are typically financial (credits, refunds), but can include termination rights or escalation procedures.
SLAs are almost always more lenient than SLOs — by design. Your SLO should be your internal safety margin. The SLA is the floor you fall through before a lawyer gets involved.
| Term | Owner | Audience | Consequence of Breach | Typical Tightness |
|---|---|---|---|---|
| SLI | Engineering | Internal | Informs SLO status | N/A (raw data) |
| SLO | Engineering / SRE | Internal + leadership | Error budget burn, feature freeze | Tighter than SLA |
| SLA | Legal / Product | Customers / Contracts | Credits, refunds, churn | Looser than SLO |
If SLOs are your targets, the error budget is what makes them operational (Google SRE Book).
Error budget = 100% - SLO target
If your availability SLO is 99.9% over 30 days, you have 43.2 minutes of allowable downtime per month. That's your budget. Spend it on risky deployments, infrastructure migrations, or chaos experiments — but once it's gone, you stop shipping features and fix reliability.
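The arithmetic is simple enough to sketch directly. A minimal example (the function name is illustrative, not a standard API):

```python
# Sketch: convert an availability SLO into an error budget in minutes.
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowable downtime, in minutes, for a given SLO over a window."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes

# 99.9% over 30 days -> 43.2 minutes of budget
print(round(error_budget_minutes(0.999), 1))
```

The same function makes the stakes of each extra nine concrete: 99.99% over the same window leaves only about 4.3 minutes.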
This is the mechanism that forces the conversation between product and engineering. Without a defined error budget, reliability is always someone else's problem until production is on fire.
The Hard Truth: Most teams have SLOs on a dashboard somewhere. Almost none of them have a written policy that says "when error budget drops below 20%, we stop deploying new features." Without that policy, your SLO is just a number. It doesn't change anyone's behavior.
This is the section most blog posts skip. Here are the anti-patterns that kill SLO programs before they start.
You sign a contract with AWS. AWS promises 99.99% uptime for EC2 deployed across multiple Availability Zones (99.5% for single-instance deployments) (AWS SLA). So you set your SLO at 99.99%.
The problem: Your SLO now has zero margin. If AWS uses its full ~4.3 minutes of contractually allowable monthly downtime, you've already breached your own SLO, and that's before your own code contributes a single error.
Best Practice: Set your SLO at least 0.5–1 percentage points looser than the tightest upstream SLA you depend on. Your SLO must account for your code, your dependencies, and your deployment risk — not just infrastructure uptime.
A server that returns HTTP 200 with a blank page is technically "up." A load balancer that times out after 60 seconds is technically "responsive." A database replica that's 48 hours behind is technically "running."
None of these are "available" from a user's perspective.
Best Practice: Define SLIs from the user's point of view. The right question is: "Can a user successfully complete the action they're trying to take?" Synthetic monitoring, real user monitoring, and canary checks get you closer to truth than ping tests.
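A user-centric availability check has to look past the status code. A minimal sketch — the thresholds and the content marker are illustrative assumptions, not a standard:

```python
# Sketch: classify a response as "available" from the user's perspective.
# Thresholds and the content marker are illustrative assumptions.
def is_available(status_code: int, latency_ms: float, body: str,
                 max_latency_ms: float = 2000,
                 content_marker: str = "</html>") -> bool:
    """True only if a user could plausibly complete their action."""
    if status_code != 200:
        return False                    # hard failure
    if latency_ms > max_latency_ms:
        return False                    # "up" but unusably slow
    if not body.strip():
        return False                    # HTTP 200 with a blank page
    return content_marker in body       # page actually rendered

# A 200 with an empty body counts as unavailable:
print(is_available(200, 120, ""))  # False
```

Feeding checks like this into your SLI, rather than raw ping results, is what keeps "available" aligned with what users experience.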
"99.9% availability" is not an SLO. It's a number without context. 99.9% over a rolling 30 days is a specific, measurable, actionable target.
Best Practice: Every SLO needs:
- A specific SLI it's measured against
- A numeric threshold
- A defined time window (e.g., rolling 30 days)
- An agreed measurement source, so there's no argument about whose numbers count
Some teams instrument everything and end up with 40 SLOs. Nobody looks at 40 SLOs. Nobody builds on-call runbooks for 40 SLOs.
Best Practice: Start with 3–5 SLOs per service. Pick the ones that directly reflect user experience. You can add more later. SLO programs that start simple and expand tend to survive. Programs that launch with comprehensive coverage tend to get ignored within 90 days.
Your product team wants to put "99.95% uptime" on the pricing page. That's not an SLA — it's a marketing claim with legal teeth you're not ready to commit to.
The Hard Truth: If your SLA says 99.95% and you've never run a postmortem on what that actually means operationally, you're writing a check your on-call rotation can't cash.
Best Practice: Keep SLOs internal until you've validated them for at least one full quarter. Only externalize them as SLAs when legal, product, and engineering have agreed on: (a) what counts as a breach, (b) what the compensation model is, and (c) how you'll detect and communicate breaches in real time.
Here's the part nobody talks about in SLI/SLO/SLA content: your SLOs are only as good as your assumptions about upstream reliability.
Let's say you depend on Stripe for payments, AWS for compute, Twilio for notifications, and Salesforce for CRM. None of those are inside your blast radius. You can't deploy a fix when Stripe has a payment processing incident. You can't roll back when AWS us-east-1 has elevated error rates.
But they will absolutely show up in your SLIs.
When a third-party vendor goes down:
- Your latency and error-rate SLIs degrade
- Your error budget burns
- Your on-call engineers get paged
- And none of your deploy or rollback tooling can fix it
This is a reliability program design problem. If your SLOs don't account for third-party failure modes, you'll burn error budget on incidents you have zero control over.
Pro-Tip: Separate your error budget attribution into "own-caused" and "dependency-caused" categories. If a third-party outage drives an SLO breach, that's still a breach — but the corrective action is completely different. You don't fix it with better code. You fix it with better dependency monitoring, circuit breakers, and graceful degradation.
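The attribution split is easy to operationalize. A minimal sketch, assuming each postmortem tags its incident with a cause (the function and tag names are illustrative):

```python
# Sketch: tally error-budget burn by cause, so "own-caused" and
# "dependency-caused" breaches drive different corrective actions.
from collections import defaultdict

def attribute_burn(incidents):
    """incidents: iterable of (minutes_of_burn, cause) pairs, where
    cause is 'own' or 'dependency'. Returns total burn per category."""
    totals = defaultdict(float)
    for minutes, cause in incidents:
        totals[cause] += minutes
    return dict(totals)

burn = attribute_burn([(12.0, "own"), (25.0, "dependency"), (3.5, "own")])
print(burn)  # {'own': 15.5, 'dependency': 25.0}
```

A month where dependency-caused burn dominates points at circuit breakers and vendor monitoring, not at your own deploy pipeline.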
The practical fix is early detection. The faster you know a vendor is having issues, the faster you can:
- Activate fallbacks, circuit breakers, or graceful degradation
- Communicate proactively with affected customers
- Attribute the resulting error budget burn to the right cause
IsDown monitors 6,000+ official status pages (AWS, Stripe, GitHub, Datadog, Twilio, and more) and surfaces incidents before they hit your own monitoring.
You can route those alerts directly into your existing workflows — whether that's a Slack channel your on-call team watches or PagerDuty to correlate with your own incident alerts. This context cuts MTTD and keeps your error budget attribution honest.
Stop spending 20 minutes per incident confirming it's not you. See which vendors IsDown monitors →
If you're starting from scratch or restarting after a failed first attempt, here's the sequence:
1. Pick your single most important user journey (login, checkout, search — whatever pays the bills).
2. Define one or two user-facing SLIs for that journey.
3. Baseline current performance from real measurement data.
4. Set an SLO you're already meeting most of the time, with a rolling time window.
5. Write down the error budget policy, including what happens when the budget runs low.
Run this for one quarter. Then add SLOs for your next two most important journeys. Build gradually — reliability programs that try to measure everything immediately measure nothing effectively.
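The starter sequence above fits in a spec small enough to code-review. A sketch — the field names are illustrative, not a standard schema:

```python
# Sketch: a minimal SLO spec capturing the "[SLI] will be [threshold]
# over [time window]" format plus a written error-budget policy.
from dataclasses import dataclass

@dataclass
class SLO:
    sli: str                 # what you measure
    threshold: float         # the target, e.g. 0.999
    window_days: int         # rolling measurement window
    freeze_below_pct: float  # budget % remaining that triggers a freeze

checkout = SLO(sli="checkout_success_rate", threshold=0.999,
               window_days=30, freeze_below_pct=20.0)

def should_freeze(slo: SLO, budget_remaining_pct: float) -> bool:
    """The written policy: stop feature deploys when the budget is low."""
    return budget_remaining_pct < slo.freeze_below_pct

print(should_freeze(checkout, 15.0))  # True
```

The point of the `freeze_below_pct` field is that the policy lives next to the target, so the SLO is never just a dashboard number.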
An SLO is your internal reliability target — it belongs to engineering and exists to guide operational decisions. An SLA is a contractual commitment to customers with financial or legal consequences for breach. SLOs should be tighter than SLAs; the gap between them is your safety buffer. If your SLO and SLA are the same number, you have no margin for error.
Start with 3–5 per service, focused entirely on user-facing outcomes. Most teams that start with more than that end up ignoring most of them. SLOs you don't look at don't improve your reliability — they just create noise. Expand once your team is actually making operational decisions based on error budget burn.
This is almost always negotiated and documented in the SLA itself. Common exclusions include scheduled maintenance windows, incidents caused by the customer, force majeure events, and third-party service failures outside the vendor's control. The ambiguity in "third-party failures" is exactly why you need your own dependency monitoring — you need to be able to prove attribution during a dispute.
First, separate your incident attribution: tag each SLO miss as "own-caused" or "dependency-caused" in your postmortems. This gives you accurate data for the reliability conversation with leadership. Second, invest in early detection — knowing a vendor is having issues before your users report it is the difference between proactive communication and reactive firefighting. Third, audit whether your SLO target is realistic given your dependency stack. If you depend on five external services that each carry a 99.9% SLA, your theoretical maximum availability is already capped around 99.5% (0.999^5 ≈ 99.501%).
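That dependency-stack arithmetic is worth checking for your own services. A sketch, assuming each dependency sits on the serial critical path and fails independently:

```python
# Sketch: the availability ceiling when every dependency is blocking.
# Assumes independent failures on a serial critical path.
import math

def composite_availability(dep_availabilities):
    """Product of dependency availabilities = your theoretical ceiling."""
    return math.prod(dep_availabilities)

# Five dependencies at 99.9% each:
ceiling = composite_availability([0.999] * 5)
print(f"{ceiling:.4%}")
```

Anything you promise above that ceiling is a number your dependencies can breach for you.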
Only when they're externalized as SLAs with explicit consequences for breach. Publishing SLO targets without a defined compensation model creates customer expectations you may not be able to meet, with no contractual framework to handle the fallout. Keep SLOs internal until they're stable and you've agreed on what breach actually means operationally and legally.
Use your current observed performance as the baseline. Set your initial SLO threshold around your current p70 performance — meaning roughly 70% of the time you're already meeting it, with room to improve. Don't set aspirational targets for new services; set measurable ones that reflect reality. You can tighten the target as reliability improves. Starting too aggressively means you burn error budget immediately and the team loses faith in the program.
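Baselining from observed data is a one-liner with the standard library. A sketch for a latency SLI (the sample data is illustrative):

```python
# Sketch: baseline an initial latency SLO threshold from observed
# samples, picking roughly the p70 so ~70% of requests already meet it.
import statistics

def baseline_threshold(latencies_ms, percentile=70):
    """Return the value below which `percentile`% of samples fall."""
    cut_points = statistics.quantiles(latencies_ms, n=100)
    return cut_points[percentile - 1]

samples = [100, 120, 150, 180, 200, 250, 300, 400, 500, 900]
print(baseline_threshold(samples))
```

Run this against a full month of real measurements, not a toy list, before committing the number to an SLO.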
Nuno Tomas
Founder of IsDown
The Status Page Aggregator with Early Outage Detection