TL;DR: Outage communication is a skill most teams treat as an afterthought until 3 AM when they're writing their first status update under pressure. This post gives you the exact templates and decision framework to communicate confidently during any incident, from initial acknowledgment through post-mortem, without making the situation worse.
The problem isn't that teams don't care. It's that outage communication requires two things that are in direct conflict during an active incident: moving fast and getting the words exactly right.
Under pressure, most engineers default to one of three failure modes: going silent until they have a complete answer, promising timelines they can't keep, or writing updates so technical that customers can't parse them.
The fix isn't better writers. It's better templates that remove the cognitive load of finding the right words while your pager is screaming.
The Hard Truth: Your customers are more forgiving of outages than you think. What they don't forgive is silence, missed promises, and finding out about problems from Twitter instead of from you. The quality of your communication during an outage affects retention more than the outage itself.
Every outage has three distinct communication phases, each with a different goal:
Phase 1: Acknowledgment — "We know something is wrong."
Phase 2: Updates — "Here's what we know and what we're doing."
Phase 3: Resolution — "Here's what happened and what we're doing about it."
Most teams handle Phase 3 reasonably well. It's Phases 1 and 2 that kill them: starting Phase 1 too late and providing too few Phase 2 updates.
Every outage communication template your team will ever need fits into one of three phases.
Goal: Get something out fast. Silence is worse than uncertainty.
Template A — Known Impact, Unknown Cause:
We're currently investigating an issue affecting [affected service/feature]. Some users may experience [specific impact]. We're working to identify the cause and will provide an update within [30/60] minutes.
Current status: Investigating
Template B — Known Cause, Working on Fix:
We've identified an issue with [technical component] that is causing [customer-visible impact]. Our team is actively working on a fix. We expect to have an update by [specific time].
Current status: Identified
What to fill in:
- [affected service/feature]: the customer-facing name, not your internal service name
- [specific impact]: what users actually see (errors, slow pages, failed checkouts), not the internal symptom
- [30/60] minutes / [specific time]: a deadline you're confident you can keep; a missed promised update costs more trust than a slower one
Pro-Tip: The first update should go out within 15 minutes of incident declaration, even if all you can say is "we're aware and investigating." In our experience, customers who see an update within 15 minutes are far less likely to contact support. The update doesn't need to be useful. It just needs to prove you're watching.
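If your status page tool has an API, that first acknowledgment can be automated down to a single call. Here's a minimal sketch in Python, assuming Atlassian Statuspage's incidents endpoint; the environment variable names and function are placeholders, and you'd adapt the payload for your own provider:

```python
import os
import requests

def post_acknowledgment(service: str, impact: str, update_in_minutes: int = 30) -> str:
    """Post the Template A acknowledgment to an Atlassian Statuspage page.

    Returns the new incident ID so later Phase 2 updates can reference it.
    """
    page_id = os.environ["STATUSPAGE_PAGE_ID"]  # placeholder env vars
    api_key = os.environ["STATUSPAGE_API_KEY"]

    body = (
        f"We're currently investigating an issue affecting {service}. "
        f"Some users may experience {impact}. We're working to identify "
        f"the cause and will provide an update within {update_in_minutes} minutes."
    )

    resp = requests.post(
        f"https://api.statuspage.io/v1/pages/{page_id}/incidents",
        headers={"Authorization": f"OAuth {api_key}"},
        json={"incident": {
            "name": f"Issue affecting {service}",
            "status": "investigating",
            "body": body,
        }},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["id"]
```

Wiring this to your incident-declaration step means the 15-minute clock starts with something already published, not with someone opening a blank text box.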
Goal: Keep customers from filling silence with worst-case assumptions.
Template — Progress Update:
Update [number] — [time]
We're continuing to work on [brief description of issue]. [What your team has done / found since last update]. [Current state: e.g., "We've deployed a fix to our staging environment and are running validation tests."]
We expect to provide our next update by [specific time] or sooner if the issue is resolved.
Current status: [Investigating / Identified / Monitoring]
Key rules for Phase 2 updates:
- Number every update so customers can tell at a glance whether they've seen the latest one.
- Always commit to a specific next-update time, and never miss it. A broken promise erodes more trust than the outage itself.
- Post every 30-60 minutes even if nothing has changed; "still working on it, next update at [time]" beats silence.
- Say what you've done or found since the last update, not just that you're still investigating.
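To make that cadence mechanical rather than a matter of memory, the update text itself can be generated. A small sketch, where the function name and the 45-minute default are illustrative rather than a prescribed standard:

```python
from datetime import datetime, timedelta, timezone

def render_progress_update(number: int, issue: str, progress: str,
                           next_update_minutes: int = 45) -> str:
    """Render a numbered Phase 2 update with an explicit next-update time."""
    now = datetime.now(timezone.utc)
    deadline = now + timedelta(minutes=next_update_minutes)
    return (
        f"Update {number} — {now:%H:%M} UTC\n"
        f"We're continuing to work on {issue}. {progress} "
        f"We expect to provide our next update by {deadline:%H:%M} UTC "
        f"or sooner if the issue is resolved."
    )
```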
Goal: Close the loop. Tell customers what happened and what you're doing about it.
Template — Resolution Notice:
Resolved — [time]
The issue affecting [service/feature] has been resolved as of [time]. All services are operating normally.
What happened: [1-2 sentence plain-language explanation]
Impact duration: [Start time] to [End time] - [X hours/minutes]
What we're doing to prevent recurrence: [Brief action items]
We apologize for the disruption. If you're still experiencing issues, please contact [support link].
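Impact duration is easy to fumble at 3 AM; computing it from timestamps removes the arithmetic. A sketch with illustrative names, assuming timezone-aware datetimes:

```python
from datetime import datetime

def render_resolution(service: str, what_happened: str, prevention: str,
                      started: datetime, resolved: datetime,
                      support_link: str) -> str:
    """Render the resolution notice, deriving impact duration from timestamps."""
    minutes = int((resolved - started).total_seconds() // 60)
    hours, mins = divmod(minutes, 60)
    duration = f"{hours}h {mins}m" if hours else f"{mins} minutes"
    return (
        f"Resolved — {resolved:%H:%M} UTC\n"
        f"The issue affecting {service} has been resolved as of "
        f"{resolved:%H:%M} UTC. All services are operating normally.\n"
        f"What happened: {what_happened}\n"
        f"Impact duration: {started:%H:%M} to {resolved:%H:%M} UTC ({duration})\n"
        f"What we're doing to prevent recurrence: {prevention}\n"
        f"We apologize for the disruption. If you're still experiencing "
        f"issues, please contact {support_link}."
    )
```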
What to avoid in resolution notices:
- Deep technical root-cause detail. That belongs in the postmortem, published a day or two later.
- Vague phrasing like "infrastructure issues" that reads as hiding something.
- Blaming a vendor without context. Name the dependency if it's relevant, but own the customer impact.
Here's the scenario most communication templates don't cover: the outage isn't your fault. Your vendor is down.
This happens more than most teams admit. AWS goes down and takes your payment processing with it. Stripe has an incident and your checkout breaks. Your CDN has a degradation and your app slows to a crawl.
The Hard Truth: Your customers don't care whose fault it is. They care that your product isn't working. "AWS is having an issue" is legitimate context, but only if you communicate it proactively, not as an excuse after they've already contacted support.
Template — Vendor-Caused Outage:
We're currently experiencing issues with [feature/service] due to an ongoing incident at one of our infrastructure providers. We're monitoring the situation closely and will update as it progresses.
You can track the provider's incident status here: [vendor status page link]
Current status: Monitoring
What this does:
- Tells customers before they discover the problem themselves, which is the difference between context and excuse.
- Gives them a place to watch for progress (the vendor's status page) instead of flooding your support queue.
- Sets the "Monitoring" status honestly: you can't fix the vendor, but you're watching and will relay changes.
The key to executing this well is knowing about vendor incidents before customers report them to you. IsDown monitors 6,000+ vendor status pages and sends alerts through its Slack and PagerDuty integrations the moment a vendor updates their status page, often faster than the vendor's own notification reaches you. When you know first, you communicate first.
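As one illustration of the last mile (not IsDown's actual integration), a vendor alert can be relayed to an internal or customer-facing channel with a standard Slack incoming webhook, pre-filled with the template above. The webhook URL and names here are placeholders:

```python
import requests

VENDOR_TEMPLATE = (
    "We're currently experiencing issues with {feature} due to an ongoing "
    "incident at one of our infrastructure providers. We're monitoring the "
    "situation closely and will update as it progresses.\n"
    "You can track the provider's incident status here: {vendor_status_url}"
)

def notify_slack(webhook_url: str, feature: str, vendor_status_url: str) -> None:
    """Post the vendor-outage notice to a Slack channel via an incoming webhook."""
    text = VENDOR_TEMPLATE.format(feature=feature, vendor_status_url=vendor_status_url)
    resp = requests.post(webhook_url, json={"text": text}, timeout=10)
    resp.raise_for_status()
```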
Outage communication has a tone register that's distinct from normal product writing. It's not casual, not formal. It's operational.
Do:
- Use plain language a non-engineer can follow ("payments are failing," not "the payment service is returning 503s").
- Commit to specific times for the next update, and keep them.
- Tell customers what they can and can't do right now.
Don't:
- Lead with technical root cause; save it for the postmortem.
- Make excuses or lead with whose fault it is.
- Promise a fix time you aren't confident you can hit.
| Channel | When to Use | Audience |
|---|---|---|
| Status page | Always — this is the source of truth | All customers, automated monitoring tools |
| Email | Major incidents (>30 min duration) or data impact | Customers who don't check your status page |
| In-app banner | When the affected feature is in active use | Users currently in your product |
| Twitter/social | When customers are already discussing it publicly | Public + customers who follow you |
| Enterprise customer DMs | High-value accounts during any significant incident | Your biggest customers |
The status page should always be updated first. It's the canonical record. Every other channel references it.
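The routing logic in this table is simple enough to encode so it survives a 3 AM incident. A sketch, with illustrative parameter names, that returns channels in the order they should be posted:

```python
def pick_channels(duration_minutes: int, data_impact: bool,
                  feature_in_active_use: bool, public_discussion: bool,
                  enterprise_accounts_affected: bool) -> list[str]:
    """Return communication channels in posting order, per the table above.

    The status page always comes first: it's the canonical record that
    every other channel references.
    """
    channels = ["status_page"]  # always, and always first
    if duration_minutes > 30 or data_impact:
        channels.append("email")
    if feature_in_active_use:
        channels.append("in_app_banner")
    if public_discussion:
        channels.append("social")
    if enterprise_accounts_affected:
        channels.append("enterprise_dm")
    return channels
```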
When should you post the first update?
Within 15 minutes of declaring an incident. If your team doesn't have a formal incident declaration process, the trigger should be: "If this is still broken in 15 minutes, we post." You don't need to know the cause. You just need to prove you're watching.
What should you say when you don't know the cause yet?
Post the first template above: "We're investigating an issue affecting [X]. Some users may experience [Y]. We'll update by [time]." You don't need to know the cause to communicate that you know there's a problem and you're working on it. That's enough for Phase 1.
Should you share technical details during an incident?
In most cases, no. Not during the incident. Technical root causes belong in the post-incident review or postmortem, which you can publish 24-48 hours later for significant incidents. During the incident, customers need to know what they can and can't do, not why the database is misbehaving.
How do you communicate an outage caused by a vendor?
Use the vendor-caused outage template above. Be transparent that the issue is with an external provider, link to their status page, and commit to update times. Don't hide behind "infrastructure issues". Customers appreciate honesty about third-party dependencies. The key is being the one to tell them, not making them discover it themselves.
What's the difference between a status page update and a postmortem?
A status page update is real-time operational communication: what's happening now and what you're doing about it. A postmortem is a retrospective analysis: what happened, why, and what you're changing. Status updates go out during the incident, every 30-60 minutes. Postmortems go out 24-72 hours after resolution for significant incidents. Both matter, but they serve different audiences and different timings.
Who should own communication during an incident?
Designate one person as the incident communicator whose only job during the incident is communication: not debugging, not fixing. They own the status page updates, the Slack updates, and the customer emails. Everyone else focuses on resolution. This separation prevents the two most common errors: communication that's technically accurate but incomprehensible to customers, and technical teams getting distracted by communication tasks when they should be fixing the problem.
Nuno Tomas
Founder of IsDown