# Root Cause Analysis (RCA)
Date: January 20, 2026
Status: Resolved
## Summary
On January 20, 2026, we experienced a prolonged service disruption during a backend infrastructure upgrade intended to improve system capacity. While the upgrade was expected to cause only a brief interruption, an issue during the cloud provider’s upgrade process prevented the database from completing its automatic failover, extending the impact. No customer data was lost or corrupted, and full service has been restored.
## What Happened
Earlier in the day, increased traffic caused elevated load on a backend database component, resulting in degraded performance. To address this, we initiated a capacity upgrade designed to improve stability and prevent further impact.
During the upgrade, an internal failure occurred within the cloud provider’s managed database service. This failure prevented the database from completing its expected automatic failover. As a result, the system became unavailable for write operations longer than anticipated. The upgrade process became stuck mid-operation and required direct intervention from the cloud provider (AWS) to safely complete.
## Impact to Customers
• Customers experienced service interruptions and were unable to perform write actions (such as updates or changes).
• Candidates were unable to submit availability or schedules.
• The disruption lasted longer than expected due to the failed upgrade process.
• No customer data entered into the system was lost, corrupted, or exposed.
## Timeline (2026-01-20)
• 9:00am PST - A cyclical replication lag pattern began on instance-3
• 12:04pm PST - Peak disk queue depth observed on instance-3 (114.34)
• 12:13pm PST - Instance-2 began showing elevated disk queue depth
• 12:57pm PST - Root cause identified as instance class limitation
• ~2:00pm PST - Remediation initiated: Instance class upgrade to db.r6gd.xlarge started
• ~2:30pm PST - Escalation: Database upgrade failed; site entered read-only mode
• ~3:00pm PST - Escalation: AWS Business Support engaged
• ~6:30pm PST - Service restored: Site returned to full read/write capability
• ~7:30pm PST - Full recovery: Backend workers scaled back to full capacity
• ~8:00pm PST - Incident closed; monitoring confirmed stable operations
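For readers who want the durations implied by the timeline, the arithmetic can be sketched as follows. This is an illustration, not an official measurement: entries marked "~" are estimates, the afternoon and evening entries are read as pm times, and the variable names are hypothetical.

```python
from datetime import datetime

# Approximate timestamps taken from the timeline (PST, 24-hour clock).
# Entries marked "~" in the timeline are estimates, so the results are too.
first_symptom = datetime(2026, 1, 20, 9, 0)     # replication lag pattern began
read_only_start = datetime(2026, 1, 20, 14, 30)  # ~2:30pm: site entered read-only mode
writes_restored = datetime(2026, 1, 20, 18, 30)  # ~6:30pm: full read/write restored
incident_closed = datetime(2026, 1, 20, 20, 0)   # ~8:00pm: incident closed


def hours_between(start: datetime, end: datetime) -> float:
    """Return the elapsed time between two timestamps in hours."""
    return (end - start).total_seconds() / 3600


write_outage_hours = hours_between(read_only_start, writes_restored)
total_incident_hours = hours_between(first_symptom, incident_closed)

print(f"Write outage: ~{write_outage_hours:.1f} h")
print(f"First symptom to close: ~{total_incident_hours:.1f} h")
```

Under these assumptions, writes were unavailable for roughly 4 hours, and about 11 hours passed between the first symptom and incident closure.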
## Root Cause
The root cause was a combination of increased system load and a failure during a managed database upgrade performed by our cloud provider. While the upgrade was intended to be low-impact, the provider’s automation became stuck mid-process, preventing timely recovery.
## Resolution
We worked directly with our cloud provider to diagnose and resolve the failed upgrade. Once the issue was corrected, database services resumed normal operation and system performance stabilized.
## Preventive Actions
To reduce the likelihood and impact of similar incidents in the future, we are taking the following steps:
• Increasing baseline capacity and operational headroom to better absorb traffic spikes.
• Improving detection and alerting for early signs of resource saturation.
• Enhancing safeguards and runbooks for infrastructure changes under load.
• Working with our cloud provider to review failure scenarios and escalation procedures for managed service upgrades.
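As an illustration of what early saturation detection could look like, the sketch below flags a metric that stays above a threshold for several consecutive samples. It is a minimal example under stated assumptions, not our production alerting: the function name, the threshold of 32, the three-sample window, and every sample value except the 114.34 peak from the timeline are hypothetical.

```python
from typing import Sequence


def sustained_breach(
    samples: Sequence[float], threshold: float, min_consecutive: int
) -> bool:
    """Return True if at least `min_consecutive` back-to-back samples exceed `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run on a breach; reset it on any healthy sample.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical per-minute disk queue depth samples.
# Only 114.34 is taken from the incident timeline; the rest are made up.
queue_depth = [2.1, 3.0, 48.7, 96.2, 114.34, 101.5]

# Threshold and window are assumed values chosen for illustration.
alert = sustained_breach(queue_depth, threshold=32.0, min_consecutive=3)
print("saturation alert:", alert)
```

Requiring consecutive breaches rather than a single spike is one way to reduce noisy alerts while still firing well before a peak like the one seen on instance-3.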
## Current Status
All services are fully operational and performing normally. We continue to monitor the system closely.
We take responsibility for the availability of our service and regret the extended disruption. While part of the incident involved a cloud provider failure, we are applying the lessons learned to further strengthen reliability going forward. Thank you for your patience.
Our infrastructure upgrade is taking longer than expected due to a backend synchronization step.
We’re actively working with our cloud provider to complete the upgrade safely.
No customer data is at risk.
Increased demand caused performance degradation in a backend service.
We’re scaling system capacity to resolve the issue and prevent recurrence.
A short interruption may occur during the upgrade.
We are continuing to investigate this issue.
We are currently investigating this issue.