This incident has been resolved.
Teams are continuing recovery efforts and we see improvement as health checks continue to pass. While root cause has not been confirmed, it is believed an earlier data spike caused a mass reconnection for a service supporting event notifications. Recovery has been focused on allowing new connections and retries while managing CPU usage on instances.
Errors remain above expected thresholds but are improving. Teams continue monitoring health checks on services as we recover.
Error rates are improving and health checks are beginning to pass. However, some customers may still experience issues with notifications/connections until the issue is fully resolved.
We are continuing to work on a fix for this issue.
Teams have implemented steps to mitigate high CPU and memory usage on instances still impacted by this event. We are monitoring error rates to determine whether instances are able to become healthy once the change takes effect.
Additional mitigation steps have been implemented. Some instances are seeing improvement; however, several instances continue to experience high error rates due to topology changes that were made as part of the mitigation. Teams continue to work toward full recovery.
After the previous recovery, we are now seeing new spikes in CPU usage. Teams are implementing additional steps to recover. Next update 13:00 Eastern or sooner as information becomes available.
We are continuing to see improvement as additional mitigation steps have been implemented. Continuing to monitor as error rates decline. Next update 12PM Eastern.
Mitigation steps have had some impact, and while we are seeing improvement, we continue to see elevated CPU and memory usage on instances. Teams continue to investigate and are implementing additional measures to recover. Next update 11:30AM Eastern.
Mitigation steps have been applied. Monitoring to determine if additional fix actions are needed. Next update top of the hour.
Teams are continuing triage and investigation to determine mitigation steps to restore service. Next update bottom of the hour or as information becomes available.
Teams are investigating an issue impacting notifications that may prevent agents from taking interactions. Next update 10AM or as information becomes available.