This incident has been resolved.
The root cause was a backend that started answering slower than expected, which caused many internal requests to take too long and clogged up the processing pipelines. These requests are used when probes reconnect and are handled by our event processor. The same event processor is also responsible for handling liveness signals from the controllers that manage the probes. Since these signals were also delayed, the system eventually determined that the controllers were not healthy and therefore stopped sending probes to them. The (software) probes that stayed connected never saw this problem, but the ones that tried to reconnect accumulated over time, causing the degradation.
We remediated the issue by adding low timeouts to the non-critical parts of the pipeline and thus processing the backlog quickly. As a follow-up we'll add more asynchronous processing to our events, preventing this type of issue from appearing again.
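The timeout-based remediation described above can be illustrated with a minimal sketch. This is not our production code: the names `slow_backend_lookup`, `handle_reconnect`, `drain` and the 50 ms budget are hypothetical, chosen only to show how time-boxing a non-critical step keeps a slow backend from stalling every event queued behind it.

```python
import asyncio

# Hypothetical time budget for the non-critical step (an assumption, not a real setting).
NON_CRITICAL_TIMEOUT_S = 0.05


async def slow_backend_lookup(probe_id: str, delay_s: float) -> str:
    """Stand-in for the backend that started answering slowly."""
    await asyncio.sleep(delay_s)
    return f"metadata-for-{probe_id}"


async def handle_reconnect(probe_id: str, backend_delay_s: float) -> dict:
    """Process one reconnect event, time-boxing the non-critical lookup."""
    try:
        metadata = await asyncio.wait_for(
            slow_backend_lookup(probe_id, backend_delay_s),
            timeout=NON_CRITICAL_TIMEOUT_S,
        )
    except asyncio.TimeoutError:
        # Give up on the optional data so the event does not block the queue.
        metadata = None
    return {"probe_id": probe_id, "accepted": True, "metadata": metadata}


async def drain(events):
    """Handle events one at a time, as a single shared processor would."""
    results = []
    for probe_id, delay_s in events:
        results.append(await handle_reconnect(probe_id, delay_s))
    return results


if __name__ == "__main__":
    # The second event hits a slow backend but only costs the timeout budget.
    print(asyncio.run(drain([("p1", 0.0), ("p2", 5.0)])))
```

In this sketch, each slow lookup costs at most the timeout budget rather than the full backend latency, so other events sharing the same processor (such as liveness signals) are delayed by a bounded amount.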
We implemented the necessary improvements to help the situation. We're monitoring this solution.
Over the weekend of 6-7 June we started experiencing an issue where software probes that disconnected for any reason were not allowed to connect again. Over time this caused a gradual decrease in the number of connected probes, reaching up to 20% of software probes, or about 10% of the total probe population.
We identified the root cause as a delay in the processing of internal control messages. We applied ad-hoc measures that allowed most probes to temporarily connect again, and we are currently working on a proper solution.