✅ After the infrastructure upgrades and applied improvements, message processing has returned to normal and is now happening in real time. We will continue to closely monitor performance and apply additional optimizations to ensure ongoing stability.
Processing times are now normalized, but the team continues to closely monitor the infrastructure and will keep working on further performance improvements.
Impact:
A sudden ~4x increase in message volume caused significant delays in processing external events (WhatsApp webhooks, email, etc.), impacting the application’s ingestion time.
Root Cause:
The bottleneck occurred in a shared infrastructure resource that was not scaled to handle the sudden traffic spike. In addition, the notifications feature contributed to excessive resource consumption.
Resolution:
Infrastructure upgrades and optimizations to the event intake logic were implemented, allowing the system to resume real-time message processing. As a temporary measure, the notifications feature was disabled.
With the recent infrastructure adjustments and performance improvements, events are now being processed faster than they arrive. As a result, the application should return to normal soon, and new messages will be ingested progressively closer to real time.
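To illustrate why processing faster than the arrival rate leads to recovery, here is a minimal sketch of the backlog arithmetic. It assumes constant arrival and processing rates; the function name and all numbers are hypothetical and are not taken from this incident.

```python
def drain_time_minutes(backlog: int, arrival_per_min: float, processing_per_min: float) -> float:
    """Estimate minutes until a queue backlog reaches zero, assuming constant rates."""
    net_rate = processing_per_min - arrival_per_min  # messages cleared per minute
    if net_rate <= 0:
        # Processing is not outpacing arrivals, so the backlog never shrinks.
        return float("inf")
    return backlog / net_rate


# Hypothetical values for illustration only.
print(drain_time_minutes(backlog=60_000, arrival_per_min=400, processing_per_min=600))  # 300.0
```

Under these assumed rates the queue clears 200 messages per minute net, so the delay seen by new messages falls steadily until the backlog is gone.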
We are continuing to optimize message intake performance while gradually scaling up the infrastructure. At the moment, the average processing time for incoming messages on Cloud Chat instance-1 is around 50 minutes.
We are about to roll out a new, more optimized message intake logic. The deployment is expected to take place within the next 30 minutes.
We identified that the overloaded resource is one of our workers (Sidekiq), which received a sudden surge in messages and is therefore processing tasks with delays.
Two teams are already working in parallel: one provisioning additional infrastructure and the other optimizing this worker’s logic.
We are currently investigating this issue and will provide updates here as they become available.