Async "pause" operations (step.waitForEvent, step.invoke, cancelOn) should be running again at typical throughput.
Event batching still has a backlog that we are actively working through.
Status:
• Function scheduling - Running as expected
• Function execution - Running as expected, no queue backlogs
• Async "pause" operations (waitForEvent, invoke, cancelOn) - Running as expected
• Event batching - Significant backlog
The function run backlog has been resolved as of 11:10 AM PT.
Batched functions and `step.waitForEvent` may face delays as the backlog continues to process.
Function execution scheduling and processing throughput are at normal levels.
Async "pause" operations (step.waitForEvent, step.invoke, cancelOn) are severely backlogged, which may delay completion of any of these operations. This may cause issues with your function execution if you rely on them. The team is working on fixes to clear this backlog and address the key issues. We do not yet have an ETA for resolving this specific issue.
Function scheduling delays should be caught up. With events processed and new runs scheduled, your system may still see backlogs based on your function's flow control (e.g. concurrency) config and your account's concurrency limit.
We still see backlogs in processing step.waitForEvent, step.invoke and cancelOn event expressions. We are continuing to work on this.
We also are continuing our rollout of isolated batch processing as previously mentioned to further isolate parts of our system.
EDIT - This was edited to include step.invoke as well for completeness.
The system is consuming the event backlog as fast as possible, with an ETA of ~10-15 minutes until function scheduling is caught up.
After function scheduling is caught up, function execution in your account may still be limited by your account concurrency or a given function's own flow control settings (concurrency, rate limit, etc.).
We will continue to share more updates as soon as we can.
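As a rough illustration of why runs can stay queued after scheduling catches up, the sketch below estimates how long a backlog takes to drain under a concurrency limit. It is a back-of-the-envelope model, not part of any SDK: the function name `estimateDrainSeconds`, its parameters, and the sample numbers are all hypothetical, and it assumes every run takes the same time and that the limit is the only constraint.

```typescript
// Hypothetical helper for illustration only; not part of any SDK.
// Assumes uniform run duration and that concurrency is the sole bottleneck.
function estimateDrainSeconds(
  backlogRuns: number,      // runs already scheduled and waiting to execute
  concurrencyLimit: number, // max runs executing at once (function or account level)
  avgRunSeconds: number     // assumed average duration of a single run
): number {
  if (backlogRuns < 0 || concurrencyLimit <= 0 || avgRunSeconds < 0) {
    throw new RangeError("backlogRuns and avgRunSeconds must be >= 0, concurrencyLimit must be > 0");
  }
  // Runs execute in "waves" of at most concurrencyLimit at a time.
  const waves = Math.ceil(backlogRuns / concurrencyLimit);
  return waves * avgRunSeconds;
}

// Example with made-up numbers: 1,200 queued runs, a limit of 25, 10 s per run.
const seconds = estimateDrainSeconds(1200, 25, 10);
console.log(`Estimated drain time: ${seconds / 60} minutes`);
```

Under this simplified model, a lower concurrency limit or longer runs stretch the drain time proportionally, so some accounts will see queued runs for longer than others.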
Throughput has increased since 15:37 UTC (~15 min ago). We are continuing to apply changes and prepare a larger change to decouple parts of the system.
Event observability: Events may be delayed in appearing in the dashboard, because database ingestion for these events is tied to the same part of the system that handles function scheduling.
Events continue to be ingested and the Event API remains unaffected.
Changes have increased throughput, but not yet to typical levels. We are actively testing the new system change to decouple batch processing before enabling it for all accounts.
We're deploying an in-memory optimization within the part of the system that schedules new function runs. This optimization will alleviate pressure on underlying systems and increase throughput. The change will be rolled out momentarily.
We're also working in parallel on a system change to create a dedicated service for processing event batching, which is the cause of the overall backlog on the system.
We have scaled up several resources across the system to handle a large increase in scale. Services are scaled up and we have also added new function state shards, but rollout of those new shards can take up to ~30 minutes.
We are also working on networking changes to improve the efficiency of the system under this significantly higher load.
We are actively investigating delays with function run scheduling for a subset of customers. We will provide further updates as we identify the cause and resolve the issue.