Queued jobs continue to accumulate, and we are implementing corrective actions to prevent further delays. A second round of rolling restarts is in progress to address jobs that are currently stuck. We will continue to evaluate whether these actions resolve the issue.
We will provide an update in 60 minutes or sooner if additional information becomes available.
Recent mitigation efforts, including the addition of two worker pods and a rolling restart, have helped stabilize processing. However, the trigger of the incident is still under investigation, and the issue continues to cause impact and show signs of regression. We are monitoring the volume of queued jobs and have found that some jobs are stuck in processing, consuming available slots.
As of 16:34 UTC, a second rolling restart is underway and is expected to complete in 30 to 45 minutes.
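For illustration only: a rolling restart of worker pods is commonly performed by re-stamping the pod template so the orchestrator replaces workers one at a time. The sketch below assumes the workers run as a Kubernetes Deployment; the deployment name, namespace, and use of the `kubernetes` Python client are assumptions for the example, not details confirmed in this incident.

```python
from datetime import datetime, timezone

from kubernetes import client, config

# Hypothetical deployment name and namespace; the real worker topology is not public.
DEPLOYMENT = "crma-worker"
NAMESPACE = "analytics"


def rolling_restart() -> None:
    """Trigger a rolling restart by updating the pod-template annotation,
    the same mechanism `kubectl rollout restart` relies on."""
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    patch = {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt":
                            datetime.now(timezone.utc).isoformat()
                    }
                }
            }
        }
    }
    apps.patch_namespaced_deployment(DEPLOYMENT, NAMESPACE, patch)


if __name__ == "__main__":
    rolling_restart()
```

Because this only re-stamps the pod template, the Deployment's rolling-update settings govern how many workers are replaced at once, which is why a restart of this kind completes gradually rather than all at once.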
We will provide an update in 60 minutes or sooner if additional information becomes available.
We have completed the rolling restart of all worker nodes and are actively monitoring to ensure the service continues to recover as expected. We are also monitoring the volume of queued jobs, which is gradually declining to a healthier level. Customers may continue to see an impact as the job backlog clears.
Each customer org also has built-in concurrency thresholds based on licensing. Due to these limitations, customers may observe jobs remaining in a queued state even when no jobs appear to be actively running. This is expected behaviour and will continue for several hours as queued jobs are processed in order.
Most job types, such as recipes and dataflows, should clear quickly once workers are healthy. However, some of these jobs depend on Data Sync completing first, which may extend their processing time. Customers do not need to cancel or restart any jobs; the system will automatically execute them as capacity becomes available.
During this monitoring period, some scheduled Data Sync runs may temporarily fail with “Job already in queue” errors. These errors do not impact jobs that are already queued. Customers may optionally unschedule affected connections to prevent repeated scheduling attempts, but this is not required for stability or recovery.
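As a rough illustration of how the queue drain described above could be verified from the customer side, the sketch below polls the CRM Analytics dataflow-jobs REST resource and tallies jobs by status. The instance URL, API version, token handling, and response field names are assumptions for the example; this is not a required or official remediation step.

```python
from collections import Counter

import requests

# Assumed values for illustration only; substitute your org's instance URL,
# a current API version, and a valid OAuth access token.
INSTANCE_URL = "https://example.my.salesforce.com"
API_VERSION = "v60.0"
ACCESS_TOKEN = "<oauth-access-token>"


def queued_job_summary() -> Counter:
    """Fetch CRM Analytics dataflow jobs and count them by status.

    Assumes the standard /wave/dataflowjobs resource; field names may
    vary slightly across API versions.
    """
    resp = requests.get(
        f"{INSTANCE_URL}/services/data/{API_VERSION}/wave/dataflowjobs",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    jobs = resp.json().get("dataflowJobs", [])
    return Counter(job.get("status", "Unknown") for job in jobs)


if __name__ == "__main__":
    for status, count in sorted(queued_job_summary().items()):
        print(f"{status}: {count}")
```

A Queued count that shrinks over time while the Running count stays bounded is consistent with the backlog clearing in order, as described above.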
We will continue to monitor worker health and job throughput for the next few hours until all metrics consistently reflect normal, healthy performance.
We are continuing to monitor the progress of the rolling restarts across the impacted instances and expect to complete this within the next 30 minutes. Early results show improvement: logs confirm an increasing number of jobs completing successfully, and validation workloads are now running to completion on the restarted worker nodes.
We'll provide an update in 30 minutes or sooner if additional information becomes available.
Investigations have highlighted that the impact radius is broader than initially understood. We have updated the posting to reflect the additional instances experiencing impact, and customers in these instances will receive communications related to this incident going forward.
Upon further investigation, we have determined that the start time of impact differs from what was initially understood. We have revised the start time on the Trust post to more accurately reflect when customers may have begun to experience impact. We apologize for any confusion this may have caused.
We have completed the rollback on a test instance; however, post-rollback validation shows that jobs are still not progressing, indicating that this action did not resolve the underlying issue. We are now analysing system metrics and job-processing telemetry. Early signals highlight an increase in “workers are unhealthy or fail to accept job” errors, which aligns with the observed job failures.
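For context on the telemetry analysis mentioned above, an error spike like this is typically quantified by bucketing occurrences over time. The sketch below is a minimal, assumption-laden example: it presumes worker logs are available as plain text files with an ISO-8601 timestamp at the start of each line and that the quoted error string appears verbatim; the log location is hypothetical.

```python
from collections import Counter
from datetime import datetime
from pathlib import Path

# Assumed log location and line format; real telemetry pipelines differ.
LOG_DIR = Path("/var/log/crma-workers")
ERROR_MARKER = "workers are unhealthy or fail to accept job"


def error_counts_per_minute() -> Counter:
    """Count occurrences of the error marker per minute, assuming each
    matching log line starts with a timestamp like 2024-01-01T16:34:00Z."""
    counts = Counter()
    for log_file in LOG_DIR.glob("*.log"):
        for line in log_file.read_text(errors="replace").splitlines():
            if ERROR_MARKER not in line:
                continue
            try:
                ts = datetime.fromisoformat(line.split()[0].replace("Z", "+00:00"))
            except (ValueError, IndexError):
                continue  # skip lines whose timestamp cannot be parsed
            counts[ts.strftime("%Y-%m-%d %H:%M")] += 1
    return counts


if __name__ == "__main__":
    for minute, count in sorted(error_counts_per_minute().items()):
        print(minute, count)
```

A per-minute series like this makes it straightforward to see whether the error rate rises or falls after each rollback or restart.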
The team is preparing an additional rollback to an earlier version. We'll provide an update in 30 minutes or sooner if additional information becomes available.
We're investigating reports that data synchronization jobs for CRM Analytics (CRMA) are not processing as expected. As a result, a subset of customers may experience failures or delays in CRMA data sync operations and recipe jobs, which may appear stuck in the queue or fail to complete.
We have identified a recent deployment as the potential trigger for the issue. We are performing a rollback on one of the impacted instances in an effort to resolve the issue.
We'll provide an update in 30 minutes or sooner if additional information becomes available.