Mitigated - Azure Resource Manager - Impact to multiple services in China North 3
Resolved
Minor
Started September 05, 2024 - Lasted about 14 hours
Outage Details
What happened?

Between around 08:00 and 21:30 CST on 05 September 2024, an Azure Resource Manager (ARM) platform issue impacted the following services in the China North 3 region:

Azure Databricks - Customers may have encountered errors or failures while submitting job requests.
Azure Data Factory - Customers may have experienced Internal Server Errors while running Data Flow Activity or acquiring Data Flow Debug Sessions.
Azure Database for MySQL - Customers making create/update/delete operations to Azure Database for MySQL in China North 3 would not have seen requests completing as expected.
Azure Kubernetes Service - Customers making Cluster Management operations, such as but not limited to scaling out, updates, and creating/deleting clusters, would not have seen requests completing as expected.
Microsoft Purview - Customers making create/update/delete operations to Microsoft Purview Account resources in China North 3 would not have seen requests completing as expected.
Event Hub/Service Bus - Customers attempting to perform read or write operations may have seen slow response times.
Other services leveraging Azure Resource Manager - Customers may have experienced service management operation failures if using services inside of Resource Groups in China North 3.

This has now been mitigated.

What went wrong and why?

We determined that a latent code defect in Azure Resource Manager caused a critical component to crash and left it unable to process asynchronous jobs for customers, resulting in delays and failures for deployments, resource deletion, and other long-running control plane operations.

Services which depend upon Azure Resource Manager to provide internal functionality, such as Azure Databricks, Azure Data Factory, Azure Kubernetes Service, and others listed above, were impacted by these delays and failures.

How did we respond?

08:00 CST on 05 September 2024 - Customer impact began.
11:00 CST on 05 September 2024 - We attempted to increase the worker instance count to mitigate the incident, but were not successful.
13:05 CST on 05 September 2024 - We continued to investigate the factors leading to impact and to look for mitigations.
16:02 CST on 05 September 2024 - We identified a correlation between specific release versions and increased error processing latency, and determined a safe rollback version.
16:09 CST on 05 September 2024 - We began rolling back to the previous known good build for one component in China East 3 to validate our mitigation approach.
16:55 CST on 05 September 2024 - We started to see indications that mitigation was working as intended.
17:36 CST on 05 September 2024 - We confirmed mitigation for China East 3 and China North 3, and began rolling back to the safe version in other regions using a safe deployment process, which took several hours.
21:30 CST on 05 September 2024 - The rollback was completed and, after further monitoring, we are confident that service functionality has been fully restored and customer impact mitigated.

What happens next?

Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers.

To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts, which can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts .

For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs .

The impact times above represent the full incident duration, so they are not specific to any individual customer. Actual impact to service availability may vary between customers and resources. For guidance on implementing monitoring to understand granular impact, refer to https://aka.ms/AzPIR/Monitoring .

Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness .
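
For readers who want to relate the response timeline to the stated incident duration, the elapsed time at each milestone can be worked out with a short script. This is only an illustrative sketch: the timestamps are copied from the timeline above, while the shortened milestone labels and the helper names (`parse`, `elapsed_since_start`, `fmt`) are assumptions made for this example and are not part of any Azure tooling.

```python
from datetime import datetime, timedelta

# Milestones copied from the response timeline (all times CST, 05 September 2024).
# The labels are shortened paraphrases chosen for this example.
TIMELINE = [
    ("08:00", "Customer impact began"),
    ("11:00", "Worker instance count increase attempted"),
    ("13:05", "Investigation continued"),
    ("16:02", "Safe rollback version determined"),
    ("16:09", "Rollback of one component began"),
    ("16:55", "First indications of mitigation"),
    ("17:36", "Mitigation confirmed, wider rollback began"),
    ("21:30", "Rollback completed"),
]


def parse(hhmm: str) -> datetime:
    """Turn an HH:MM string into a datetime on the incident date."""
    return datetime.strptime(f"2024-09-05 {hhmm}", "%Y-%m-%d %H:%M")


def elapsed_since_start(timeline):
    """Return (label, time elapsed since the first milestone) pairs."""
    start = parse(timeline[0][0])
    return [(label, parse(hhmm) - start) for hhmm, label in timeline]


def fmt(delta: timedelta) -> str:
    """Format a timedelta as hours and minutes."""
    minutes = int(delta.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


if __name__ == "__main__":
    for label, delta in elapsed_since_start(TIMELINE):
        print(f"+{fmt(delta)}  {label}")
```

The last milestone falls 13 hours 30 minutes after impact began, which is consistent with the "about 14 hours" shown in the page header.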