Outage in Microsoft Azure

Mitigated - Azure Resource Manager - Impact to multiple services in China North 3

Resolved Minor

September 04, 2024 - Started almost 2 years ago - Lasted about 14 hours

Incident Report

What happened?

Between around 08:00 and 21:30 CST on 05 September 2024, an Azure Resource Manager (ARM) platform issue impacted the following services in the China North 3 region:

Azure Databricks - Customers may have encountered errors or failures while submitting job requests.

Azure Data Factory - Customers may have experienced Internal Server Errors while running Data Flow Activity or acquiring Data Flow Debug Sessions.

Azure Database for MySQL - Customers making create/update/delete operations to Azure Database for MySQL in China North 3 would not have seen requests complete as expected.

Azure Kubernetes Service - Customers performing cluster management operations (including, but not limited to, scaling out, updates, and creating or deleting clusters) would not have seen requests complete as expected.

Microsoft Purview - Customers making create/update/delete operations to Microsoft Purview accounts in China North 3 would not have seen requests complete as expected.

Event Hub/Service Bus - Customers attempting to perform read or write operations may have seen slow response times.

Other services leveraging Azure Resource Manager - Customers may have experienced service management operation failures if using services inside of Resource Groups in China North 3.

This has now been mitigated.

What went wrong and why?

We determined that a latent code defect in Azure Resource Manager caused a critical component to crash and left it unable to process asynchronous jobs for customers. This resulted in delays and failures for deployments, resource deletion, and other long-running control plane operations.

Services that depend on Azure Resource Manager for internal functionality, such as Azure Databricks, Azure Data Factory, Azure Kubernetes Service, and the others listed above, were impacted by these delays and failures.

How did we respond?

08:00 CST on 05 September 2024 - Customer impact began.

11:00 CST on 05 September 2024 - We attempted to increase the worker instance count to mitigate the incident, but were not successful.

13:05 CST on 05 September 2024 - We continued to investigate the factors contributing to the impact and to look for a mitigation.

16:02 CST on 05 September 2024 - We identified a correlation between specific release versions and increased error processing latency, and determined a safe rollback version.

16:09 CST on 05 September 2024 - We began rolling back to the previous known good build for one component in China East 3 to validate our mitigation approach.

16:55 CST on 05 September 2024 - We started to see indications that mitigation was working as intended.

17:36 CST on 05 September 2024 - We confirmed mitigation for China East 3 and China North 3, and began rolling back to the safe version in other regions using a safe deployment process, which took several hours.

21:30 CST on 05 September 2024 - The rollback was completed and, after further monitoring, we are confident that service functionality has been fully restored and customer impact mitigated.

What happens next?

Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers.

To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts

For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs

The impact times above represent the full incident duration, so they are not specific to any individual customer. Actual impact to service availability may vary between customers and resources. For guidance on implementing monitoring to understand granular impact, refer to https://aka.ms/AzPIR/Monitoring

Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness
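
As a rough cross-check of the response timeline in this report, the following Python sketch computes how long after the start of customer impact each step occurred. The timestamps are copied from the report; the event labels are shortened paraphrases, and the names TIMELINE, elapsed_since_start, and format_delta are illustrative only and not part of any Azure tooling.

```python
from datetime import datetime, timedelta

# Timestamps copied from the incident timeline (all CST, 05 September 2024).
# The labels are shortened paraphrases of the report's entries.
TIMELINE = [
    ("08:00", "Customer impact began"),
    ("11:00", "Worker instance count increase attempted"),
    ("13:05", "Investigation continued"),
    ("16:02", "Safe rollback version determined"),
    ("16:09", "Rollback began in China East 3"),
    ("16:55", "First indications that mitigation was working"),
    ("17:36", "Mitigation confirmed; rollback began in other regions"),
    ("21:30", "Rollback completed"),
]


def elapsed_since_start(timeline, day="2024-09-05"):
    """Return (offset_from_first_entry, label) pairs for a same-day timeline."""
    parsed = [
        (datetime.strptime(f"{day} {clock}", "%Y-%m-%d %H:%M"), label)
        for clock, label in timeline
    ]
    start = parsed[0][0]
    return [(stamp - start, label) for stamp, label in parsed]


def format_delta(delta):
    """Format a timedelta as hours and minutes, e.g. 8h02m."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h{minutes:02d}m"


if __name__ == "__main__":
    for delta, label in elapsed_since_start(TIMELINE):
        print(f"+{format_delta(delta)}  {label}")
```

On these timestamps, the safe rollback version was determined 8 hours 2 minutes after impact began, and the rollback completed 13 hours 30 minutes after impact began.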

Components affected

*China Non-Regional
China North 3
Azure Databricks
Azure Data Factory
Microsoft Purview
