Resource creation in Australia East
Resolved
Minor
November 03, 2025 - Started about 18 hours ago - Lasted about 13 hours
Incident Report
Impact Statement
Starting at 16:05 UTC on 3 November 2025, we began investigating an issue in the Australia East region that is affecting the ability to create new Virtual Machines (VMs). This issue has downstream impact on services that rely on VM creation, including Azure Backup, Azure Synapse Analytics, Azure Virtual Machines, and Azure Storage. Customers attempting disk operations, including renaming or modifying resources, may also experience failures. Azure Databricks customers may experience intermittent delays or failures when launching or upsizing all-purpose compute resources and submitting jobs. Existing VMs and running resources are not impacted.
Current Status
In this context, a pool manager is part of the orchestration layer responsible for resource creation. The issue first surfaced with one pool manager in the region, where we observed persistent allocation errors. Similar failures have since been detected across multiple pool managers. While several pool managers remain healthy and some allocations continue to succeed, retries are likely to fail due to the current load distribution design.
Initial mitigation steps included restarting affected pool managers, purging cache, and moving partitions to alternate nodes. These actions did not resolve the issue. We also evaluated recent deployments and have ruled them out as a contributing factor. The service that manages these create requests has built-in resiliency to protect against this class of failure. However, during the investigation, we discovered a recent bug causing a data format inconsistency, which introduced an incompatibility between datasets and removed a layer of resiliency. We are evaluating options to deploy a hotfix to mitigate the bug, in parallel with continued analysis to determine the trigger event.
We are now focusing on canceling stuck transactions that are contributing to request queue saturation and throttling, with the goal of freeing up resources and accelerating recovery. Part of this work involves ensuring internal services persist resource creation requests, which will help reduce pressure on the pool managers.
While we do not yet have an estimated time of recovery (ETA), we will provide an update within 60 minutes, or sooner as new information becomes available.