Mitigated - Managed Identity - Australia East, Australia Southeast
Resolved
Minor
July 12, 2024 - Started 4 months ago
- Lasted about 7 hours
Outage Details
What happened?
Between 00:55 UTC and 06:28 UTC on 12 Jul 2024, you were identified as one of a subset of customers using Managed Service Identity (MSI) for Azure resources who may have experienced failures when requesting tokens for managed identities associated with Virtual Machines or Virtual Machine Scale Sets, Windows Virtual Desktop, Azure Databricks, and any other Azure service that relies on MSI.

What do we know so far?
We identified that a configuration change introduced in a recent deployment caused this issue. We rolled back the change and restarted on the last known good build.

How did we respond?
00:55 UTC on 12 July 2024 – Customer impact began.
01:10 UTC on 12 July 2024 – Service monitoring detected decreasing availability on some storage scale units in the region.
01:14 UTC on 12 July 2024 – Our team engaged and started the investigation.
02:58 UTC on 12 July 2024 – The recent configuration change was identified, and we started a deployment to roll it back.
06:05 UTC on 12 July 2024 – We completed the rollback on one Availability Zone (AZ), verified that telemetry looked good on that AZ, and started on the other AZs. We also failed over the other availability zones where we were seeing signs of impact.
06:23 UTC on 12 July 2024 – Service started to recover, and customers should have started seeing recovery from this point. We continued applying recovery operations and monitoring recovery.
06:28 UTC on 12 July 2024 – Rollback completed, and the service showed full recovery from the platform side. (Customers who are not fully mitigated may benefit from recycling their service.)

What happens next?
Our team will be completing an internal retrospective to understand the incident in more detail. We will publish a Preliminary Post Incident Review (PIR) within approximately 72 hours, to share more details on what happened and how we responded.
After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings.

The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources. For guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring

To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts

For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs

Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness
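
The timestamps in the response timeline can be turned into elapsed intervals, which makes it easier to compare this incident against your own monitoring data. The following is a minimal sketch: the timestamps are copied from the timeline, while the event labels and helper names (`TIMELINE`, `minutes_between`, `fmt`) are illustrative choices for this example, not official terminology. The result covers only the platform-side window from 00:55 to 06:28 UTC, so it need not match the "about 7 hours" shown in the page header or the impact any individual customer saw.

```python
from datetime import datetime, timezone

# Timestamps copied from the incident timeline (all UTC, 12 July 2024).
# The keys are shorthand labels chosen for this sketch.
TIMELINE = {
    "impact_began": "00:55",
    "detected": "01:10",
    "team_engaged": "01:14",
    "rollback_started": "02:58",
    "first_az_rolled_back": "06:05",
    "recovery_started": "06:23",
    "rollback_completed": "06:28",
}


def at(hhmm: str) -> datetime:
    """Convert an HH:MM string into a UTC datetime on the incident date."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2024, 7, 12, hour, minute, tzinfo=timezone.utc)


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two labelled timeline events."""
    delta = at(TIMELINE[end]) - at(TIMELINE[start])
    return int(delta.total_seconds() // 60)


def fmt(minutes: int) -> str:
    """Render a minute count as hours and minutes."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}h {mins:02d}m"


time_to_detect = minutes_between("impact_began", "detected")
time_to_engage = minutes_between("impact_began", "team_engaged")
time_to_rollback_start = minutes_between("impact_began", "rollback_started")
total_impact = minutes_between("impact_began", "rollback_completed")

print("Time to detect:        ", fmt(time_to_detect))
print("Time to engage:        ", fmt(time_to_engage))
print("Time to rollback start:", fmt(time_to_rollback_start))
print("Total impact window:   ", fmt(total_impact))
```

Keeping the events in a single mapping means a later correction to the timeline, for example in the Final PIR, only needs to be made in one place.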