Outage in Azure

Mitigated - Managed Identity - Australia East, Australia Southeast

Resolved Minor
July 12, 2024 - Started 5 months ago - Lasted about 7 hours

Outage Details

What happened?

Between 00:55 UTC and 06:28 UTC on 12 Jul 2024, you were identified among a subset of customers using Managed Service Identity (MSI) for Azure resources who may have experienced failures when requesting tokens for managed identities associated with Virtual Machines or Virtual Machine Scale Sets, Windows Virtual Desktop, Azure Databricks, and any other Azure service that relies on MSI.

What do we know so far?

We identified that a configuration change introduced in a recent deployment caused this issue. We had to roll back the new change and restart on the last known good build.

How did we respond?

00:55 UTC on 12 July 2024 – Customer impact began.
01:10 UTC on 12 July 2024 – Service monitoring detected decreasing availability on some storage scale units in the region.
01:14 UTC on 12 July 2024 – Our team engaged and started the investigation.
02:58 UTC on 12 July 2024 – The recent configuration change was identified, and we started a deployment to roll it back.
06:05 UTC on 12 July 2024 – We completed the rollback in one Availability Zone (AZ), verified that telemetry looked healthy in that AZ, and started on the other AZs. We also failed over the other availability zones where we were seeing signs of impact.
06:23 UTC on 12 July 2024 – Service started to recover, and customers should have started seeing recovery at this point. We continued to apply recovery operations and monitor recovery.
06:28 UTC on 12 July 2024 – Rollback completed, and the service showed full recovery from the platform side. (Customers who are not fully mitigated may benefit from recycling the affected service.)

What happens next?

Our team will be completing an internal retrospective to understand the incident in more detail. We will publish a Preliminary Post Incident Review (PIR) within approximately 72 hours, to share more details on what happened and how we responded.
After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings.

The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources. For guidance on implementing monitoring to understand granular impact, see https://aka.ms/AzPIR/Monitoring

To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts

For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs

Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness
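
The intervals implied by the timeline can be worked out directly from the published timestamps. The following is a minimal sketch of that arithmetic, not part of the Azure notice: the milestone labels (`impact_began`, `cause_identified`, and so on) and the helper names are hypothetical shorthand chosen for illustration, and only the times themselves come from the incident report.

```python
from datetime import datetime, timezone

# Timestamps copied from the incident timeline (all UTC, 12 July 2024).
# The keys are illustrative labels, not official Azure terminology.
TIMELINE = {
    "impact_began": "00:55",
    "detected": "01:10",
    "team_engaged": "01:14",
    "cause_identified": "02:58",
    "first_az_rolled_back": "06:05",
    "recovery_started": "06:23",
    "rollback_completed": "06:28",
}


def at(label: str) -> datetime:
    """Return the timezone-aware datetime for a timeline milestone."""
    hour, minute = map(int, TIMELINE[label].split(":"))
    return datetime(2024, 7, 12, hour, minute, tzinfo=timezone.utc)


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two milestones."""
    return int((at(end) - at(start)).total_seconds() // 60)


def fmt(minutes: int) -> str:
    """Format a minute count as 'Hh MMm'."""
    hours, rest = divmod(minutes, 60)
    return f"{hours}h {rest:02d}m"


if __name__ == "__main__":
    print("Time to detect:     ", fmt(minutes_between("impact_began", "detected")))
    print("Time to find cause: ", fmt(minutes_between("impact_began", "cause_identified")))
    print("Rollback duration:  ", fmt(minutes_between("cause_identified", "rollback_completed")))
    print("Total impact window:", fmt(minutes_between("impact_began", "rollback_completed")))
```

On these figures, monitoring detected the problem 15 minutes after impact began, the configuration change was identified after 2 hours 3 minutes, and the full platform-side impact window was 5 hours 33 minutes. As the notice states, these are incident-wide times, so an individual customer's impact may have been shorter.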
