Outage in Azure

Service Management Operation Errors Across Azure Services in East US 2

Resolved Minor
April 08, 2022 - Lasted about 7 hours

Outage Details

Impact Statement: Starting at approximately 12:25 UTC on 08 Apr 2022, customers running services in the East US 2 region may be experiencing service management errors, delays, and/or timeouts. We are investigating an underlying issue causing GET and PUT errors impacting the Azure portal itself, as well as services including Azure Virtual Machines (VMs), Virtual Machine Scale Sets (VMSS), and additional downstream services. Customers may see errors including “The network connectivity issue encountered for Microsoft.Compute cannot fulfill the request.” Finally, for some downstream services that have auto-scale enabled, this service management issue may cause data plane impact.

Current Status: The series of mitigation efforts described in earlier incident updates is still making progress in improving error rates. Internal services continue to report significant improvements in the proportion of requests that are succeeding. While mitigation is still being applied, the investigation into what is causing this incident has determined that the Compute Resource Provider (CRP) gateways in East US 2 are being overwhelmed with requests for compute resources.

Mitigation workstreams continue to focus on how to prevent CRP gateways from becoming unhealthy. While the combination of restarts, scaling out, and traffic reduction initially helped some gateway nodes to return to a healthy state, and stay healthy, other gateway nodes are routinely getting into a condition of being overloaded by request volume. To resolve this, there are two mitigation workstreams being run in parallel – in the short term, we are investigating automation to restart gateway nodes on a regular basis to avoid getting into an unhealthy state. In the long term, we are investigating a CRP gateway hotfix that will obviate the need for restarts and prevent each node from becoming unhealthy. Both these work streams are making good progress.

At this stage, we believe that we have eliminated impact to most of the downstream services and are working with each team to confirm mitigation. We are also working with the last couple of services to mitigate them.

Although we believe that external customers and partners are continuing to see improvements, as mentioned we are not declaring mitigation until error rates return to pre-incident levels. While mitigation efforts continue, we will continue to provide hourly updates to ensure that all impacted customers and partners are informed of progress. The next update will be provided by 03:00 UTC, April 9th, or as soon as we have an update to share.
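
The mitigation criterion described above (error rates returning to pre-incident levels) could be sketched roughly as follows. This is an illustrative sketch only, not Azure's actual tooling: the class name `MitigationCheck`, the baseline value, the tolerance, and the window size are all hypothetical assumptions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MitigationCheck:
    """Compare a rolling error rate against an assumed pre-incident baseline."""

    baseline_error_rate: float  # hypothetical pre-incident error rate, e.g. 0.01
    tolerance: float = 0.0      # hypothetical margin allowed above the baseline
    window: int = 1000          # hypothetical number of recent requests to track
    _outcomes: deque = field(init=False, repr=False)

    def __post_init__(self) -> None:
        # Only the most recent `window` outcomes are kept; older ones drop off.
        self._outcomes = deque(maxlen=self.window)

    def record(self, succeeded: bool) -> None:
        """Record the outcome of one service management request."""
        self._outcomes.append(succeeded)

    def error_rate(self) -> float:
        """Fraction of failed requests among the tracked outcomes."""
        if not self._outcomes:
            return 0.0
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes)

    def is_mitigated(self) -> bool:
        """True only with a full window whose error rate is at or below baseline."""
        window_full = len(self._outcomes) == self.window
        threshold = self.baseline_error_rate + self.tolerance
        return window_full and self.error_rate() <= threshold
```

Requiring a full window before reporting mitigation is a design choice in this sketch: it avoids declaring recovery on the strength of a handful of successful requests while some gateway nodes are still cycling between healthy and overloaded states.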
