The issue has been mitigated. Microsoft has shared the following preliminary post-incident review for this incident (Tracking ID: KTY1-HW8):
This is our Preliminary PIR that we endeavor to publish within 3 days of incident mitigation to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a "Final" PIR with additional details/learnings.
What happened?
Between 11:45 and 13:58 UTC on 30 July 2024, a subset of customers experienced intermittent connection errors, timeouts, or latency spikes while connecting to Microsoft services that leverage Azure Front Door (AFD) and Azure Content Delivery Network (CDN). The impact extended to downstream services that rely on AFD and Azure CDN – including the Azure portal, and a subset of Microsoft 365 and Microsoft Purview services. From 13:58 to 19:43 UTC, a smaller set of customers continued to observe a low rate of connection timeouts.
What went wrong and why?
Azure Front Door (AFD) is Microsoft's scalable platform for web acceleration, global load balancing, and content delivery, operating in nearly 200 locations worldwide – including datacenters within Azure regions, and edge sites. AFD and Azure CDN are built with platform defenses against network and application layer Distributed Denial-of-Service (DDoS) attacks. In addition, these services rely on the Azure network DDoS protection service to mitigate attacks at the network layer. You can read more about the protection mechanisms at https://learn.microsoft.com/azure/ddos-protection/ddos-protection-overview and https://learn.microsoft.com/azure/frontdoor/front-door-ddos.
Between 10:15 and 10:45 UTC, a volumetric distributed TCP SYN flood DDoS attack occurred at multiple Azure Front Door and CDN sites. This attack was automatically mitigated by the Azure Network DDoS protection service and had minimal customer impact.
At 11:45 UTC, as the Network DDoS protection service was disengaging and resuming default traffic routing to the Azure Front Door service, the network routes could not be updated within one specific site in Europe. This happened because the Network DDoS control plane could not reach that site, due to a local power outage. Consequently, traffic inside Europe continued to be forwarded to AFD through our DDoS protection services, instead of returning directly to AFD. This event in isolation would not have caused any impact.
However, an unrelated latent network configuration issue caused traffic from outside Europe to be routed to the DDoS protection system within Europe. This led to localized congestion, which caused customers to experience high latency and connectivity failures across multiple regions. The vast majority of the impact was mitigated by 13:58 UTC, around two hours later when we resolved the routing issue. A small subset of customers without retry logic in their application may have experienced residual effects until 19:43 UTC.
How did we respond?
Our internal monitors detected impact on our Europe edge sites at 11:47 UTC, immediately prompting a series of investigations. Once we identified that the network routes could not be updated within that one specific site, we updated the DDoS protection configuration system to avoid traffic congestion. These changes successfully mitigated most of the impact by 13:58 UTC. Availability returned to pre-incident levels by 19:43 UTC once the default network policies were fully restored.
How we are making incidents like this less likely or less impactful
- We have already added the missing configuration on network devices that resulted in traffic redirection, to ensure a DDoS mitigation issue in one geography cannot spread to other geographies in the Europe region. (Completed)
- We are enhancing our existing validation and monitoring in the Azure network, to detect invalid configurations. (Estimated completion: November 2024)
- We are improving our monitoring to detect cases where our DDoS protection service is unreachable from the control plane but is still serving traffic. (Estimated completion: November 2024)
How can customers make incidents like this less impactful?
- For customers of Azure Front Door/Azure CDN products, implementing retry logic in your client-side applications can help handle temporary failures when connecting to a service or network resource during mitigations of network layer DDoS attacks. For more information, refer to our recommended error-handling design patterns: https://learn.microsoft.com/azure/well-architected/resiliency/app-design-error-handling#implement-retry-logic.
- Applications that use exponential backoff in their retry strategy may have seen more success, since an immediate retry during an interval of high packet loss would likely have also encountered high packet loss, whereas a retry conducted during a period of lower loss would likely have succeeded. For more details on retry patterns, refer to https://learn.microsoft.com/azure/architecture/patterns/retry. A minimal illustrative sketch follows this list.
- More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency.
- Finally, ensure that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These alerts can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
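To make the retry guidance above concrete, here is a minimal sketch in Python of client-side retry logic with exponential backoff and jitter. It is illustrative only, not a Microsoft reference implementation: the endpoint URL is a hypothetical placeholder, the attempt count and delay values are assumptions to tune for your workload, and the requests library simply stands in for whatever HTTP client your application already uses.

import random
import time

import requests  # assumed HTTP client; any client with configurable timeouts works

# Hypothetical placeholder for an endpoint fronted by Azure Front Door / Azure CDN.
ENDPOINT = "https://example-frontdoor.azurefd.net/api/health"

# HTTP status codes commonly treated as transient and therefore worth retrying.
TRANSIENT_STATUS = {408, 429, 500, 502, 503, 504}

def get_with_retries(url, max_attempts=5, base_delay=1.0, max_delay=30.0):
    """GET url, retrying transient failures with exponential backoff plus jitter."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code not in TRANSIENT_STATUS:
                return response  # success, or a non-transient error to surface to the caller
            last_error = None
        except requests.exceptions.RequestException as error:
            # Connection errors and timeouts, like those seen during this incident,
            # are treated as transient.
            last_error = error
        if attempt == max_attempts - 1:
            break
        # Exponential backoff with full jitter, capped at max_delay seconds.
        time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
    if last_error is not None:
        raise last_error
    return response  # last attempt still returned a transient status; caller decides what to do

if __name__ == "__main__":
    print(get_with_retries(ENDPOINT).status_code)

Spacing retries out and adding jitter avoids having many clients retry at the same instant, which matters during incidents like this one where capacity is temporarily constrained.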
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/KTY1-HW8
According to Microsoft, networking configuration changes have been implemented, and telemetry shows improvement in service availability.
Microsoft updated the status page with the following details:
We have implemented networking configuration changes and have performed failovers to alternate networking paths to provide relief. Monitoring telemetry shows improvement in service availability from approximately 14:10 UTC onwards, and we are continuing to monitor to ensure full recovery.
Microsoft updated its status page to report that it was investigating issues connecting to Microsoft services globally. Customers may experience timeouts connecting to Azure services.
Please refer to https://azure.status.microsoft/en-us/status for the latest update.