Outage in Elastic Cloud

Some hosts in Azure Australia East zone 3 are unreachable

Resolved Minor
August 30, 2023 - Started over 1 year ago - Lasted about 15 hours

Outage Details

Some hosts in Azure Australia East zone 3 are unreachable. We are observing degraded performance for clusters that have instances allocated in this AZ. We're currently investigating the issue and will provide a further update within the next 30 minutes.
Latest Updates (sorted most recent first)
RESOLVED over 1 year ago - at 08/31/2023 02:24AM

This incident has been resolved.

MONITORING over 1 year ago - at 08/31/2023 12:10AM

Update from Microsoft at 23:45 UTC (more details https://azure.status.microsoft/en-us/status):
Impact Statement: Starting at approximately 08:30 UTC on 30 August 2023, a utility power surge in the Australia East region tripped a subset of the cooling units offline in one datacenter, within one of the Availability Zones. While working to restore cooling, temperatures in the datacenter increased so we proactively powered down a small subset of selected compute and storage scale units, to avoid damage to hardware. Multiple downstream services were impacted, with targeted communications being distributed via Azure Service Health.

Current Status: Having successfully recovered 99% of storage services and 99% of impacted Virtual Machines, we are actively investigating individual downstream services to confirm their recovery status and mitigate remaining issues. At this stage, we believe most downstream services that are still experiencing impact are the result of dependencies on one of three services with investigations ongoing. Firstly, our Storage team are making progress with the final remaining storage scale unit that is still experiencing isolated issues - we have engaged our onsite datacenter team to support replacing drives as needed. Secondly, our SQL team are working to mitigate one final cluster that is experiencing a capacity issue, due to several Service Fabric nodes that have not fully recovered - we are rebalancing capacity to mitigate. Finally, our Cosmos DB team continue to investigate why some services have not yet recovered fully. While the majority of customers and the majority of services are already mitigated, further updates on these remaining investigations will be provided in 60 minutes, or as events warrant.

On our side, 100% of the affected hosts are back up. We'll continue monitoring the situation and provide an update in the next 2 hours.

IDENTIFIED over 1 year ago - at 08/30/2023 10:44PM

Update from Microsoft at 22:28 UTC (more details https://azure.status.microsoft/en-us/status):
Impact Statement: Starting at approximately 08:30 UTC on 30 August 2023, a utility power surge in the Australia East region tripped a subset of the cooling units offline in one datacenter, within one of the Availability Zones. While working to restore cooling, temperatures in the datacenter increased so we proactively powered down a small subset of selected compute and storage scale units, to avoid damage to hardware. Multiple downstream services were impacted, with targeted communications being distributed via Azure Service Health. Impact to services is limited to Australia East, except for Azure Kubernetes Service (AKS) which has impact in both Australia East and Australia Southeast due to a dependency in the former.

Current Status: With 99% of storage services and 99% of impacted Virtual Machines back online and healthy, we are now supporting individual downstream services to confirm their recovery status. We are aware of one specific storage scale unit that is still experiencing isolated issues, but the majority of customers and services should already be recovered. Beyond this known storage issue, we are investigating which services are still not fully mitigated and why. Further updates will be provided in 60 minutes, or as events warrant.

On our side, 95% of the affected hosts are back up. There are still a handful of affected deployments that are running in a degraded state. Our teams are continuing efforts to restore service where possible. Next update will be provided in 2 hours or as soon as we have more to share.

IDENTIFIED over 1 year ago - at 08/30/2023 08:49PM

Update from Microsoft at 20:03 UTC (more details https://azure.status.microsoft/en-us/status):

We are in the final phases of restoring core services, and expect that the vast majority of remaining impacted services should be back online in the next hour. After restoring power and stabilizing temperatures, all network infrastructure and 99% of storage services are back online. All premium disk storage has fully recovered, we continue to work towards mitigating the final remaining storage devices. The vast majority of underlying compute services are back online, with more than 99% of Virtual Machines (VMs) that were impacted now back online and healthy.
While many customers and services have already recovered, we are now prioritizing our investigations with the remaining downstream impacted services. We expect that these remaining services should be back online and healthy within the next hour. Further updates will be provided in 60 minutes, or as events warrant.

On our side, 95% of the affected hosts are back up. There are still a handful of affected deployments that are running in a degraded state. Our teams are continuing efforts to restore service where possible. Next update will be provided in 2 hours or as soon as we have more to share.

IDENTIFIED over 1 year ago - at 08/30/2023 06:47PM

Update from Microsoft (more details https://azure.status.microsoft/en-us/status):

Mitigation efforts are continuing, we have made significant progress in restoring core services but we are not able to provide a mitigation ETA at this time. Power to all hardware has been restored, temperatures in the impacted datacenter have stabilized. All network infrastructure is back online. The majority of storage devices are back online, we are validating issues with a few remaining storage nodes. The majority of underlying compute services are back online, with more than 75% of Virtual Machines that were impacted back online and healthy. While many customers of these core services have seen signs of recovery, we continue to work with downstream impacted services to ensure that they are coming back online as expected.

On our side, we see that 30% of the affected hosts are back up. Our teams are continuing efforts to restore service where possible. Next update will be provided in 2 hours or as soon as we have more to share.

IDENTIFIED over 1 year ago - at 08/30/2023 05:36PM

Update from Microsoft (more details https://azure.status.microsoft/en-us/status):

Azure have indicated that the vast majority of network infrastructure is back online, and storage device recovery has started. Due to the nature of this issue, storage scale units are expected to require additional recovery efforts to ensure that all resources return in a consistent state. As service recovery continues, some customers may start experiencing signs of recovery.

On our side, all hosts affected by the outage are still down. Our teams are continuing efforts to restore service where possible. Next update will be provided in 2 hours or as soon as we have more to share.

IDENTIFIED over 1 year ago - at 08/30/2023 03:23PM

Failed hosts are limited to Australia East zone 3. Kibana and Enterprise Search instances in this zone have been restored to zone 1 or 2 to mitigate the impact on Elasticsearch deployments that have instances in zone 1 or 2. Next update will be provided in 2 hours or as soon as we have more to share.

IDENTIFIED over 1 year ago - at 08/30/2023 01:51PM

Azure has notified us that temperatures in the impacted datacenter have stabilized. Azure engineers have started working on the restoration of Compute and Storage. More details at https://azure.status.microsoft/en-us/status. Next update will be provided in 2 hours or as soon as we have more to share.

IDENTIFIED over 1 year ago - at 08/30/2023 12:44PM

Azure engineers are reporting cooling issues in azure-australiaeast and are actively working to mitigate the temperature issues in the datacenter. Currently there is no ETA to share for restoration of the impacted scale units.

IDENTIFIED over 1 year ago - at 08/30/2023 11:46AM

Azure has acknowledged an issue and is actively investigating.

INVESTIGATING over 1 year ago - at 08/30/2023 11:42AM

Some hosts in Azure Australia East zone 3 are unreachable. We are observing degraded performance for clusters that have instances allocated in this AZ.

We're currently investigating the issue and will provide a further update within the next 30 minutes.
