Outage in AWS

Connectivity Issues & API Errors

Resolved · Minor
June 10, 2021 - Started 1:18 PM PDT - Resolved by 6:54 PM PDT

Outage Details

1:24 PM PDT We are investigating connectivity issues for some EC2 instances in a single Availability Zone (euc1-az1) in the EU-CENTRAL-1 Region.

1:55 PM PDT We can confirm increased API error rates and latencies for the EC2 APIs and connectivity issues for instances within a single Availability Zone (euc1-az1) within the EU-CENTRAL-1 Region, caused by an increase in ambient temperature within a subsection of the affected Availability Zone. Other Availability Zones within the EU-CENTRAL-1 Region are not affected by the issue and we continue to work towards resolving the issue.
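
For callers hit by the elevated EC2 API error rates and latencies described above, one common mitigation is client-side retries with backoff. The sketch below is illustrative and not part of the AWS update; the region and retry counts are assumptions.

```python
# Minimal sketch: lean on boto3's built-in retry handling when EC2 API error
# rates and latencies are elevated. Retry counts and region are illustrative.
import boto3
from botocore.config import Config

retry_config = Config(
    region_name="eu-central-1",
    # "standard" mode retries throttling and transient server errors with backoff.
    retries={"max_attempts": 10, "mode": "standard"},
)

ec2 = boto3.client("ec2", config=retry_config)

# Calls made through this client are retried automatically on transient API errors.
response = ec2.describe_instance_status(IncludeAllInstances=True)
print(len(response["InstanceStatuses"]), "instance status records returned")
```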

2:36 PM PDT We continue to work on resolving the connectivity issues affecting EC2 instances in a single Availability Zone (euc1-az1) within the EU-CENTRAL-1 Region. Ambient temperatures within the affected subsection of the Availability Zone have begun to return to normal levels and we are working to recover the affected EC2 instances and networking devices within the affected Availability Zone. For the vast majority of affected EC2 instances, once network connectivity is restored, the instances will recover. A small number of EC2 instances may have power cycled as a result of the increased temperatures. While we continue to make progress in resolving the issue, we continue to recommend failing away to other Availability Zones in the region if you are able to do so.
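
As a first step toward the "fail away to other Availability Zones" recommendation, an operator would typically need to know which instances sit in the affected zone. The sketch below is not part of the AWS update; it assumes boto3 credentials for the affected account, and note that the AZ ID euc1-az1 maps to a different zone name in each account.

```python
# Minimal sketch: list running EC2 instances placed in the affected AZ (euc1-az1).
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

# Resolve the account-specific zone name for the affected AZ ID.
zones = ec2.describe_availability_zones(ZoneIds=["euc1-az1"])
zone_name = zones["AvailabilityZones"][0]["ZoneName"]

# Enumerate running instances in that zone; these are the candidates to fail
# away from (e.g. by shifting traffic to instances in other zones).
paginator = ec2.get_paginator("describe_instances")
pages = paginator.paginate(
    Filters=[
        {"Name": "availability-zone", "Values": [zone_name]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)
for page in pages:
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            print(instance["InstanceId"], instance["State"]["Name"])
```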

3:26 PM PDT We continue to work on resolving the connectivity issues affecting EC2 instances in a single Availability Zone (euc1-az1) within the EU-CENTRAL-1 Region. While temperatures continue to return to normal levels, engineers are still not able to enter the affected part of the Availability Zone. We believe that the environment will be safe for re-entry within the next 30 minutes, but are working on recovery remotely at this stage. Once we have access to the affected subsection of the Availability Zone, we will be working towards restoring network connectivity and any EC2 instances that were impaired by the high ambient temperatures. We continue to recommend failing away to other Availability Zones in the region if you are able to do so.

4:12 PM PDT We continue to work on resolving the connectivity issues affecting EC2 instances in a single Availability Zone (euc1-az1) within the EU-CENTRAL-1 Region. Unfortunately, we continue to wait for environmental conditions within the affected subsection of the Availability Zone to improve to the point where it is safe to send in engineers. Some EBS volumes are also experiencing degraded performance within the affected Availability Zone, but are expected to recover once network connectivity is restored. We are working to resolve the issue and will keep you updated on our progress.
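
To see which EBS volumes are reporting degraded performance, one option (not described in the AWS update, shown here as an illustration) is the EC2 volume status checks, which flag volumes as impaired.

```python
# Minimal sketch: print any EBS volumes whose status checks are not "ok".
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

paginator = ec2.get_paginator("describe_volume_status")
for page in paginator.paginate():
    for status in page["VolumeStatuses"]:
        state = status["VolumeStatus"]["Status"]  # "ok", "impaired", or "insufficient-data"
        if state != "ok":
            print(status["VolumeId"], status["AvailabilityZone"], state)
```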

4:33 PM PDT We have restored network connectivity within the affected Availability Zone and continue to work on full recovery of EC2 instances and EBS volumes. Customers should begin to see recovery at this stage.

5:19 PM PDT We have restored network connectivity within the affected Availability Zone in the EU-CENTRAL-1 Region. The vast majority of affected EC2 instances have now fully recovered, but we are still working through some EBS volumes that continue to experience degraded performance. The environmental conditions within the affected Availability Zone have now returned to normal levels. We will provide further details on the root cause in a subsequent post, but can confirm that there was no fire within the facility.

6:54 PM PDT Starting at 1:18 PM PDT we experienced connectivity issues to some EC2 instances, increased API error rates, and degraded performance for some EBS volumes within a single Availability Zone in the EU-CENTRAL-1 Region. At 4:26 PM PDT, network connectivity was restored and the majority of affected instances and EBS volumes began to recover. At 4:33 PM PDT, increased API error rates and latencies had also returned to normal levels. The issue has been resolved and the service is operating normally.

The root cause of this issue was a failure of a control system which disabled multiple air handlers in the affected Availability Zone. These air handlers move cool air to the servers and equipment, and when they were disabled, ambient temperatures began to rise. Servers and networking equipment in the affected Availability Zone began to power off when unsafe temperatures were reached. Unfortunately, because this issue impacted several redundant network switches, a larger number of EC2 instances in this single Availability Zone lost network connectivity. While our operators would normally have been able to restore cooling before impact, a fire suppression system activated inside a section of the affected Availability Zone. When this system activates, the data center is evacuated and sealed, and a chemical is dispersed to remove oxygen from the air to extinguish any fire. In order to recover the impacted instances and network equipment, we needed to wait until the fire department was able to inspect the facility. After the fire department determined that there was no fire in the data center and it was safe to return, the building needed to be re-oxygenated before it was safe for engineers to enter the facility and restore the affected networking gear and servers.

The fire suppression system that activated remains disabled. This system is designed to require smoke to activate and should not have discharged. This system will remain inactive until we are able to determine what triggered it improperly. In the meantime, alternate fire suppression measures are being used to protect the data center.

Once cooling was restored and the servers and network equipment were re-powered, affected instances recovered quickly. A very small number of remaining instances and volumes that were adversely affected by the increased ambient temperatures and loss of power remain unresolved. We continue to work to recover those last affected instances and volumes, and have opened notifications for the remaining impacted customers via the Personal Health Dashboard. For immediate recovery of those resources, we recommend replacing any remaining affected instances or volumes if possible.
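
The final update points remaining impacted customers to notifications on the Personal Health Dashboard. As an illustration (not part of the AWS update), those notifications can also be read programmatically via the AWS Health API; this assumes an account whose support plan includes Health API access, and the filter values below are assumptions.

```python
# Minimal sketch: list open EC2/EBS health events in eu-central-1 and the
# specific resources flagged on the Personal Health Dashboard.
import boto3

# The AWS Health API is served only from the us-east-1 endpoint.
health = boto3.client("health", region_name="us-east-1")

events = health.describe_events(
    filter={
        "regions": ["eu-central-1"],
        "services": ["EC2", "EBS"],
        "eventStatusCodes": ["open", "upcoming"],
    }
)

for event in events["events"]:
    print(event["arn"], event.get("eventTypeCode"))
    # List the affected resources (e.g. instance or volume IDs) for each event.
    entities = health.describe_affected_entities(filter={"eventArns": [event["arn"]]})
    for entity in entities["entities"]:
        print("  affected:", entity.get("entityValue"))
```
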
Components affected
Amazon EC2 (eu-central-1)
