Outage in Georgia Tech IT

Datacenter Power Outage

Resolved Major
October 02, 2024 - Started 3 months ago - Lasted 1 day


Outage Details

A short power outage impacted the Coda datacenter and surrounding areas at approximately 11:37 AM, causing PACE compute nodes to lose power. The PACE team is now investigating the situation and working to restore compute nodes on all PACE clusters (Phoenix, Hive, Firebird, ICE, and Buzzard). While a few compute nodes are on, no new jobs can start, and many running jobs will have been disrupted.
Access to login nodes and storage remains available due to backup power.
Components affected
Georgia Tech IT Academic Services
Latest Updates (sorted oldest to newest)
3 months ago - at 10/02/2024 04:14PM

A short power outage impacted the Coda datacenter and surrounding areas at approximately 11:37 AM, causing PACE compute nodes to lose power. The PACE team is now investigating the situation and working to restore compute nodes on all PACE clusters (Phoenix, Hive, Firebird, ICE, and Buzzard). While a few compute nodes are on, no new jobs can start, and many running jobs will have been disrupted.
Access to login nodes and storage remains available due to backup power.

3 months ago - at 10/02/2024 04:59PM

Dear PACE users,
A power outage (related to Georgia Power) impacted Tech Square, including the Coda datacenter. Due to a secondary failure of the UPS system, all PACE clusters (Phoenix, Hive, Firebird, ICE, and Buzzard) were impacted. Currently, most of the nodes on all clusters are powered off, and the schedulers on all clusters have been paused. The outage started at approximately 11:37 AM this morning. At the moment, no new jobs can start, and a large number of jobs that were running when the outage started have been terminated. Access to login nodes and storage remains available due to backup power. We are actively monitoring the situation and will keep you updated on the progress of the restoration of services.
Thank you for your patience,
The PACE team

3 months ago - at 10/02/2024 09:01PM

The ICE Cluster has been fully powered on, tested, and released for access in order to prioritize educational resources.

PACE and the OIT Datacenter teams are in the process of bringing up the machines that make up the research clusters. Due to the sudden nature of the outage, the usual recovery mechanisms for rapid power-up are not available, which is considerably slowing recovery efforts compared to previous outages. The PACE and OIT Datacenter teams are continuing to check, manually reset, power on, and subsequently test the hundreds of nodes that were left in a bad state by this power outage. So far, our tests have covered slightly over one fifth of our 2,100 machines. We expect to continue working to bring all machines online through the following day and will provide updates as we’re able to release clusters.

3 months ago - at 10/03/2024 01:05PM

PACE and the OIT Datacenter teams have brought up the vast majority of machines making up the PACE clusters. Roughly 100 nodes remain in a state requiring manual intervention out of our 2,100 machines. The PACE team is working to confirm hardware readiness and beginning to carry out test procedures prior to releasing the clusters. Further updates will be provided as clusters become available for use.

The PACE team is prioritizing the Phoenix and Hive clusters, followed by Firebird and Buzzard. We hope to have the full suite of systems released by mid-afternoon.
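For a rough sense of recovery progress, the machine counts quoted in the 10/02 09:01PM and 10/03 01:05PM updates can be turned into percentages. The sketch below is illustrative only: the function and variable names are hypothetical, "slightly over 1/5th" is treated as a lower bound of exactly one fifth, and "roughly 100" is taken as exactly 100.

```python
# Illustrative arithmetic only; the names below are assumptions,
# and the inputs are the approximate figures quoted in the updates.

TOTAL_MACHINES = 2100  # "our 2,100 machines"


def percent(part: float, total: float) -> float:
    """Return part as a percentage of total."""
    return 100.0 * part / total


# 10/02 09:01PM: tests covered "slightly over 1/5th" of machines.
# Treated here as a lower bound of exactly one fifth.
tested_lower_bound = TOTAL_MACHINES / 5

# 10/03 01:05PM: "roughly 100 nodes" still need manual intervention.
needing_manual_work = 100
brought_up = TOTAL_MACHINES - needing_manual_work

print(f"Tested by 10/02 evening: at least {tested_lower_bound:.0f} machines "
      f"({percent(tested_lower_bound, TOTAL_MACHINES):.1f}%)")
print(f"Brought up by 10/03 early afternoon: about {brought_up} machines "
      f"({percent(brought_up, TOTAL_MACHINES):.1f}%)")
print(f"Still needing manual intervention: about "
      f"{percent(needing_manual_work, TOTAL_MACHINES):.1f}%")
```

Because both inputs are approximations, the resulting percentages should be read as estimates rather than reported figures.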

3 months ago - at 10/03/2024 01:57PM

The Hive cluster has been fully tested and is available for use. The scheduler has been un-paused and all queued jobs have resumed. Both RHEL7 and RHEL9 environments on Hive are available for use.
PACE continues to investigate 21 CPU nodes, 10 “nvme” nodes, and 4 “himem” nodes on Hive for errors and will return those to service as soon as possible.

The PACE Team is continuing to test the Phoenix, Firebird, and Buzzard clusters, in that order of priority.

3 months ago - at 10/03/2024 04:00PM

The Phoenix cluster has been fully tested and is available for use. The scheduler has been un-paused and all queued jobs have resumed. Both RHEL7 and RHEL9 environments on Phoenix are available for use.
PACE continues to investigate 54 nodes that we were unable to power on remotely after the outage, including 19 NVIDIA V100 GPU nodes.

Reimbursements will be provided for all paid jobs impacted by the power and cooling outages this week. We will provide the details for reimbursement of paid storage to affected users later this week.

We are also doubling the amount of credits for ALL free-tier accounts on Phoenix for the month of October to offset the impacts of these outages. All Georgia Tech free-tier accounts (starting with gts-) will have a balance of $136 for the month of October; all GTRI free-tier accounts (starting with gtris-) will have a balance of $504.

3 months ago - at 10/03/2024 04:01PM

The Firebird cluster has been fully tested and is available for use. The scheduler has been un-paused and all queued jobs have resumed. Both RHEL7 and RHEL9 environments on Firebird are available for use.
All Firebird nodes are back in service.
