Outage in Georgia Tech IT

Data center cooling issues

Resolved Minor
April 02, 2025 - Started 15 days ago - Lasted 2 days

Outage Details

A cooling controller failed at the data center. Shutting down PACE clusters.
Components affected
Georgia Tech IT Academic Services
Latest Updates (sorted oldest to newest)
15 days ago - at 04/02/2025 09:08PM

A cooling controller failed at the data center. Shutting down PACE clusters.

15 days ago - at 04/02/2025 09:12PM

All Hive nodes are powered off. All jobs failed.
All Buzzard nodes are powered off. All jobs failed (though presumably requeued).
All new jobs on Phoenix are held.
All idle nodes on Phoenix are being turned off.
All Firebird nodes are powered off. All jobs failed.

15 days ago - at 04/02/2025 09:25PM

The controller for the system providing cooling to nodes in the Coda Research Hall has failed. To avoid damage, PACE has urgently shut down many compute nodes to reduce heat.

15 days ago - at 04/02/2025 09:51PM

Due to continued high temperatures, all Phoenix compute nodes have been turned off, and all running jobs were cancelled. Impacted jobs will be refunded at the end of April.

15 days ago - at 04/02/2025 10:25PM

Water pump controller failed, affecting the cooling of the research hall. Support vendor has been engaged and is assessing the situation.

15 days ago - at 04/03/2025 01:47AM

It has been determined that our water pump controller will need to be replaced, and we are currently coordinating with the support vendor on this replacement process.

14 days ago - at 04/03/2025 01:56PM

Our vendors are working to restore cooling capabilities to the data center by fully replacing the cooling system controller, and they expect to have the work completed by 7:00pm ET.

We hope to return all systems to service by tomorrow (Friday) evening, provided that all repairs to the cooling system are complete and the systems have been tested for stability following the shutdown. Clusters will be released as testing is completed for each system.

14 days ago - at 04/03/2025 03:07PM

Some compute nodes on ICE were accidentally powered off last night, which may have impacted some running jobs. We have restored a partial selection of those nodes to service so that all hardware types are available.
There was a brief pause in the scheduler this morning from 9:17am to 9:41am, which may have prevented jobs from starting during that time. Most ICE compute nodes are currently available for course usage.

14 days ago - at 04/04/2025 02:26AM

The controller for the system providing cooling to nodes in the Coda Research Hall has been restored. We have returned to the HTCP lineup and are in normal operation.

13 days ago - at 04/04/2025 12:08PM

The clusters are being powered up and tested. They will be returned to service as soon as they are ready. Updating the status back to "service disruption".

13 days ago - at 04/04/2025 01:12PM

ICE cluster released for user workloads.

13 days ago - at 04/04/2025 02:33PM

Hive cluster released for user workloads.

13 days ago - at 04/04/2025 02:59PM

Firebird cluster released for user workloads.

13 days ago - at 04/04/2025 03:36PM

Phoenix and Buzzard clusters released for user workloads.
