A cooling controller failed at the data center. Shutting down PACE clusters.
All Hive nodes are powered off. All jobs failed.
All Buzzard nodes are powered off. All jobs failed (though presumably requeued).
All new jobs on Phoenix are held.
All idle nodes on Phoenix are being turned off.
All Firebird nodes are powered off. All jobs failed.
The controller for the system providing cooling to nodes in the Coda Research Hall has failed. To avoid damage, PACE has urgently shut down many compute nodes to reduce heat.
Due to continued high temperatures, all Phoenix compute nodes have been turned off, and all running jobs were cancelled. Impacted jobs will be refunded at the end of April.
Water pump controller failed, affecting the cooling of the research hall. Support vendor has been engaged and is assessing the situation.
It has been determined that our water pump controller will need to be replaced, and we are currently coordinating with the support vendor on this replacement process.
Our vendors are working to restore cooling to the data center by fully replacing the cooling system controller, and they expect to complete the work by 7:00pm ET.
We hope to return all systems to service by tomorrow (Friday) evening, provided that all repairs to the cooling system are complete and the systems pass stability testing following the shutdown. Clusters will be released as testing is completed for each system.
Some compute nodes on ICE were accidentally powered off last night, which may have impacted some running jobs. We have restored a partial selection of those nodes to service so that all hardware types are available.
There was a brief pause in the scheduler this morning from 9:17am to 9:41am, which may have prevented jobs from starting during that time. Most ICE compute nodes are currently available for course usage.
The controller for the system providing cooling to nodes in the Coda Research Hall has been restored. We have returned to the HTCP lineup and are in normal operation.
The clusters are being powered up and tested. They will be returned to service as soon as they are ready. Updating the status back to "service disruption".
ICE cluster released for user workloads.
Hive cluster released for user workloads.
Firebird cluster released for user workloads.
Phoenix and Buzzard clusters released for user workloads.