Summary: The cooling system in the Coda datacenter research hall has failed. All PACE compute nodes (Phoenix, Firebird, and ICE) have been shut down to avoid overheating. All running jobs have been cancelled, and no new jobs can start. Storage remains available via Globus, login nodes, and OnDemand.
Please visit https://status.gatech.edu/ for ongoing updates.
Details: The high-temperature cooling tower in the Coda datacenter, which provides cooling to the research hall hosting all PACE compute nodes, has failed. All jobs have been cancelled. To avoid overheating and damage to the systems, all PACE compute nodes have been shut down. The enterprise hall, hosting login and storage nodes, remains cooled. Investigation of the issue is ongoing.
Impact: All running jobs have been cancelled. Refunds will be issued for any job on Phoenix or Firebird cancelled due to this failure. Login nodes and storage remain available. There is no impact to CEDAR storage.
Current actions:
The data center team is actively investigating and working to restore cooling
PACE is monitoring system temperatures closely
We are proactively reducing thermal load
Reservations are in place for Phoenix, ICE, and Firebird
An emergency shutdown procedure is underway
What you should do:
Please avoid submitting new jobs until further notice
Save work and ensure checkpointing is enabled where possible
Monitor the PACE Blog and your email for updates before resuming normal workloads
Next update:
We will provide updates as more information becomes available or if service status changes.
Cooling has been restored to the datacenter. The PACE team is powering on all clusters and will complete validation testing before releasing the systems.
Cooling has been restored to the Coda datacenter after a valve repair. ICE has returned to service after testing. Phoenix and Firebird are being prepared to resume service.
Please resubmit any jobs that were cancelled due to the outage.