Outage in Georgia Tech IT

Cooling issue in the Research Hall in CODA affecting PACE equipment

Resolved Major
September 08, 2024 - Started 20 days ago - Lasted 3 days

Outage Details

The Data Center Operations and DataBank teams are investigating a sudden failure of the cooling systems in the Research Hall that has left the hall without cooling. More information will be provided as it becomes available.
Components affected
Georgia Tech IT Academic Services
Latest Updates (sorted oldest to newest)
20 days ago - at 09/08/2024 10:45AM

The Data Center Operations and DataBank teams are investigating a sudden failure of the cooling systems in the Research Hall that has left the hall without cooling. More information will be provided as it becomes available.

20 days ago - at 09/08/2024 11:42AM

The Data Center Operations and DataBank teams are investigating a sudden failure of the cooling systems in the Research Hall that has left the hall without cooling. More information will be provided as it becomes available.

20 days ago - at 09/08/2024 12:26PM

The Data Center Operations and DataBank teams are investigating a sudden failure of the cooling systems in the Research Hall that has left the hall without cooling. More information will be provided as it becomes available.

20 days ago - at 09/08/2024 01:28PM

The Data Center Operations and DataBank teams are investigating a sudden failure of the cooling systems in the Research Hall that has left the hall without cooling. More information will be provided as it becomes available.

20 days ago - at 09/08/2024 01:52PM

WHAT’S HAPPENING? 
Due to an emergency with a cooling system at the Research Hall, all PACE clusters had to be shut down on the morning of Sunday, September 8, 2024.

Access to login nodes and filesystems (via Globus, Open OnDemand, or direct connection to login nodes) is still available.

WHEN IS IT HAPPENING? 
Sunday, September 8, 2024, starting at 7:30 AM EDT.

WHY IS IT HAPPENING? 
PACE was notified by IOC that temperatures in the CODA building Research Hall were rising due to the failure of a water pump in the cooling system. An emergency shutdown had to be executed to protect equipment. The physical infrastructure provider for our datacenter is evaluating the situation.

WHO IS AFFECTED? 
All PACE users. Any running jobs on ALL PACE clusters (Phoenix, Hive, Firebird, ICE, and Buzzard) had to be stopped at 7:30 AM. For Phoenix and Firebird, by default we will provide refunds for interrupted jobs on paid accounts only. Please let us know if this causes a significant loss of funds that leaves you unable to continue work on your free-tier Phoenix allocation!

WHAT DO YOU NEED TO DO? 
Wait patiently; we will communicate as soon as the clusters are ready to resume work. 

WHO SHOULD YOU CONTACT FOR QUESTIONS? 
For any questions, please contact PACE at pace-support@oit.gatech.edu.

20 days ago - at 09/08/2024 02:24PM

The DataBank team has identified the problem and is estimating a time for repairs.

20 days ago - at 09/08/2024 02:50PM

Due to a failure of the data center cooling system for the Research Hall, all PACE clusters had to be shut down on the morning of Sunday, September 8, 2024. The DataBank team has identified the problem and is working on the repairs. More updates will be provided once we have an estimated time for repairs.

20 days ago - at 09/08/2024 06:04PM

Due to a failure of the data center cooling system for the Research Hall, all PACE clusters had to be shut down on the morning of Sunday, September 8, 2024. The DataBank team has identified the problem and is working on the repairs. More updates will be provided once we have an estimated time for repairs.

19 days ago - at 09/09/2024 01:03PM

Due to an emergency with a cooling system at the Research Hall, all PACE clusters have been shut down since the morning of Sunday, September 8, 2024. While a time frame for resolution is currently unknown, we are actively working with the vendor, DataBank, to resolve the issue and restore service to the data center as soon as possible. We will provide updates as they become available.

19 days ago - at 09/09/2024 09:54PM

Due to an emergency with a cooling system at the Research Hall, all PACE clusters have been shut down since the morning of Sunday, September 8, 2024. The datacenter provider, DataBank, has identified an alternate replacement part, which has been brought onsite and is being deployed and tested. At this time, we estimate that DataBank will have restored cooling to the Research Hall by close of business on Tuesday, September 10, 2024. At that point, PACE will begin powering up, testing infrastructure, and bringing services back online. We plan to provide additional updates on the restoration of services by the evening of Wednesday, September 11, 2024.

17 days ago - at 09/11/2024 12:54PM

During the process of restoring cooling, our data center hosting provider, DataBank, identified additional critical parts that were damaged and had to be replaced. Cooling was restored at 8:43 pm on Tuesday, September 10, 2024, and monitored throughout the night. DataBank gave an all-clear to PACE at 6:00 am on Wednesday, September 11, 2024, to bring systems back online. PACE has started powering up, testing infrastructure, and bringing clusters back online. Updates will be provided throughout the day as services are progressively restored.

17 days ago - at 09/11/2024 02:15PM

ICE has returned to production, and compute jobs can run. Work continues on the research clusters.

Both nodes with AMD MI210 GPUs remain under repair after failing last week. All other ICE node architectures are available.

17 days ago - at 09/11/2024 02:48PM

Hive and Firebird have returned to production, and compute jobs have resumed. Work continues on Phoenix and Buzzard.

17 days ago - at 09/11/2024 03:11PM

Buzzard has returned to production, and compute jobs have resumed. Work continues on Phoenix.
