The GPFS restart on compute-1 has completed and the associated waiter has been cleared. All nodes are now available to Slurm.
Compute-4 has been restarted, and the storage-side deadlock has been cleared. Compute-1 has a different waiter problem, so it is still draining until we can restart GPFS there. We will continue to manage and communicate that status via this status page. All other compute nodes are now available.
The deadlock on compute-3 has now been cleared, and the node is available in Slurm.
Compute-3 is now stuck in a completing state, so we are going to attempt a restart of GPFS there. Any jobs still running there will unfortunately be killed.
Compute-[1-4] are all being affected by a long GPFS waiter on the storage cluster. However, Slurm jobs continue to run there, so we are attempting to resolve the issue without killing all the jobs. We need to restart GPFS on those nodes, so we are currently draining compute-1 and -4 as a first step. If the situation deteriorates further, we may be forced to kill all jobs on those nodes so we can restart GPFS on all four nodes.