This incident has been resolved.
We are continuing to monitor for any further issues.
The compute nodes have once again been resumed in Slurm.
Compute nodes are draining again. An investigation is underway.
We found numerous long waiters and defunct Java processes across all nodes. Restarting GPFS on gpu-0, which had drained, cleared all the deadlocks, waiters, and defunct processes.
All compute nodes have been checked and are now resumed in Slurm.
The compute-[0-4],gpu-0,vgpu-2 nodes are all in a draining state due to "Kill task failures". They will be investigated and resumed in Slurm shortly.