This incident has been resolved.
GPFS has been restarted on compute-4, which has cleared the long waiters. compute-4 has now been resumed in Slurm.
At AgR's request, we are killing the jobs on compute-4 and will restart GPFS there shortly. This is an attempt to clear the long waiters and return compute-4 to service in Slurm.
We have found long waiters and a deadlock on compute-4. The node is now being drained in Slurm in preparation for a GPFS restart there, possibly tomorrow. If further issues develop, we may need to force that restart, which will kill all jobs running on compute-4.