This incident has been resolved.
We have made some network configuration changes on gpu-0 that have reduced the MLAG failover frequency. We will monitor for any regression. gpu-0 is now available again in Slurm.
During last week's network issues, we discovered that gpu-0 had its own set of unrelated problems, so it has been drained. The node is suffering from frequent MLAG failovers on its bonded interface. We will be reseating and testing cables early this week.