All bare metal compute nodes have now had the network config change applied, and packet loss is no longer occurring. In addition, we have identified a workload that was causing slow I/O response from the filesystem. The workload has been removed whilst we work to improve it.
The network config change has now been applied successfully to compute-2 and compute-4. Nix is now running better on these two nodes. We will continue to apply the same change to the remaining compute nodes as they become available.
We are continuing to monitor for any further issues.
The cluster and login nodes now appear to be stable and performant, although Nix may be slow on some compute nodes (2 and 4). We will be implementing a network config change on each compute node in a rolling fashion to minimise the impact. This requires draining each node in the Slurm cluster, one at a time.
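As a rough illustration of the one-node-at-a-time drain described above, the Python sketch below builds the sequence of Slurm commands for each node without running anything. The helper name `rolling_drain_plan`, the reason string, and the placeholder step for applying the change are assumptions made for illustration; they are not the operators' actual procedure.

```python
from typing import Iterable, List


def rolling_drain_plan(
    nodes: Iterable[str], reason: str = "network config change"
) -> List[List[str]]:
    """Return, for each node in order, the steps for a one-at-a-time drain.

    Hypothetical helper: it only builds command strings and never runs them.
    """
    plan = []
    for node in nodes:
        plan.append([
            # Stop new jobs from being scheduled on the node.
            f'scontrol update NodeName={node} State=DRAIN Reason="{reason}"',
            # Placeholder: the actual change is site-specific and not shown here.
            f"# wait for running jobs on {node} to finish, then apply the change",
            # Return the node to service before moving on to the next one.
            f"scontrol update NodeName={node} State=RESUME",
        ])
    return plan


if __name__ == "__main__":
    for steps in rolling_drain_plan(["compute-2", "compute-4"]):
        print("\n".join(steps))
        print()
```

Because each node is resumed before the next is drained, only one node is out of service at any time, which is what keeps the impact on the cluster small.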
We are continuing to see slow responses from Slurm and Nix, but they appear to be intermittent. Investigation continues.
We made a network configuration change to a single node last night. The cluster has been stable overnight, with some load on it. We'll continue monitoring today as the load increases. We will make the same change to the other bare metal compute nodes, in a rolling fashion, as they become available.
We have found evidence of network packet loss again and are continuing to investigate.
We are currently investigating this issue.