
Outage in AgResearch eRI

compute-[1-4] nodes affected by a GPFS long waiter

Resolved · Minor
January 12, 2026 · Lasted 6 days
Official incident page

Incident Report

Compute-[1-4] are all affected by a GPFS long waiter on the storage cluster. However, Slurm jobs continue to run on those nodes, so we are attempting to resolve the issue without killing them. We need to restart GPFS on the affected nodes, so as a first step we are draining compute-1 and compute-4. If the situation deteriorates further, we may be forced to kill all jobs on those nodes so that we can restart GPFS on all four.
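For context, a minimal sketch of how an operator might inspect and contain this kind of issue on a GPFS/Slurm cluster. These are standard GPFS and Slurm administrative commands, but the exact sequence is an illustration, not the team's actual runbook; node names are taken from this incident:

    # Inspect outstanding GPFS waiters on an affected node (run as root)
    mmdiag --waiters

    # Drain a node in Slurm: running jobs keep going, new jobs stay away
    scontrol update NodeName=compute-1 State=DRAIN Reason="GPFS long waiter"

    # See which jobs are still running on the draining node
    squeue --nodelist=compute-1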

Latest Updates (most recent first)
RESOLVED · 01/18/2026 08:05 PM

The compute-1 GPFS restart has now been completed and the associated waiter has been cleared. All nodes are now available to Slurm.
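One way to verify that state, as an illustrative check using standard Slurm and GPFS commands (not taken from the incident itself):

    # Confirm the nodes are back in service from Slurm's point of view
    sinfo --nodes=compute-[1-4]

    # Confirm no long waiters remain on the GPFS side (run on each node)
    mmdiag --waiters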

MONITORING · 01/14/2026 07:53 PM

Compute-4 has now been restarted and the storage-side deadlock has been cleared. Compute-1 has a different waiter problem, so it is still draining until we can restart GPFS there; we will continue to manage and communicate that status via this status page. All other compute nodes are now available.

IDENTIFIED · 01/13/2026 01:27 AM

The deadlock on compute-3 has now been cleared; the node is available in Slurm.

IDENTIFIED · 01/13/2026 01:18 AM

Compute-3 is now stuck in a completing state, so we are going to attempt a restart of GPFS there. Any jobs still running there will unfortunately be killed.
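For readers unfamiliar with the procedure, a restart like this would typically look something like the following. This is a minimal sketch using standard GPFS and Slurm commands; the exact steps the team ran are not documented here:

    # Stop the GPFS daemon on the stuck node
    mmshutdown -N compute-3

    # Start GPFS again and confirm the daemon reaches the active state
    mmstartup -N compute-3
    mmgetstate -N compute-3

    # Return the node to service in Slurm once the filesystem is healthy
    scontrol update NodeName=compute-3 State=RESUME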

IDENTIFIED · 01/12/2026 10:01 PM

Compute-[1-4] are all affected by a GPFS long waiter on the storage cluster. However, Slurm jobs continue to run on those nodes, so we are attempting to resolve the issue without killing them. We need to restart GPFS on the affected nodes, so as a first step we are draining compute-1 and compute-4. If the situation deteriorates further, we may be forced to kill all jobs on those nodes so that we can restart GPFS on all four.
