Outage in Farm HPC cluster

Farm: nas-12-2 down due to multiple disk failures

Resolved Major

March 31, 2025 - Started 11 months ago - Lasted 23 days

Incident Report

nas-12-2 has suffered from multiple disk failures. Admins are investigating the best path forward.

The following group directories are currently unavailable: awhitehegrp, millermrgrp, millsgrp, runciegrp, weimergrp, yujingrp

The following home directories are unavailable: aavalos7, awhitehe, barao, bcbaikie, bcweimer, berdeja, crice, crios, cschles, dglemay, djprince, dkblaufu, drbandoy, eabernat, ecgranad, edkoch, emmaluu, eoziolor, fengq, hahudson, hemstrow, hxhu, jagill, jajpark, jamcgirr, jassim, jcariute, jdowen, jenwash, jmiller1, jroach, jrwashab, jxnliu, katng23, ljcohen, madarm11, mam12n, mary363, millermr, mlyjones, mmosmond, motch, mtreiber, namcnabb, nmariano, nreid, pjseba, profeta, prvasque, psbapat, rsbrenna, sakre, saumyaw, scsastry, seboles, sejoslin, smhigdon, spatel23, tmbolt, vfbetsis, vpdunne, wolfie12, xmixu, yoxue, ytakim, ywdong


Latest Updates (most recent first)

RESOLVED 10 months ago - at 04/24/2025 12:19AM

nas-12-2 recovery was successful. We were able to scrape enough data from the failing drives that ZFS was able to rebuild onto new drives.

MONITORING 10 months ago - at 04/11/2025 11:10PM

nas-12-2's resilver finished ahead of schedule, so we have re-enabled it for use in Farm. One disk kicked off another resilver, but it is the only disk in that vdev with any issues, so we feel pretty comfortable allowing that to happen in the background.

If you run into issues with nas-12-2, please open a Farm Support Ticket.

IDENTIFIED 10 months ago - at 04/09/2025 05:28PM

Disk replacements were successfully performed yesterday, and data reconstruction onto them is in progress. ZFS is currently estimating it will finish in a little over three days, so the best-case estimate is nas-12-2 will be available for use late Saturday evening. We will provide more updates as the reconstruction progresses.
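As a rough check of the estimate above, the arithmetic can be sketched in Python. The update timestamp comes from this report; treating "a little over three days" as 3 days and 4 hours is purely an assumption for illustration, not a figure reported by ZFS.

```python
from datetime import datetime, timedelta

# Timestamp of the status update that quoted the estimate (from this report).
posted = datetime(2025, 4, 9, 17, 28)

# Assumption: "a little over three days" is modeled as 3 days + 4 hours.
remaining = timedelta(days=3, hours=4)

# Projected completion time of the data reconstruction (resilver).
eta = posted + remaining

# Print the weekday and time so it can be compared with the stated ETA.
print(eta.strftime("%A %m/%d/%Y %I:%M%p"))
```

Under that assumption the projected finish falls on Saturday evening, which is consistent with the best-case estimate stated in the update.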

IDENTIFIED 10 months ago - at 04/07/2025 09:49PM

The ZFS pool scrub (data verification) is in progress. As you can imagine, 409 TB of data takes a while to verify. The current ETA is that it will finish sometime late tonight. This scrub has caused 3 additional hard drives to drop out. The executive decision has been made to replace those drives before allowing users to access the pool. The estimate is an additional 3 days for those drives to have all the data reconstructed onto them, so our best-guess ETA for return-to-service is late this week. We will post additional updates as the disk replacement proceeds.
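To give a sense of why verifying 409 TB takes days, here is a back-of-the-envelope sketch. The pool size is from the update above; the sustained read rates are hypothetical values chosen only for illustration (the actual throughput of nas-12-2 is not stated), and 1 TB is assumed to mean 10^12 bytes.

```python
def scrub_hours(pool_bytes: float, bytes_per_second: float) -> float:
    """Hours needed to read pool_bytes once at a constant rate."""
    return pool_bytes / bytes_per_second / 3600


# 409 TB from the update; decimal terabytes are an assumption.
POOL_BYTES = 409e12

# Hypothetical sustained scrub rates in bytes per second (not measured values).
ASSUMED_RATES = {"0.5 GB/s": 0.5e9, "1 GB/s": 1e9, "2 GB/s": 2e9}

for label, rate in ASSUMED_RATES.items():
    hours = scrub_hours(POOL_BYTES, rate)
    print(f"{label}: {hours:.1f} hours ({hours / 24:.1f} days)")
```

Even at the fastest of these assumed rates, a single pass over the pool takes more than two days, and a scrub on a pool with failing drives may well run slower than a constant-rate model suggests.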

IDENTIFIED 11 months ago - at 04/05/2025 12:47AM

The ZFS scrub has been started and is being watched carefully.

MONITORING 11 months ago - at 04/04/2025 09:31PM

As tends to happen with failing hard drives, data recovery went slower than hoped. Two drives had 100% of their data recovered, and a third had 99.99% recovered. The last drive failed too badly to recover any data from, but that is okay: ZFS should be able to reconstruct everything it needs from the first three. A ZFS scrub (data verification) is in progress. When it finishes, likely early next week, we will know for sure the state of all the data on nas-12-2.

IDENTIFIED 11 months ago - at 04/02/2025 08:59PM

Summary: nas-12-2 could be online Friday at the earliest, but more likely early next week.

In consultation with Adam Getchell, the decision has been made to perform a low-level disk copy from the old, failing drives to new drives. This will minimize the potential for data loss.

This process is expected to finish Thursday at the earliest. Subsequently, the new disks will be added back to the ZFS pool, and we will trigger a full ZFS data scrub. When that finishes, we will know exactly how much, if any, data loss there is and which files are impacted. That data scrub will take a minimum of 24 hours, so the earliest nas-12-2 could be back in service is late Friday. It is more likely the scrub will run through the weekend, so a more realistic return-to-service is early next week.

IDENTIFIED 11 months ago - at 04/01/2025 01:09AM

nas-12-2 has suffered from multiple disk failures. Admins are investigating the best path forward.

The following group directories are currently unavailable:

awhitehegrp
millermrgrp
millsgrp
runciegrp
weimergrp
yujingrp

The following home directories are unavailable:

aavalos7
awhitehe
barao
bcbaikie
bcweimer
berdeja
crice
crios
cschles
dglemay
djprince
dkblaufu
drbandoy
eabernat
ecgranad
edkoch
emmaluu
eoziolor
fengq
hahudson
hemstrow
hxhu
jagill
jajpark
jamcgirr
jassim
jcariute
jdowen
jenwash
jmiller1
jroach
jrwashab
jxnliu
katng23
ljcohen
madarm11
mam12n
mary363
millermr
mlyjones
mmosmond
motch
mtreiber
namcnabb
nmariano
nreid
pjseba
profeta
prvasque
psbapat
rsbrenna
sakre
saumyaw
scsastry
seboles
sejoslin
smhigdon
spatel23
tmbolt
vfbetsis
vpdunne
wolfie12
xmixu
yoxue
ytakim
ywdong

