One of the 1.5TB WD15EARS (Western Digital Green) drives in the Synology DS410 running firmware DSM 4.2-3211 failed earlier this week. Recovering the RAID5 volume ended up being a little tricky thanks to ext3 journal corruption on the “/volume1” filesystem (block device /dev/md2).
The standard steps for replacing the failed drive are simple:
- Turn off the beeping sound
- Remove the failed drive
- Replace it with a drive of the same or greater storage capacity
- Run the “extended” SMART test on the replacement drive. The test takes about a day to complete, followed by another day for the RAID sync
- Repair the volume
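The two long-running steps above, the extended SMART test and the RAID sync, can also be followed from a shell on the NAS. The sketch below assumes smartctl is available and that the replacement disk shows up as /dev/sdd (a hypothetical device name), so those commands are left as comments. The md_sync_status helper is an illustration, not a DSM tool: it pulls the rebuild percentage for one array out of /proc/mdstat.

```shell
# Assumed, not from the original post: start and review the extended self-test.
#   smartctl -t long /dev/sdd        # /dev/sdd is a hypothetical device name
#   smartctl -l selftest /dev/sdd    # shows the self-test log once it finishes

# md_sync_status ARRAY [FILE]
# Prints the recovery/resync percentage for ARRAY (e.g. md2), or "idle"
# when no rebuild progress is reported. FILE defaults to /proc/mdstat.
md_sync_status() {
    awk -v dev="$1" '
        $1 == dev && $2 == ":" { found = 1; next }   # first line of this array
        found && /^[^ \t]/     { found = 0 }         # next unindented line ends it
        found && /(recovery|resync) *=/ {
            if (match($0, /[0-9.]+%/)) {
                print substr($0, RSTART, RLENGTH)
                hit = 1
            }
        }
        END { if (!hit) print "idle" }
    ' "${2:-/proc/mdstat}"
}
```

Calling md_sync_status md2 every minute or so in a loop gives a rough progress readout while the volume repair runs.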
After replacing the failed drive and repairing the RAID volume, “/volume1” ended up with ext3 journal inconsistencies. I had to run “e2fsck” manually from the command line to check the filesystem and recover the journal:
- Shut down all running processes
- Unmount /volume1 (on /dev/md2)
- Unmount /opt (also on /dev/md2). A running sshd prevented me from unmounting /opt. The fix was to run an alternate sshd from /usr/sbin/ on an alternate port and kill the original sshd running from /opt/
- Run e2fsck -v -y -f /dev/md2. It took a full day to fsck the 4TB volume
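Written out as commands, the steps above look roughly like the following. This is a sketch rather than a transcript of the session: the port 2222 default is a hypothetical choice, and fsck_steps is a made-up helper that only prints the sequence so each command can be run by hand and checked before moving on.

```shell
# fsck_steps [PORT]
# Prints the commands for the offline check in order; it runs nothing itself.
# e2fsck flags: -v verbose, -y answer "yes" to every prompt,
#               -f force a check even if the filesystem looks clean.
fsck_steps() {
    port="${1:-2222}"   # hypothetical alternate sshd port, not from the post
    cat <<EOF
/usr/sbin/sshd -p $port
# log in again on port $port, then stop the sshd started from /opt
umount /volume1
umount /opt
e2fsck -v -y -f /dev/md2
EOF
}
```

Because -y accepts every repair e2fsck proposes, it is worth having a backup of anything critical before running the last command.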
The volume looks healthy after the fsck and there are no more beeping sounds. More importantly, my data appears to be intact.
Learnings from the disk failure:
- Ensure lsof is installed on the NAS. It can greatly speed up identifying processes that are holding a volume open and preventing it from being unmounted cleanly. Package installation won't be possible once /opt goes read-only thanks to journal corruption
- Force a full fsck regularly to ensure file systems are healthy
- Rather than have one large 4TB RAID5 volume spanning 4 disks, have 2 smaller volumes to speed up fsck times and spread the risk.
- Have a 3TB external drive to back up critical data before attempting repair
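If lsof is missing when it is needed, the kernel's /proc tree holds much of the same information. Below is a minimal sketch, assuming a Linux /proc layout and a readlink binary (common on BusyBox systems, though this has not been verified on DSM 4.2). The function name holders and its optional second argument are made up for illustration. It only looks at each process's working directory, executable and open file descriptors, so memory-mapped files are not covered.

```shell
# holders MOUNTPOINT [PROC_ROOT]
# Prints the PID of every process whose cwd, executable, or an open file
# descriptor points at or below MOUNTPOINT. PROC_ROOT defaults to /proc.
holders() {
    mnt="${1%/}"                  # drop a trailing slash, if any
    proc_root="${2:-/proc}"
    for p in "$proc_root"/[0-9]*; do
        [ -d "$p" ] || continue
        for link in "$p/cwd" "$p/exe" "$p"/fd/*; do
            # Unreadable or missing links are skipped silently.
            target=$(readlink "$link" 2>/dev/null) || continue
            case "$target" in
                "$mnt"|"$mnt"/*)
                    echo "${p##*/}"   # directory name is the PID
                    break             # one hit per process is enough
                    ;;
            esac
        done
    done
}
```

Run as root so other users' fd directories are readable, for example holders /opt before attempting umount /opt.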
Update (20130927): A second WD hard disk failed earlier this week, causing another round of journal corruption and re-syncing. At this point, it made more sense to perform a full backup and restore over a fresh install of DSM with SHR, along with an extended SMART test. I am replacing the failed 1.5TB drives with 3TB WD Caviar Greens, and SHR can make use of the additional space automatically.