RAID and stupidity

My RAID5 setup broke again. The controller told me that one of the hard disks had failed. OK, no biggie, that’s what RAID5 is there for: all the data is safe. Sure, but Murphy’s Law states that if something can break, it will, and at the most inconvenient time. So another PC broke as well, with a bug so weird that to this day I have found no way to fix it (short of a format). In my frustration, I left that PC alone and went to fix my RAID.

Big mistake. Never troubleshoot when you are frustrated, sleepy, or otherwise impaired in your judgment. What follows is in chronological order, but keep in mind that the whole time I was messing with the wrong disk, just so you can laugh at me.

I removed the disk that had “failed” (which I now know was a healthy one). I went back in to check on my data and found it inaccessible. Befuddled, I wanted to see if it was just a glitch, so I took the “failed” disk to my main PC and ran chkdsk, which did not find any errors. I plugged it back into the RAID and tried a rebuild. Nope… no luck. OK then, maybe if I format it and check it again. Back to the main PC for a low-level format, which went very well (yes, I formatted a healthy disk…). Back to the RAID and rebuild, nothing!

All the while, the RAID controller was showing me info that should have woken me up, but I was too absorbed in reviving the “failed” disk to notice.

OK, back to the main PC to format it again, make a system volume, test-copy some files and run chkdsk. All worked perfectly! Now I was really starting to lose it. Back in the RAID and rebuild… nothing (of course…). At this point I gave up and pressed my laptop into service as temporary storage, thinking that the disk was indeed lost and planning to send it back to WD for a replacement.

Thankfully, I did not send it. Instead, in a moment of clarity, I decided to troubleshoot each disk separately. It was during this process that I finally possessed enough clarity to call myself a true idiot and do what I should have done in the first place.

Lessons to be learned:
* Know the logical numbering of your controller’s ports, i.e. which port is port0, which is port1, etc.
* Test drives independently and LISTEN to your controller’s console.
* Never troubleshoot when you are not 100% concentrated.
* When one drive fails in a RAID5 array and you cannot access your data, you have removed the WRONG disk. Slap yourself awake and pray you caught it in time.
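
For anyone wondering why that last lesson holds, here is a minimal Python sketch of the idea behind RAID5 parity. It is only an illustration under simplifying assumptions: a hypothetical three-disk array, a single stripe, and made-up block contents. Real controllers rotate parity across the disks and work on much larger blocks, and the `xor_blocks` helper is my own name, not any controller's API.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 3-disk array: two data blocks plus one parity block.
d0 = b"RAID"
d1 = b"five"
parity = xor_blocks([d0, d1])

# One disk missing (say the one holding d1): the two survivors rebuild it.
rebuilt_d1 = xor_blocks([d0, parity])
print(rebuilt_d1 == d1)

# The same works if the disk holding d0 is the missing one.
rebuilt_d0 = xor_blocks([d1, parity])
print(rebuilt_d0 == d0)

# Two disks missing (one really dead, one healthy disk pulled by mistake):
# a single surviving block is not enough to reconstruct the other two,
# so the array goes offline until the healthy disk is put back.
```

So with one disk gone the data should still be readable. If it is not, the array is missing more than one member, and the likeliest culprit is the person who just pulled a disk.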
