Monthly Archives: September 2006

LVM – From Failure to Recovery

Recently, our LVM died. Yes, our 2.2TB (terabytes, that’s right!) LVM died. How did it die, you ask? Simple: a little bad luck, a little bit of idiocy, and a little trial and error. This is the story of how not to fail disks and the recovery that was required.

The Story

So one seemingly happy night last week, my roommate Chris and I were each in our own office, watching television as usual. I was also copying the 9 seasons of The X-Files I had just obtained over to our file server, named “rhubarb”. Everything is stored on LVM (with 2 underlying RAID 5 arrays: one software, 750GB, and one hardware, 1.45TB), which resides at /dev/LVM/BigDisk, and we affectionately refer to it as “BigDisk” or “rhubarb”.
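For anyone checking my math on the 2.2TB, here is a quick sketch of how the sizes add up. The 4-disk count for the software array is mentioned further down; the 250GB per-disk size is my own back-of-the-envelope assumption that makes the 750GB figure work, and the function name is just for illustration.

```python
def raid5_usable_gb(disk_count: int, disk_size_gb: float) -> float:
    """Usable space in a RAID 5 array: one disk's worth of capacity goes to parity."""
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (disk_count - 1) * disk_size_gb


# Software array: 4 disks; 250GB each is an assumption, not a stated figure.
software_raid_gb = raid5_usable_gb(4, 250)

# Hardware array: only the usable size (1.45TB) is known, so use it directly.
hardware_raid_gb = 1450

# LVM simply pools both arrays into one logical volume.
bigdisk_tb = (software_raid_gb + hardware_raid_gb) / 1000

print(f"software RAID 5: {software_raid_gb:.0f}GB, BigDisk total: {bigdisk_tb:.2f}TB")
```

The catch with pooling the arrays this way is that BigDisk depends on both of them: lose either array and the whole volume is in trouble.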

In the middle of all this watching and copying, out of the blue, the show I was watching from BigDisk stopped playing. It was later in the evening (around 11pm, I believe), and because I wanted to watch the rest of the episode of Ed, I tried to restart it. No luck!

After trying several other methods which did not include my movement from the room I was in, I decided that it’d be okay to run downstairs to have a look. So I threw on some clothes and headed downstairs hoping for some sort of quick fix to get my episode of Ed to restart.

On my way down the stairs, Chris hollered at me and asked if I’d done anything to rhubarb. I said I hadn’t and then proceeded downstairs. This, my friends, is where the hilarity ensues, and I highly recommend against what I did that night.

The Error

It appeared everything was in working order, but it obviously wasn’t. Hard drives appeared to be spinning, but some weren’t. After some digging and realizing we couldn’t access the LVM (formatted as JFS, btw), we decided it was time to reboot and see what was going on with the LVM. Normally, I don’t reboot, so a good 20 minutes went by before any thoughts went that way.

The reboot revealed that a power supply had failed. At the time, we had 3 different power supplies providing power to all of the different drives in the system. We weren’t sure which one had failed, so we started eliminating them, one by one. Problem was, we didn’t really have any reliable power supplies with which to test, and this was my biggest mistake. Yes, I take full credit for what happened next. It was a very idiotic move, so don’t you do it. Mind you, these are IDE drives all set up on a software RAID 5, over 4 disks. We found a couple of other power supplies and hooked them up.

Nothing went wrong: no problems, the drives were recognized on boot, and it seemed okay. So it was up and running, for about 2 minutes. We restored the LVM, ran jfs_fsck, and, because my roommate is a Windows admin and it’s the only way he’d done it before, the machine “needed” to be rebooted.

Nothing came up!!

I decided it would be okay to use an extender and one of the power cables from the power supply sitting right next to rhubarb. But of course, if you haven’t figured it out by now, it was running!!!

I proceeded to try to power some of the drives with this power supply using the molex connector.

SHOCK!!
(blue light actually was visible from the molex connector to the drive)

I am pretty sure it was dead then, but, dummy me, I decided to continue. Because there was more than one molex connector, I went ahead and connected the other molex to another drive.

SHOCK!!
(oh boy, am I stupid, why can’t I quit when I am ahead. It’d only be the morning before I could go buy a new power supply to fix the problem)

After that, nothing would come up. Yes, rhubarb would boot, but no BigDisk. Boo hoo… No Ed!

In the aftermath of this disaster, my roommate was at least consoling. He said he probably would’ve done the same thing, but I still think there was no excuse. Luckily for me, that’s not the end of the story.
