UPDATE II: I didn’t even have to start combing through the logs: after about 3 minutes of running, the console displayed “BUG: soft lockup – CPU#0 stuck for 10s!” with a CPU register dump of the process “quotacheck.” A quick google shows me that the kernel version I have does indeed have issues with disk quotas. I disabled the disk quotas and rebooted. It’s been running for 20 minutes now without incident so I think it’s fixed.
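For the record, turning quotas off is just an fstab edit plus a remount: the `usrquota`/`grpquota` mount options are what enable quota accounting, so removing them does the trick. The snippet below uses the /export/home volume as the example; which filesystem(s) actually had quotas enabled on dustpuppy is beside the point here.

```
# /etc/fstab — before: quota accounting enabled on this filesystem
/dev/VolGroup00/LogVol12  /export/home  ext3  defaults,usrquota,grpquota  1 2

# after: quota options removed; takes effect on the next (re)mount
/dev/VolGroup00/LogVol12  /export/home  ext3  defaults  1 2
```

You can also run `quotaoff -a` to turn accounting off immediately, but since I was rebooting anyway, the fstab edit alone covered it.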
UPDATE: At the 1 hour and 35 minute mark, the fsck on /var/spool finished. After the mandatory reboot, dustpuppy came up with no further difficulties. I verified that both network interfaces were up and the bonding pseudo-interface was enabled. I restarted INND (’cause it never starts right the first time after a crash; it sees the leftover lockfiles and balks).
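If you’re curious what “never starts right” looks like in practice: innd won’t come up while its old pidfile is still sitting there. The cleanup is a one-liner once you know where the lock lives. The path below is an assumption for a stock RHEL inn install; the real location is governed by the pathrun setting in inn.conf.

```shell
# Assumed lock location for a stock RHEL inn package; check pathrun in inn.conf.
LOCK=${LOCK:-/var/run/news/innd.pid}

# Only remove the lock if the PID inside it is no longer alive --
# i.e., it's a leftover from the crash, not a running server.
if [ -e "$LOCK" ] && ! kill -0 "$(cat "$LOCK")" 2>/dev/null; then
    rm -f "$LOCK"
fi
# Then restart the daemon, e.g.: /etc/init.d/innd restart
```

The liveness check matters: blindly `rm -f`-ing the pidfile on a box where innd is actually running is how you end up with two daemons fighting over the spool.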
Now I’m gonna start combing through the logfiles to see WTF happened to wedge the system that badly.
I keep a spare LCD panel and keyboard attached to dustpuppy (my RHEL server) just in case I need console access. Today, I’m glad I do keep said equipment connected. I awoke to a very wedged server. And I do mean VERY wedged, as in I needed to employ the medium-sized black switch to reboot. (Little white switch = reset button, medium-sized black switch = power button, big black switch = circuit breaker, Big Red Switch = emergency power cutoff. Most systems don’t have Big Red Switches anymore except for large machine rooms. The critical part of the definition of a Big Red Switch is that it will cost big bucks to recover from the results of pushing one – like replacing damaged power supplies when the IBM Mainframes fire bolts into their main transformers during a SCRAM or cleaning up fire suppressant foam.)
Anyway, after the hard shutdown, I rebooted and was told that my Linux MDRAID-5 array was dirty and needed to be rebuilt. Okay, fine. That’s nothing new – no disks failed so it should be a simple resync from parity data. Then the fscks started. All my filesystems are ext3 (journaling) except for /tmp, /var/spool, and /boot. /boot is ext2 because it can be damn tiny that way and it’s normally mounted read-only so it never has to fsck on startup. /tmp isn’t worth the overhead of journaling – it gets nuked on startup anyway. /var/spool was ext3, but I disabled and deleted the journal after benchmark testing showed a roughly 130% slowdown with journaling enabled (spool operations took more than twice as long). Being that mail serving is one of the system’s primary roles, I decided that having a slow spool just wouldn’t cut it.
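The journal-removal procedure itself is two commands against an unmounted filesystem. Rather than point at a real device, the sketch below demonstrates it on a scratch image file (any `/dev/...` path in your fstab would take its place; the image-file setup is just so the example is safe to run anywhere):

```shell
# Demonstrate ext3 -> ext2 conversion on a scratch image instead of a real
# device (the image stands in for something like /dev/VolGroupXX/LogVolXX).
IMG=$(mktemp)
dd if=/dev/zero of="$IMG" bs=1M count=16 2>/dev/null
mke2fs -q -j -F "$IMG"              # make an ext3 filesystem (ext2 + journal)

# The actual conversion -- the filesystem must be unmounted first:
tune2fs -O ^has_journal "$IMG"      # drop the has_journal feature -> plain ext2
e2fsck -fy "$IMG" >/dev/null || true   # force a full check after the change
dumpe2fs -h "$IMG" 2>/dev/null | grep 'features'   # has_journal should be gone
```

Re-adding the journal later is the same trick in reverse: `tune2fs -j` on the unmounted filesystem.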
Now, one hour into fscking /var/spool, I’m rethinking that decision. The journals were replayed for all the ext3 filesystems with no errors. /dev/VolGroup00/LogVol12 (/export/home) took a little while to replay its journal (~10 minutes), but that’s because it’s huge and does a ton of read/write transactions. The only two filesystems that needed a manual fsck were /tmp and /var/spool. /tmp is only 2GB, so its fsck took all of 20 seconds. /var/spool is a dense filesystem with a hashed B-tree index; it contains around 400,000 files (spools for mail, news, print, samba, nfs, mailman, yum). A good chunk of the news spool (all newsgroups starting with alt.n*, about 6,000 files) had multiply-claimed blocks. Well, no problem, just clone them. One hour and 5 minutes later, it’s still cloning. See, I found out that for each multiply-claimed block, the cloner makes a pass for each inode that claims it. Since all 6,000 files claim the same set of blocks, that works out to 6,000 passes per inode, 36,000,000 passes total. Yeah. Good luck with that. I’d cancel and reformat if the mail spool weren’t also on that filesystem. The mail spool holds 6 years’ worth of e-mail transaction data that needs to be kept. It’s all backed up, of course, but I’d rather not nuke the filesystem and restore from backup just because fsck takes too long.
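To put numbers on that blow-up: the cloning work grows quadratically with the number of entangled files, since every file’s clone needs a pass for every other inode claiming the same blocks. A quick back-of-the-envelope, using the post’s ~6,000-file figure:

```python
# Back-of-the-envelope for the fsck cloning blow-up described above.
# Assumption (as in the post): ~6,000 news-spool files all tangled up
# in the same set of multiply-claimed blocks.

def cloning_passes(n_files: int) -> int:
    """Each file's clone needs a pass per claiming inode, so total work
    grows quadratically with the number of entangled files."""
    passes_per_inode = n_files   # one pass for each inode claiming the blocks
    return n_files * passes_per_inode

print(cloning_passes(6_000))     # 36,000,000 -- matching the estimate above
```

Which is why a mere 1.5% of the spool’s files (6k of 400k) could keep fsck grinding for hours.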
It looks like today is going to be a 4 cup day. I’ll keep you posted.