LoungeLizard
February 16th, 2009, 05:15 AM
I recently built a new machine with a 500GB PATA HD as the primary, an IDE DVD R/W, a mobo with an nVidia SATA RAID controller, an AMD64 X2 processor, and 4GB of RAM. I added four 320GB SATA drives and configured them in the BIOS as RAID5, giving me a little over 900GB of storage space. I installed Intrepid 64-bit, partitioning the primary drive with 5GB of swap and all the rest as root. Then I had to apt-get install gparted and dmraid - my install didn't automagically recognize the RAID as a single drive, and I had to configure the drives manually - so I was left to my wits and my friend Google to figure out how to do the 'fakeRAID' thing. After I got dmraid to map the drives and GParted to create one giant partition I called "data" and mounted in the root tree as /data, I set up Samba and created a share for the RAID folder. Then I upgraded everything to the latest and greatest with Synaptic, and within a total time of about six hours I had my jammin' fileserver with a future database ready to rock. I was happy; all seemed fine and dandy. ...this was yesterday...
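For what it's worth, the "a little over 900GB" figure is just the standard RAID5 arithmetic (this is back-of-the-envelope math, not anything I actually ran):

```shell
# RAID5 spends one drive's worth of space on parity, so usable = (n - 1) * size
drives=4
size_gb=320                               # each SATA drive, in decimal GB
usable_gb=$(( (drives - 1) * size_gb ))
echo "${usable_gb} GB usable"

# Partitioning tools usually report binary units (GiB), which reads smaller:
usable_gib=$(( usable_gb * 1000000000 / 1073741824 ))
echo "~${usable_gib} GiB as shown by most tools"
```

So the BIOS quoting ~960 GB decimal and a partitioner showing ~894 GiB are the same array.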
Late last night, I started a file transfer job from a winders box that has an old Maxtor drive that's down to 19% according to the health monitor (it has developed a number of bad sectors and really needs to be decommissioned; failure is imminent). When I got up this morning, the task had crapped out at some point on the winders side, complaining of a CRC issue, and I started it up again. It went on a few more hours, and then the Intrepid machine froze and locked up - no keyboard or mouse movement possible, the gkrellm monitor frozen. WTF... I waited a short time (~20 minutes or so) and cut power after no change or response to any keyboard attempt.
I restored power and rebooted to find my primary drive missing in the POST. Went into the BIOS, looked around, drive wasn't showing up... then exited, and magically the drive appeared at the subsequent POST. This is not a good sign; all the hardware is brand-spanking new. I went to continue the reboot; it found GRUB and entered stage 1.5 - but choked with the dreaded "no init found" and dropped into the BusyBox shell. I reloaded, thinking that my primary drive's UUID had somehow gotten corrupted, intercepted GRUB, and forced a plain no-UUID boot, which got me booted successfully. But I noticed a lot of error messages concerning the RAID drives during the boot process (I had removed quiet and splash, so I could see what was happening). Going into syslog, I found some of this:
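For anyone wondering, the no-UUID boot just meant editing the kernel line at the GRUB prompt; roughly this change to the menu.lst entry (the UUID and kernel version here are placeholders, not my real ones):

```
# /boot/grub/menu.lst - original entry (placeholder UUID)
kernel /boot/vmlinuz-2.6.27-x-generic root=UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx ro quiet splash

# edited to point at the device node directly, with quiet/splash
# removed so the boot messages stay visible:
kernel /boot/vmlinuz-2.6.27-x-generic root=/dev/sda1 ro
```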
[ 21.827569] ata1.00: cmd 25/00:a8:5f:13:c0/00:00:00:1c:00:00/e0 tag 0 dma 86016 in
[ 21.827570] res 51/40:00:eb:13:c0/40:00:00:1c:00:00/e0 Emask 0x9 (media error)
[ 21.827640] ata1.00: status: { DRDY ERR }
[ 21.827675] ata1.00: error: { UNC }
[ 24.284803] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 24.284838] ata1.00: BMDMA stat 0x65
[ 24.284874] ata1.00: cmd 25/00:a8:5f:13:c0/00:00:00:1c:00:00/e0 tag 0 dma 86016 in
[ 24.284875] res 51/40:00:eb:13:c0/40:00:00:1c:00:00/e0 Emask 0x9 (media error)
[ 24.284945] ata1.00: status: { DRDY ERR }
[ 24.284979] ata1.00: error: { UNC }
[ 24.419032] end_request: I/O error, dev sda, sector 482350059
[ 24.419264] JBD: Failed to read block at offset 32336
[ 24.419299] JBD: I/O error -5 recovering block 32336 in log
[ 24.419335] JBD: Failed to read block at offset 32337
[ 24.419369] JBD: I/O error -5 recovering block 32337 in log
[ 24.419405] JBD: Failed to read block at offset 32338
[ 24.419439] JBD: I/O error -5 recovering block 32338 in log
[ 24.419475] JBD: Failed to read block at offset 32339
[ 24.419509] JBD: I/O error -5 recovering block 32339 in log
[ 62.048847] EXT3-fs: error loading journal
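To make sense of that, the failing sector can be pulled out of the log line and turned into a rough offset. This snippet assumes the usual 512-byte sectors, but it puts the bad area well inside my 500GB /dev/sda - the primary drive, not the RAID members:

```shell
# The failing LBA reported by the kernel (512-byte sectors assumed)
line="end_request: I/O error, dev sda, sector 482350059"
sector=${line##*sector }                  # strip everything up to "sector "
offset_bytes=$(( sector * 512 ))
offset_gib=$(( offset_bytes / 1024 / 1024 / 1024 ))
echo "bad sector ${sector} sits ~${offset_gib} GiB into /dev/sda"
```

So despite the freeze happening mid-transfer to the array, the media errors are on the system disk.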
While digging through the logs, the system froze again. I waited another 20 minutes or so, periodically checking whether it would release itself... then rebooted the Neanderthal way and got a forced fsck:
/dev/sda1 contains a file system with errors, check forced.
/dev/sda1:
Inode 18326805 has illegal block(s).
/dev/sda1: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)
Fsck died with exit status 4
[fail]
* An automatic file system check (fsck) of the root filesystem failed.
A manual fsck must be performed, then the system restarted.
The fsck should be performed in maintenance mode with the
root filesystem mounted in read-only mode.
* The root filesystem is currently mounted in read-only mode.
A maintenance shell will now be started.
After performing system maintenance, press CONTROL-D
to terminate the maintenance shell and restart the system.
Give root password for maintenance
(or type Control-D to continue):

I went through the complete manual fsck; it found a TON of issues and repaired all of them. Now we're back...
Got a clean reboot and things appear to work, but there's a problem. Opening the system monitor, I only see a single drive there. Opening up GParted, I can see the nVidia RAID and the huge partition - BUT now it's showing that about 80% of the space is occupied??? All the files I've been porting from the other machine are visible and accessible, but I suspect they aren't getting placed where they need to be and that the RAID array is NOT working correctly. I need some assistance...
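If it helps anyone diagnose this, I'm assuming the right way to check whether the set is actually assembled is something along these lines (going by the dmraid man page; the exact set name under /dev/mapper depends on the controller) - happy to post the output:

```
# Show the RAID sets dmraid knows about and their status
sudo dmraid -s

# List the device-mapper nodes the set maps to
ls /dev/mapper/

# Confirm /data is mounted from the mapped device, not a raw member disk
mount | grep data
df -h /data
```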
Where did I go wrong? Do I need to try and remove the files I've moved from that other drive (gosh, I'm gonna have to go and buy another HD... just so I can get this data off of here)? Do I need to start all over?
da Lizard