mdadm RAID5 reports inactive after power outage

**agentofcode** · April 4th, 2013

Note to self:
hdparm -i /dev/sdb displays the drive's identification information, including its serial number.

I used this to read the drive's serial number, match it to the physical drive in the case, and work out which cable needed to be swapped.
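To check every drive at once instead of one at a time, here is a minimal sketch. It assumes hdparm -i prints its usual "Model=..., FwRev=..., SerialNo=..." line and that the drives appear as /dev/sd?. serial_from_hdparm and list_drive_serials are hypothetical helpers, not part of hdparm.

```shell
# Print the serial number from `hdparm -i` output read on stdin.
# Assumes hdparm's usual " Model=..., FwRev=..., SerialNo=..." line.
serial_from_hdparm() {
    sed -n 's/.*SerialNo=\([^ ,]*\).*/\1/p'
}

# Print "device serial" for every /dev/sd? block device (run as root).
list_drive_serials() {
    for dev in /dev/sd?; do
        [ -b "$dev" ] || continue
        printf '%s %s\n' "$dev" "$(hdparm -i "$dev" 2>/dev/null | serial_from_hdparm)"
    done
}
```

Run list_drive_serials as root. Without root, ls -l /dev/disk/by-id/ usually shows the same serials embedded in the ata-* link names, each pointing at its sdX device.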

**agentofcode** · April 4th, 2013

Ran smartctl -s on -t long /dev/sdb with the drive on a different SATA cable and port.

smartctl -a /dev/sdb returns:

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green (Adv. Format)
Device Model: WDC WD20EARS-00MVWB0
Serial Number: WD-WMAZA1970067
LU WWN Device Id: 5 0014ee 057d8ac55
Firmware Version: 51.0AB51
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Thu Apr 4 07:01:09 2013 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 116) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (39300) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 253 166 021 Pre-fail Always - 1050
4 Start_Stop_Count 0x0032 098 098 000 Old_age Always - 2312
5 Reallocated_Sector_Ct 0x0033 193 193 140 Pre-fail Always - 144
7 Seek_Error_Rate 0x002e 200 198 000 Old_age Always - 0
9 Power_On_Hours 0x0032 075 075 000 Old_age Always - 18429
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 165
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 102
193 Load_Cycle_Count 0x0032 189 189 000 Old_age Always - 33509
194 Temperature_Celsius 0x0022 121 108 000 Old_age Always - 29
196 Reallocated_Event_Count 0x0032 195 195 000 Old_age Always - 5
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 27
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 19
199 UDMA_CRC_Error_Count 0x0032 200 137 000 Old_age Always - 26837
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 19

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 40% 18424 2631941936
# 2 Extended offline Completed: read failure 40% 18375 2631941936
# 3 Extended offline Completed: read failure 40% 18349 2631941936
# 4 Extended offline Interrupted (host reset) 90% 18343 -
# 5 Extended offline Interrupted (host reset) 90% 18343 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

**agentofcode** · April 5th, 2013

Correct me if I'm wrong, but assuming I posted the most recent log, it looks like sdb is bad: all three completed extended self-tests end in a read failure at the same LBA (2631941936), and the drive reports 144 reallocated sectors, 27 pending sectors and 19 offline-uncorrectable sectors. If this is the same log version as before, I'm not sure how to select a different "SMART Error Log Version". I also just learned about the reliability problems with using Green drives in a RAID. Not enough research up front on my part; rookie mistake. I've since ordered replacement drives to build a more dependable RAID5.
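For a quicker read of the attribute table, the raw values worth watching here are 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable), which point at the disk surface, and 199 (UDMA_CRC_Error_Count), which usually points at the cable or port rather than the platters. The sketch below prints any of those with a non-zero raw value. The choice of IDs is a common rule of thumb, not an official failure criterion, and smart_warn_attrs is a hypothetical helper name.

```shell
# Print SMART attributes whose raw value is non-zero, for the IDs that
# usually signal trouble: 5 reallocated, 197 pending, 198 offline-uncorrectable
# sectors (media), and 199 UDMA CRC errors (usually cable/port).
# Reads `smartctl -A` / `smartctl -a` attribute rows on stdin.
smart_warn_attrs() {
    awk '$1 ~ /^(5|197|198|199)$/ && $10 + 0 > 0 { printf "%s %s raw=%s\n", $1, $2, $10 }'
}
```

Usage: smartctl -A /dev/sdb | smart_warn_attrs. The raw counters are cumulative, so after a cable swap the useful question is whether 199 keeps climbing, not whether it is zero.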

My goal now is to get this broken RAID5 working again so I can back up the data before setting up the new RAID5 with more dependable drives.

**agentofcode** · April 5th, 2013

Since I determined sdb is bad, I tried the following to reassemble the RAID without sdb:

mdadm --stop /dev/md0
mdadm --assemble --force /dev/md0 /dev/sd[cd]

Returns:

mdadm: Cannot assemble mbr metadata on /dev/sdc
mdadm: /dev/sdc has no superblock - assembly aborted

Where do I go from here?

**rubylaser** · April 5th, 2013

You need to assemble using the partitions on those disks rather than the whole disks, like this:

Code:

mdadm --assemble --force /dev/md0 /dev/sd[cd]1
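The RAID superblocks live on the member partitions (sdc1, sdd1), which is why pointing mdadm at the bare disks fails. One rough way to list candidate member partitions from an MBR-style fdisk -l listing is to filter for partition type fd (Linux raid autodetect). raid_partitions_from_fdisk is a hypothetical name, and the type code is only a label, so mdadm --examine on each partition remains the real check.

```shell
# Print partitions that fdisk -l lists with type "fd" (Linux raid autodetect).
# The type code is only a label; mdadm --examine is the real test.
raid_partitions_from_fdisk() {
    awk '$1 ~ /^\/dev\// {
        for (i = 2; i < NF; i++)
            if ($i == "fd") { print $1; break }
    }'
}
```

Usage: fdisk -l 2>/dev/null | raid_partitions_from_fdisk, then mdadm --examine on each result. Applied to the fdisk -l output posted later in this thread, it would print /dev/sdc1 and /dev/sdd1.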

**agentofcode** · April 6th, 2013

Just tried mdadm --assemble --force /dev/md0 /dev/sd[bcd]1

Returns:

mdadm: /dev/md0 has been started with 2 drives (out of 3) and 1 rebuilding.

Why would sdb1 start working again? Doesn't the SMART output say it's bad?
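SMART and md look at different things. md only reads the RAID superblock on each member, and a disk with some unreadable sectors still answers most reads, so mdadm has no reason to reject sdb1 at assembly time. The "1 rebuilding" in the message suggests mdadm treated sdb1's contents as out of date and started resyncing it from the other two. One way to see how far a member lags is to compare the Events counter in each superblock. The sketch assumes the "Events : N" line that mdadm --examine prints for version 1.x metadata; examine_events is a hypothetical helper.

```shell
# Print "device events" for each member in `mdadm --examine` output on stdin.
# Assumes each device section starts with a "/dev/...:" header line and
# contains an "Events : N" line (as printed for 1.x metadata).
examine_events() {
    awk '/^\/dev\/.*:$/ { dev = substr($0, 1, length($0) - 1) }
         $1 == "Events" { print dev, $3 }'
}
```

Usage: mdadm --examine /dev/sd[bcd]1 | examine_events. A member whose count is well behind the others holds stale data.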

**agentofcode** · April 6th, 2013

cat /proc/mdstat

Returns:

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdb1[3] sdd1[2] sdc1[1]
3899417600 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
[=>...................] recovery = 7.6% (149796000/1949708800) finish=258.9min speed=115848K/sec

unused devices: <none>
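Reading this output: [3/2] means the array expects 3 members and has 2 active, and in [_UU] each U is a working member while the _ marks the missing slot that sdb1 is being rebuilt into. The sketch below pulls out that state and the recovery percentage. It parses the /proc/mdstat text layout shown above, which is meant for people rather than scripts, so mdadm --detail /dev/md0 is the sturdier source. mdstat_summary is a hypothetical helper name.

```shell
# Summarise /proc/mdstat-style text read from stdin: one line per array with
# its active/expected member count, plus the progress of any recovery/resync.
# Hypothetical helper; /proc/mdstat is a human-readable format, not a stable API.
mdstat_summary() {
    awk '
        # Array header, e.g. "md0 : active raid5 sdb1[3] sdd1[2] sdc1[1]"
        /^md[0-9]+ :/ { md = $1 }

        # Status line containing "[expected/active]", e.g. "[3/2] [_UU]"
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            state = (n[2] + 0 < n[1] + 0) ? "degraded" : "ok"
            printf "%s %s %s/%s\n", md, state, n[2], n[1]
        }

        # Progress line, e.g. "recovery =  7.6% (...)"
        /(recovery|resync) =/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "recovery" || $i == "resync") op = $i
                if ($i ~ /%$/) { print md, op, $i; break }
            }
        }
    '
}
```

Usage: mdstat_summary < /proc/mdstat. For the output above it prints md0 degraded 2/3 followed by md0 recovery 7.6%.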

UPDATE: 4/5/13, 10:51 PM EST
The array came back with sdb1[F], which I understand means the drive has been marked as failed. So, as recommended, I stopped the array and started it again with only sdc1 and sdd1.
The RAID is now assembled but still not accessible.

**agentofcode** · April 6th, 2013

When I open a terminal, I see this message:

Could not chdir to home directory /home/username: No such file or directory

And when I try to access my files through Webmin, the home directory is empty. It seems the RAID5 is no longer attached to the directory it used to be mounted on.
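The message means /home/username does not exist at the moment, which is what you see when the array is not mounted on /home: the path falls through to the empty /home directory on the root filesystem. mountpoint -q /home (from util-linux) answers the question directly; the sketch below does the same from /proc/mounts and also prints which device is mounted there. mounted_on is a hypothetical helper name.

```shell
# Print the device mounted on directory $1, reading /proc/mounts-style lines
# ("device mountpoint fstype options dump pass") from stdin.
# Exits non-zero when nothing is mounted there. Hypothetical helper name.
mounted_on() {
    awk -v mp="$1" '$2 == mp { print $1; found = 1 } END { exit !found }'
}
```

Usage: mounted_on /home < /proc/mounts || echo "/home is not a separate mount"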

**agentofcode** · April 6th, 2013

fdisk -l returns:

Disk /dev/sda: 60.0 GB, 60021399040 bytes
255 heads, 63 sectors/track, 7297 cylinders, total 117229295 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x0001d60b

Device Boot Start End Blocks Id System
/dev/sda1 * 2048 109893631 54945792 83 Linux
/dev/sda2 109895678 117227519 3665921 5 Extended
/dev/sda5 109895680 117227519 3665920 82 Linux swap / Solaris

Disk /dev/sdc: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders, total 3907029168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000edfb2

Device Boot Start End Blocks Id System
/dev/sdc1 2048 3899691007 1949844480 fd Linux raid autodetect
/dev/sdc2 3899693054 3907028991 3667969 5 Extended
/dev/sdc5 3899693056 3907028991 3667968 82 Linux swap / Solaris

Disk /dev/sdd: 2000.4 GB, 2000394706432 bytes
255 heads, 63 sectors/track, 243200 cylinders, total 3907020911 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000740d4

Device Boot Start End Blocks Id System
/dev/sdd1 2048 3899682815 1949840384 fd Linux raid autodetect
/dev/sdd2 3899684862 3907018751 3666945 5 Extended
/dev/sdd5 3899684864 3907018751 3666944 82 Linux swap / Solaris

Disk /dev/md0: 3993.0 GB, 3993003622400 bytes
2 heads, 4 sectors/track, 974854400 cylinders, total 7798835200 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 524288 bytes / 1048576 bytes
Disk identifier: 0x00000000

Disk /dev/md0 doesn't contain a valid partition table
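Two things in this listing look alarming but are expected. The "doesn't contain a valid partition table" line for /dev/md0 only means the filesystem was created directly on the array rather than in a partition on it, which fits with mounting /dev/md0 itself working later in the thread. And the reported size fits a three-member RAID5, where one member's worth of space holds parity: usable = (members - 1) * per-member size. The sketch below does that arithmetic; raid5_capacity_kib is a hypothetical name, and the per-member figure of 1949708800 KiB is taken from the recovery line in /proc/mdstat.

```shell
# RAID5 usable capacity: one member's worth of space goes to parity,
# so usable = (members - 1) * per-member data size.
# Sizes are in KiB, as /proc/mdstat reports them. Hypothetical helper name.
raid5_capacity_kib() {
    echo $(( ($1 - 1) * $2 ))
}
```

raid5_capacity_kib 3 1949708800 prints 3899417600, the block count /proc/mdstat shows for md0, and 3899417600 * 1024 = 3993003622400 bytes, exactly the size fdisk reports for /dev/md0.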

**agentofcode** · April 6th, 2013

I went back to the original solved thread I mentioned in my first post and found that I needed to mount the array:

Code:

mount /dev/md0 /home

That did the trick.
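mount /dev/md0 /home only lasts until the next reboot. If the array and /home do not come back on their own after a restart, the usual fix on Ubuntu/Debian is to record the array in /etc/mdadm/mdadm.conf (mdadm --detail --scan prints the ARRAY line to add; then run update-initramfs -u) and check that /etc/fstab has a /home entry. A minimal sketch of the fstab part follows; add_home_to_fstab is a hypothetical helper, and the UUID= form and the "defaults 0 2" options are assumptions rather than anything from this thread.

```shell
# Append a /home entry for the array to an fstab-format file, unless an
# uncommented /home entry is already present.
#   $1 = fstab path, $2 = filesystem UUID, $3 = filesystem type
# UUID and type are meant to come from: blkid -o value -s UUID /dev/md0
# and blkid -o value -s TYPE /dev/md0. Options "defaults 0 2" are an assumption.
add_home_to_fstab() {
    if awk '$1 !~ /^#/ && $2 == "/home" { found = 1 } END { exit !found }' "$1"; then
        echo "/home already listed in $1" >&2
        return 0
    fi
    printf 'UUID=%s /home %s defaults 0 2\n' "$2" "$3" >> "$1"
}
```

Usage: add_home_to_fstab /etc/fstab "$(blkid -o value -s UUID /dev/md0)" "$(blkid -o value -s TYPE /dev/md0)", ideally against a copy of fstab first, then mount -a to confirm it mounts without errors.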

I should be fine until my new drives arrive and I can back up my data and rebuild a more dependable RAID. Thanks!

P.S. I plan to follow your APC UPS tutorial as well, to keep this from ever happening again.