
Thread: RAID failure

  1. #11
    Join Date
    Oct 2009
    Beans
    Hidden!
    Distro
    Ubuntu 16.04 Xenial Xerus

    Re: RAID failure

    Can you post the output of this again?
    Code:
    sudo blkid -c /dev/null
    I would suggest letting it rebuild, but if the drives keep dropping off, I would look into replacing the drives.
    Come to #ubuntuforums! We have cookies! | Basic Ubuntu Security Guide

    Tomorrow's an illusion and yesterday's a dream, today is a solution...

  2. #12
    Join Date
    Aug 2010
    Beans
    16

    Re: RAID failure

    Code:
    /dev/sda1: UUID="15a6baa5-9692-4f56-8654-e715d4d1d4d4" TYPE="ext4" 
    /dev/sda5: UUID="dea126ee-1593-4fb1-99bf-8a3aad94c1aa" TYPE="swap" 
    /dev/sdb1: UUID="8e5ee34c-1a0f-f976-b8d6-3fee79dcf685" UUID_SUB="dd41d054-dbe4-1263-961d-170291703ade" LABEL="MediaServer:0" TYPE="linux_raid_member" 
    /dev/sdc1: UUID="8e5ee34c-1a0f-f976-b8d6-3fee79dcf685" UUID_SUB="bd31d8ac-3ab9-d88b-3797-e5f7df0bc6a7" LABEL="MediaServer:0" TYPE="linux_raid_member" 
    /dev/sdd1: UUID="8e5ee34c-1a0f-f976-b8d6-3fee79dcf685" UUID_SUB="8eb5ff9c-6edf-5f2c-d2e0-271e3e9d62a5" LABEL="MediaServer:0" TYPE="linux_raid_member" 
    /dev/md0p1: UUID="082e57b7-e658-4922-b7c7-46f15981e9b6" TYPE="ext4"
    I also tried to --re-add sdc1 to the array but got the response "mdadm: --re-add for /dev/sdc to /dev/md0 is not possible".

    Looking at the mdadm --examine output for all the drives, sdc1 thinks it is part of the array, but none of the other members agree.

    And I really hope I don't have to replace the drives; I only bought them a month ago.
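
One way to see which member has fallen out of step is to compare the event counter and array state that each member records in its superblock. The sketch below assumes the member names from the blkid output above, and that mdadm --examine prints a "/dev/..." header plus "Events :" and "Array State :" lines for each device, as it does for version 1.x metadata. md_summary is a hypothetical helper name, not part of mdadm.

```shell
# Hypothetical helper: condense `mdadm --examine` output (read from stdin)
# to the device header, event counter and array-state lines.
md_summary() {
    grep -E '^/dev/|^ *Events :|^ *Array State :'
}

# Assumed usage (needs root; device names taken from the blkid output above):
#   sudo mdadm --examine /dev/sdb1 /dev/sdc1 /dev/sdd1 | md_summary
```

A member whose event count lags behind the others is typically the one that was kicked out, which would fit sdc1 disagreeing with sdb1 and sdd1 about the state of the array.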

  3. #13
    Join Date
    Oct 2009
    Beans
    Hidden!
    Distro
    Ubuntu 16.04 Xenial Xerus

    Re: RAID failure

    You might have to zero out the superblock on sdc1 and add it to the array again, but I think you should wait to hear back from rubylaser before doing anything.

  4. #14
    Join Date
    Jul 2010
    Location
    Michigan, USA
    Beans
    2,133
    Distro
    Ubuntu 16.04 Xenial Xerus

    Re: RAID failure

    You should do as CharlesA suggests. Zero /dev/sdc1's superblock, and then add it to the array.

    Code:
    sudo mdadm --zero-superblock /dev/sdc1
    sudo mdadm --manage /dev/md0 --add /dev/sdc1
    You can view the sync process like this.
    Code:
    watch cat /proc/mdstat
    Also, this is normally an indicator that a disk is failing. What does the SMART data look like on /dev/sdc?
    Code:
    sudo apt-get install smartmontools
    sudo smartctl -a /dev/sdc

  5. #15
    Join Date
    Aug 2010
    Beans
    16

    Re: RAID failure

    Well, thank you both for your help. The RAID is back in one piece and has been working as expected for the past few days.

    As requested, here is the smartctl output.

    Code:
    smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.8.0-29-generic] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
    
    === START OF INFORMATION SECTION ===
    Device Model:     TOSHIBA DT01ACA200
    Serial Number:    83CWVNUKS
    LU WWN Device Id: 5 000039 ff3dac1a5
    Firmware Version: MX4OABB0
    User Capacity:    2,000,398,934,016 bytes [2.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   8
    ATA Standard is:  ATA-8-ACS revision 4
    Local Time is:    Tue Dec 31 08:55:18 2013 EST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x84)    Offline data collection activity
                        was suspended by an interrupting command from host.
                        Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0)    The previous self-test routine completed
                        without error or no self-test has ever 
                        been run.
    Total time to complete Offline 
    data collection:         (15396) seconds.
    Offline data collection
    capabilities:              (0x5b) SMART execute Offline immediate.
                        Auto Offline data collection on/off support.
                        Suspend Offline collection upon new
                        command.
                        Offline surface scan supported.
                        Self-test supported.
                        No Conveyance Self-test supported.
                        Selective Self-test supported.
    SMART capabilities:            (0x0003)    Saves SMART data before entering
                        power-saving mode.
                        Supports SMART auto save timer.
    Error logging capability:        (0x01)    Error logging supported.
                        General Purpose Logging supported.
    Short self-test routine 
    recommended polling time:      (   1) minutes.
    Extended self-test routine
    recommended polling time:      ( 255) minutes.
    SCT capabilities:            (0x003d)    SCT Status supported.
                        SCT Error Recovery Control supported.
                        SCT Feature Control supported.
                        SCT Data Table supported.
    
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0005   140   140   054    Pre-fail  Offline      -       68
      3 Spin_Up_Time            0x0007   131   131   024    Pre-fail  Always       -       292 (Average 286)
      4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       32
      5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
      8 Seek_Time_Performance   0x0005   124   124   020    Pre-fail  Offline      -       33
      9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       102
     10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       32
    193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       32
    194 Temperature_Celsius     0x0002   250   250   000    Old_age   Always       -       24 (Min/Max 19/55)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       5
    
    SMART Error Log Version: 1
    ATA Error Count: 5
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.
    
    Error 5 occurred at disk power-on lifetime: 97 hours (4 days + 1 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 42 ee b0 2b 0b  Error: ICRC, ABRT at LBA = 0x0b2bb0ee = 187412718
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      61 00 00 30 b0 2b 40 00      19:23:24.163  WRITE FPDMA QUEUED
      61 00 00 30 ac 2b 40 00      19:23:24.158  WRITE FPDMA QUEUED
      61 00 00 30 a8 2b 40 00      19:23:24.150  WRITE FPDMA QUEUED
      61 00 00 30 a4 2b 40 00      19:23:24.146  WRITE FPDMA QUEUED
      61 00 00 30 a0 2b 40 00      19:23:24.138  WRITE FPDMA QUEUED
    
    Error 4 occurred at disk power-on lifetime: 95 hours (3 days + 23 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 c2 a6 9c dc 01  Error: ICRC, ABRT at LBA = 0x01dc9ca6 = 31235238
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      61 00 00 68 9b dc 40 00      17:49:36.997  WRITE FPDMA QUEUED
      61 00 00 68 97 dc 40 00      17:49:36.990  WRITE FPDMA QUEUED
      61 00 00 68 93 dc 40 00      17:49:36.985  WRITE FPDMA QUEUED
      61 00 00 68 8f dc 40 00      17:49:36.978  WRITE FPDMA QUEUED
      61 00 00 68 8b dc 40 00      17:49:36.973  WRITE FPDMA QUEUED
    
    Error 3 occurred at disk power-on lifetime: 95 hours (3 days + 23 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 97 b9 56 6f 0a  Error: ICRC, ABRT at LBA = 0x0a6f56b9 = 175068857
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      61 00 97 b9 56 6f 4a ff      17:37:27.445  WRITE FPDMA QUEUED
      61 00 00 50 53 6f 40 00      17:37:27.434  WRITE FPDMA QUEUED
      61 00 00 50 4f 6f 40 00      17:37:27.426  WRITE FPDMA QUEUED
      61 00 00 50 4b 6f 40 00      17:37:27.422  WRITE FPDMA QUEUED
      61 00 00 50 47 6f 40 00      17:37:27.414  WRITE FPDMA QUEUED
    
    Error 2 occurred at disk power-on lifetime: 95 hours (3 days + 23 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 01 1f aa 39 08  Error: ICRC, ABRT at LBA = 0x0839aa1f = 137996831
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      61 00 00 20 a7 39 40 00      17:07:39.791  WRITE FPDMA QUEUED
      61 00 00 20 a3 39 40 00      17:07:39.787  WRITE FPDMA QUEUED
      61 00 00 20 9f 39 40 00      17:07:39.779  WRITE FPDMA QUEUED
      61 00 00 20 9b 39 40 00      17:07:39.775  WRITE FPDMA QUEUED
      61 00 00 20 97 39 40 00      17:07:39.768  WRITE FPDMA QUEUED
    
    Error 1 occurred at disk power-on lifetime: 93 hours (3 days + 21 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 f1 57 12 7d 05  Error: ICRC, ABRT at LBA = 0x057d1257 = 92082775
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      61 00 00 48 0f 7d 40 00      14:52:21.139  WRITE FPDMA QUEUED
      61 00 00 48 0b 7d 40 00      14:52:21.131  WRITE FPDMA QUEUED
      61 00 00 48 07 7d 40 00      14:52:21.127  WRITE FPDMA QUEUED
      61 00 00 48 03 7d 40 00      14:52:21.119  WRITE FPDMA QUEUED
      61 00 00 48 ff 7c 40 00      14:52:21.115  WRITE FPDMA QUEUED
    
    SMART Self-test log structure revision number 1
    No self-tests have been logged.  [To run self-tests, use: smartctl -t]
    
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
    I'll wait for your advice on this drive, but for argument's sake here are the other two drives in the array as well.

    SDB
    Code:
    smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.8.0-29-generic] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
    
    === START OF INFORMATION SECTION ===
    Device Model:     TOSHIBA DT01ACA200
    Serial Number:    83CWVNPKS
    LU WWN Device Id: 5 000039 ff3dac1a1
    Firmware Version: MX4OABB0
    User Capacity:    2,000,398,934,016 bytes [2.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   8
    ATA Standard is:  ATA-8-ACS revision 4
    Local Time is:    Tue Dec 31 08:58:20 2013 EST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x84)    Offline data collection activity
                        was suspended by an interrupting command from host.
                        Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0)    The previous self-test routine completed
                        without error or no self-test has ever 
                        been run.
    Total time to complete Offline 
    data collection:         (15013) seconds.
    Offline data collection
    capabilities:              (0x5b) SMART execute Offline immediate.
                        Auto Offline data collection on/off support.
                        Suspend Offline collection upon new
                        command.
                        Offline surface scan supported.
                        Self-test supported.
                        No Conveyance Self-test supported.
                        Selective Self-test supported.
    SMART capabilities:            (0x0003)    Saves SMART data before entering
                        power-saving mode.
                        Supports SMART auto save timer.
    Error logging capability:        (0x01)    Error logging supported.
                        General Purpose Logging supported.
    Short self-test routine 
    recommended polling time:      (   1) minutes.
    Extended self-test routine
    recommended polling time:      ( 251) minutes.
    SCT capabilities:            (0x003d)    SCT Status supported.
                        SCT Error Recovery Control supported.
                        SCT Feature Control supported.
                        SCT Data Table supported.
    
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0005   140   140   054    Pre-fail  Offline      -       68
      3 Spin_Up_Time            0x0007   132   132   024    Pre-fail  Always       -       286 (Average 286)
      4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       32
      5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
      8 Seek_Time_Performance   0x0005   124   124   020    Pre-fail  Offline      -       33
      9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       102
     10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       32
    193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       32
    194 Temperature_Celsius     0x0002   230   230   000    Old_age   Always       -       26 (Min/Max 19/58)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
    
    SMART Error Log Version: 1
    No Errors Logged
    
    SMART Self-test log structure revision number 1
    No self-tests have been logged.  [To run self-tests, use: smartctl -t]
    
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
    SDD
    Code:
    smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.8.0-29-generic] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
    
    === START OF INFORMATION SECTION ===
    Device Model:     TOSHIBA DT01ACA200
    Serial Number:    83CWXZEKS
    LU WWN Device Id: 5 000039 ff3daca51
    Firmware Version: MX4OABB0
    User Capacity:    2,000,398,934,016 bytes [2.00 TB]
    Sector Sizes:     512 bytes logical, 4096 bytes physical
    Device is:        Not in smartctl database [for details use: -P showall]
    ATA Version is:   8
    ATA Standard is:  ATA-8-ACS revision 4
    Local Time is:    Tue Dec 31 08:59:33 2013 EST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    General SMART Values:
    Offline data collection status:  (0x84)    Offline data collection activity
                        was suspended by an interrupting command from host.
                        Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0)    The previous self-test routine completed
                        without error or no self-test has ever 
                        been run.
    Total time to complete Offline 
    data collection:         (14726) seconds.
    Offline data collection
    capabilities:              (0x5b) SMART execute Offline immediate.
                        Auto Offline data collection on/off support.
                        Suspend Offline collection upon new
                        command.
                        Offline surface scan supported.
                        Self-test supported.
                        No Conveyance Self-test supported.
                        Selective Self-test supported.
    SMART capabilities:            (0x0003)    Saves SMART data before entering
                        power-saving mode.
                        Supports SMART auto save timer.
    Error logging capability:        (0x01)    Error logging supported.
                        General Purpose Logging supported.
    Short self-test routine 
    recommended polling time:      (   1) minutes.
    Extended self-test routine
    recommended polling time:      ( 246) minutes.
    SCT capabilities:            (0x003d)    SCT Status supported.
                        SCT Error Recovery Control supported.
                        SCT Feature Control supported.
                        SCT Data Table supported.
    
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0005   140   140   054    Pre-fail  Offline      -       68
      3 Spin_Up_Time            0x0007   134   134   024    Pre-fail  Always       -       286 (Average 278)
      4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       32
      5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
      8 Seek_Time_Performance   0x0005   124   124   020    Pre-fail  Offline      -       33
      9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       102
     10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       32
    193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       32
    194 Temperature_Celsius     0x0002   230   230   000    Old_age   Always       -       26 (Min/Max 19/48)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
    
    SMART Error Log Version: 1
    No Errors Logged
    
    SMART Self-test log structure revision number 1
    No self-tests have been logged.  [To run self-tests, use: smartctl -t]
    
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
    I haven't yet tried to do a large data copy to the array, as that was what I was doing when the array failed. If you can't shed any light on it, I'll just go ahead and continue the copy.

    Again, thank you so much for your help.

  6. #16
    Join Date
    Jul 2010
    Location
    Michigan, USA
    Beans
    2,133
    Distro
    Ubuntu 16.04 Xenial Xerus

    Re: RAID failure

    Glad you got it working. The one thing that pops out is the UDMA_CRC_Error_Count on the first disk (/dev/sdc). It should be zero, and a non-zero value usually points to a bad SATA cable or a bad SATA port on the motherboard. I would swap the cable for that disk. Other than that your disks look okay, but it doesn't look like they have ever had a short or long self-test run on them. I would look into setting up SMART to monitor your disks, and set up email so that mdadm can notify you of problems in the future. Those two tutorials are how I do it.
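
To keep an eye on that counter after swapping the cable, the raw value can be pulled out of the attribute table. This is a minimal sketch that assumes the ten-column attribute layout shown in the smartctl output earlier in the thread; smart_raw is a hypothetical helper name, not part of smartmontools.

```shell
# Hypothetical helper: print the RAW_VALUE column for one attribute ID,
# reading `smartctl -A` output from stdin.
smart_raw() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# Assumed usage (needs root):
#   sudo smartctl -A /dev/sdc | smart_raw 199    # UDMA_CRC_Error_Count
#   sudo smartctl -t short /dev/sdc              # start a short self-test
#   sudo smartctl -l selftest /dev/sdc           # read the self-test log afterwards
```

As far as I know the CRC counter is cumulative and does not reset, so what matters is whether it keeps climbing above 5 once the cable has been replaced.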

  7. #17
    Join Date
    Oct 2009
    Beans
    Hidden!
    Distro
    Ubuntu 16.04 Xenial Xerus

    Re: RAID failure

    Originally posted by rubylaser:
    Glad you got it working. The one thing that pops out is the UDMA_CRC_Error_Count on the first disk (/dev/sdc). It should be zero, and a non-zero value usually points to a bad SATA cable or a bad SATA port on the motherboard. I would swap the cable for that disk. Other than that your disks look okay, but it doesn't look like they have ever had a short or long self-test run on them. I would look into setting up SMART to monitor your disks, and set up email so that mdadm can notify you of problems in the future. Those two tutorials are how I do it.
    Just giving a +1 to those two tutorials. I had a nasty time trying to get smartd to detect the drives in my MegaRAID setup, but that was due to the version of smartmontools in the repos, so I just run everything manually via cron.

  8. #18
    Join Date
    Aug 2010
    Beans
    16

    Re: RAID failure

    So it turns out that everything is not solved.

    I left the server running for a while with no issues. Eventually I decided it was stable enough to try adding more files, so I selected everything that didn't make it over to the array the first time and started the copy.

    To start with, all was good and everything copied fine. I went away to make dinner, and suddenly I had an email from my server telling me it had fallen in a great huge heap again.

    I followed the same steps and resynced the array, and it was all good: files still intact and so on.

    Is there a bug in the RAID software to do with large files that I don't know about, or is there something I am doing wrong?

    I'm about to try copying some stuff over again, but in smaller chunks, and I'm going to copy it from the local hard drive to the array rather than from the USB hard drive to the array to see if that makes a difference. If anyone can think of anything else to try, I'm all ears.

  9. #19
    Join Date
    Oct 2009
    Beans
    Hidden!
    Distro
    Ubuntu 16.04 Xenial Xerus

    Re: RAID failure

    Unsure. Are you copying files across the network or locally?

  10. #20
    Join Date
    Aug 2010
    Beans
    16

    Re: RAID failure

    Originally posted by CharlesA:
    Unsure. Are you copying files across the network or locally?
    All the copying is done locally.

    To add to the confusion, I've got another report.
    I copied about 60 GB of files in 3 groups of about 20 GB last night, all good. I used the server for several hours last night without issue.
    I used it this morning to watch media on my tablet via Plex, no issues.
    A download finished this morning as I was walking out of the door, and my tablet's email alert went off with an email from the server indicating a Fail event, followed almost straight away by a second email for a FailSpare event (note that I do not actually have a spare).

    I'm decidedly confused. My understanding is that Ubuntu software RAID should be more reliable than this.
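
When a member drops out like this, the kernel log from around the time of the email usually shows whether the SATA link reset before md marked the disk as failed. Below is a rough filter, assuming the usual libata and md message wording, which varies between kernel versions. link_trouble is a hypothetical helper name.

```shell
# Hypothetical helper: keep kernel log lines (read from stdin) that mention
# ATA link problems or md failing a member. The patterns are assumptions
# about typical message wording, not an exhaustive list.
link_trouble() {
    grep -iE 'ata[0-9]+(\.[0-9]+)?: .*(exception|hard resetting link|failed command)|md.*disk failure'
}

# Assumed usage:
#   dmesg | link_trouble
#   sudo mdadm --detail /dev/md0    # shows which member is currently marked faulty
```

If link resets show up just before the failure, that would point toward the cable, port or power for that disk rather than toward the md software itself.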
