
Thread: Help with interpreting SMART data

  1. #21
    Join Date
    Jul 2010
    Location
    Michigan, USA
    Beans
    2,132
    Distro
    Ubuntu 14.04 Trusty Tahr

    Re: Help with interpreting SMART data

    Yeah, for roughly $120 you are not going to be able to get a great performing hardware RAID card, unless you are buying used. There is no cache on your card, so a BBU won't really help because there are no cached writes to protect.

  2. #22
    Join Date
    Oct 2009
    Beans
    Hidden!
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: Help with interpreting SMART data

    Quote Originally Posted by rubylaser
    Yeah, for roughly $120 you are not going to be able to get a great performing hardware RAID card, unless you are buying used. There is no cache on your card, so a BBU won't really help because there are no cached writes to protect.
    Funny, the manual mentioned stuff about recommending a BBU, but I guess that is just to cover themselves in case someone loses data after a power failure.

    Thanks for the info. It's surprising how much you learn over the years.
    Come to #ubuntuforums! We have cookies! | Basic Ubuntu Security Guide

    Tomorrow's an illusion and yesterday's a dream, today is a solution...

  3. #23
    Join Date
    Oct 2009
    Beans
    Hidden!
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: Help with interpreting SMART data

    Bumping this because I just had my first encounter with 9 (!) UREs, around 12 days after I did an extended offline test of that drive. I was doing a consistency check on the newly built RAID6 array when I got back-to-back notifications about them. I guess it does pay to get a good card if you are going to run hardware RAID.
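    The "around 12 days" figure can be cross-checked against the power-on-hour stamps in the smartctl output further down in this post. This is only a rough sketch: it assumes the server ran continuously between the self-test and the errors, so that power-on hours track wall-clock time.

```python
# Power-on-hour stamps copied from the smartctl output in this post.
EXTENDED_TEST_HOURS = 32535  # self-test log entry #1, LifeTime(hours)
ERROR_HOURS = 32839          # "Error 9 occurred at disk power-on lifetime"


def hours_to_days(hours):
    """Split a power-on-hours count into (whole days, leftover hours)."""
    return divmod(hours, 24)


gap_hours = ERROR_HOURS - EXTENDED_TEST_HOURS
gap_days, gap_rem = hours_to_days(gap_hours)
life_days, life_rem = hours_to_days(ERROR_HOURS)

print(f"Gap since extended test: {gap_hours} h = {gap_days} d + {gap_rem} h")
print(f"Drive lifetime at error: {life_days} d + {life_rem} h")
```

    The lifetime split matches the "1368 days + 7 hours" that smartctl prints next to each error.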

    The only indication that the drive was having problems was a few bad sectors from a year or more ago. No raw read errors or anything. I have a feeling that might have been what caused the rebuild a couple of weeks back, but since I have no notification of what actually caused the rebuild other than the "data is not consistent" error, I guess I'll never know.

    Here's the current SMART data. Don't mind the 45C temp reading, as the server is sitting in my room and it's hot as hell in here.

    Code:
    smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.32-20-pve] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
    
    /dev/sda [megaraid_disk_16] [SAT]: Device open changed type from 'megaraid' to 'sat'
    === START OF INFORMATION SECTION ===
    Model Family:     Hitachi Deskstar 7K2000
    Device Model:     Hitachi HDS722020ALA330
    Serial Number:    JK1120YAG26EUP
    LU WWN Device Id: 5 000cca 221c100f0
    Firmware Version: JKAOA20N
    User Capacity:    2,000,398,934,016 bytes [2.00 TB]
    Sector Size:      512 bytes logical/physical
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   8
    ATA Standard is:  ATA-8-ACS revision 4
    Local Time is:    Sun Jul  7 22:13:54 2013 PDT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    
    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    Warning: This result is based on an Attribute check.
    
    General SMART Values:
    Offline data collection status:  (0x84)	Offline data collection activity
    					was suspended by an interrupting command from host.
    					Auto Offline Data Collection: Enabled.
    Self-test execution status:      (   0)	The previous self-test routine completed
    					without error or no self-test has ever 
    					been run.
    Total time to complete Offline 
    data collection: 		(22477) seconds.
    Offline data collection
    capabilities: 			 (0x5b) SMART execute Offline immediate.
    					Auto Offline data collection on/off support.
    					Suspend Offline collection upon new
    					command.
    					Offline surface scan supported.
    					Self-test supported.
    					No Conveyance Self-test supported.
    					Selective Self-test supported.
    SMART capabilities:            (0x0003)	Saves SMART data before entering
    					power-saving mode.
    					Supports SMART auto save timer.
    Error logging capability:        (0x01)	Error logging supported.
    					General Purpose Logging supported.
    Short self-test routine 
    recommended polling time: 	 (   1) minutes.
    Extended self-test routine
    recommended polling time: 	 ( 255) minutes.
    SCT capabilities: 	       (0x003d)	SCT Status supported.
    					SCT Error Recovery Control supported.
    					SCT Feature Control supported.
    					SCT Data Table supported.
    
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0005   133   133   054    Pre-fail  Offline      -       101
      3 Spin_Up_Time            0x0007   127   127   024    Pre-fail  Always       -       621 (Average 513)
      4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       195
      5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       3
      7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
      8 Seek_Time_Performance   0x0005   112   112   020    Pre-fail  Offline      -       39
      9 Power_On_Hours          0x0012   096   096   000    Old_age   Always       -       32840
     10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       195
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1013
    193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       1013
    194 Temperature_Celsius     0x0002   125   125   000    Old_age   Always       -       48 (Min/Max 22/54)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       4
    197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
    
    SMART Error Log Version: 1
    ATA Error Count: 9 (device log contains only the most recent five errors)
    	CR = Command Register [HEX]
    	FR = Features Register [HEX]
    	SC = Sector Count Register [HEX]
    	SN = Sector Number Register [HEX]
    	CL = Cylinder Low Register [HEX]
    	CH = Cylinder High Register [HEX]
    	DH = Device/Head Register [HEX]
    	DC = Device Command Register [HEX]
    	ER = Error register [HEX]
    	ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.
    
    Error 9 occurred at disk power-on lifetime: 32839 hours (1368 days + 7 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 02 1d 2c ee 00  Error: UNC at LBA = 0x00ee2c1d = 15608861
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 02 00 1d 2c ee 40 00      06:16:35.124  READ FPDMA QUEUED
      61 00 80 00 a6 b0 40 00      06:16:32.935  WRITE FPDMA QUEUED
      61 00 78 00 a4 b0 40 00      06:16:32.935  WRITE FPDMA QUEUED
      61 00 70 00 a2 b0 40 00      06:16:32.934  WRITE FPDMA QUEUED
      61 00 68 00 a0 b0 40 00      06:16:32.934  WRITE FPDMA QUEUED
    
    Error 8 occurred at disk power-on lifetime: 32839 hours (1368 days + 7 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 e1 1f 2c ee 00  Error: WP at LBA = 0x00ee2c1f = 15608863
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      61 00 78 00 84 b0 40 00      06:16:12.165  WRITE FPDMA QUEUED
      61 00 f8 00 82 b0 40 00      06:16:12.158  WRITE FPDMA QUEUED
      61 00 70 00 80 b0 40 00      06:16:12.151  WRITE FPDMA QUEUED
      61 00 68 00 7e b0 40 00      06:16:12.144  WRITE FPDMA QUEUED
      61 00 60 00 7c b0 40 00      06:16:12.133  WRITE FPDMA QUEUED
    
    Error 7 occurred at disk power-on lifetime: 32839 hours (1368 days + 7 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 e4 1c 2c ee 00  Error: WP at LBA = 0x00ee2c1c = 15608860
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      61 00 a0 00 04 b0 40 00      06:15:52.662  WRITE FPDMA QUEUED
      61 00 98 00 02 b0 40 00      06:15:52.648  WRITE FPDMA QUEUED
      61 00 90 00 00 b0 40 00      06:15:52.633  WRITE FPDMA QUEUED
      61 00 88 00 fe af 40 00      06:15:52.612  WRITE FPDMA QUEUED
      61 00 78 00 fc af 40 00      06:15:52.598  WRITE FPDMA QUEUED
    
    Error 6 occurred at disk power-on lifetime: 32839 hours (1368 days + 7 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 e5 1b 2c ee 00  Error: UNC at LBA = 0x00ee2c1b = 15608859
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 e6 00 1a 2c ee 40 00      06:15:32.604  READ FPDMA QUEUED
      60 d0 00 00 44 d7 40 00      06:15:32.486  READ FPDMA QUEUED
      2f 00 01 10 00 00 00 00      06:15:32.097  READ LOG EXT
      60 d0 00 00 44 d7 40 00      06:15:27.100  READ FPDMA QUEUED
      60 e7 28 19 2c ee 40 00      06:15:15.002  READ FPDMA QUEUED
    
    Error 5 occurred at disk power-on lifetime: 32839 hours (1368 days + 7 hours)
      When the command that caused the error occurred, the device was active or idle.
    
      After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      40 51 e7 19 2c ee 00  Error: UNC at LBA = 0x00ee2c19 = 15608857
    
      Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      60 d0 00 00 44 d7 40 00      06:15:27.100  READ FPDMA QUEUED
      60 e7 28 19 2c ee 40 00      06:15:15.002  READ FPDMA QUEUED
      61 78 20 08 5a 95 40 00      06:15:15.001  WRITE FPDMA QUEUED
      61 08 18 00 1a a8 40 00      06:15:15.001  WRITE FPDMA QUEUED
      61 08 10 00 5a 9d 40 00      06:15:15.001  WRITE FPDMA QUEUED
    
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Extended offline    Completed without error       00%     32535         -
    # 2  Short offline       Completed without error       00%     32459         -
    
    SMART Selective self-test log data structure revision number 1
     SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
        1        0        0  Not_testing
        2        0        0  Not_testing
        3        0        0  Not_testing
        4        0        0  Not_testing
        5        0        0  Not_testing
    Selective self-test flags (0x0):
      After scanning selected spans, do NOT read-scan remainder of disk.
    If Selective self-test is pending on power-up, resume after 0 minute delay.
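    For anyone reading the error log above: the LBA that smartctl prints on each "Error:" line can be rebuilt from the register bytes on the same row. The sketch below assumes the 28-bit layout that this summary error log records (SN is the low byte, then CL, then CH, plus the low nibble of DH); the function names are made up for illustration.

```python
REGISTER_NAMES = ["ER", "ST", "SC", "SN", "CL", "CH", "DH"]


def parse_register_row(row):
    """Turn a row such as '40 51 02 1d 2c ee 00' into a name -> int dict."""
    values = [int(token, 16) for token in row.split()[:7]]
    return dict(zip(REGISTER_NAMES, values))


def lba28_from_registers(regs):
    """Assemble a 28-bit LBA: DH low nibble, then CH, CL, SN."""
    return (
        ((regs["DH"] & 0x0F) << 24)
        | (regs["CH"] << 16)
        | (regs["CL"] << 8)
        | regs["SN"]
    )


# Register rows for errors 9 and 5, copied from the log above.
for row in ("40 51 02 1d 2c ee 00", "40 51 e7 19 2c ee 00"):
    lba = lba28_from_registers(parse_register_row(row))
    print(f"{row} -> LBA 0x{lba:08x} = {lba}")
```

    All five logged errors land within a few sectors of each other (15608857 to 15608863), which suggests one small damaged region rather than problems scattered across the platter.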
    EDIT:
    Quote Originally Posted by tgalati4
    As far as your original problem, I would suspect a sagging power supply. When you add another disk to a working array and the array starts to show issues, it could simply be too much current draw as all the drives try to spin up at once. That would account for raw read errors (the disks haven't spun up to speed completely) but otherwise OK read and write performance. Put a meter on your PSU and watch for voltage drops on the 12VDC rail during bootup. A 10% drop (10.8VDC) could be an issue, although I think server specifications are no more than 5% drop (11.4VDC).
    I just tested the 12V rail with a DMM; it read 12.25V to 12.27V from the time I hit the power button until the machine was fully booted. 10% of 12V would be 1.2V and 5% would be 0.6V, so at roughly 2% above nominal it looks to be within a 2.5% tolerance.

    All drives are set to spin up at the same time, so I guess that means power is good.
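    The tolerance arithmetic above can be written out as a small check. A minimal sketch, assuming a nominal rail of exactly 12.0V and using the 5% and 10% limits mentioned in the quote; the function names are just for illustration.

```python
NOMINAL_V = 12.0


def deviation_percent(measured_v, nominal_v=NOMINAL_V):
    """Signed deviation from nominal, as a percentage of nominal."""
    return (measured_v - nominal_v) / nominal_v * 100.0


def within_tolerance(measured_v, tolerance_percent, nominal_v=NOMINAL_V):
    """True if the reading is no more than tolerance_percent off nominal."""
    return abs(deviation_percent(measured_v, nominal_v)) <= tolerance_percent


# Lower limits implied by the 5% and 10% drops mentioned in the quote.
for drop in (5, 10):
    print(f"{drop}% drop -> {NOMINAL_V * (1 - drop / 100):.1f} V")

# The two DMM readings taken during boot.
for reading in (12.25, 12.27):
    print(
        f"{reading:.2f} V -> {deviation_percent(reading):+.2f}% "
        f"(within 2.5%: {within_tolerance(reading, 2.5)})"
    )
```

    Note that a handheld DMM averages over time, so very short dips during spin-up may not show on it.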
    Last edited by CharlesA; July 8th, 2013 at 09:17 AM.

  4. #24
    Join Date
    Jul 2010
    Location
    Michigan, USA
    Beans
    2,132
    Distro
    Ubuntu 14.04 Trusty Tahr

    Re: Help with interpreting SMART data

    Charles, are you seeing this behavior on any of your other disks attached to the same SFF-8087 cable? I ask because these WRITE FPDMA QUEUED errors are typically caused by either a SATA cable/connection or a power issue (loose cable, backplane issue, bad splitter, etc.). You have a lot of new pieces in your server: a new RAID card, new cables, and a hard drive dock, so I wouldn't immediately say this disk is bad. Also, I would try to find a cooler place to run that server; 45C is really hot for a modern hard drive.

  5. #25
    Join Date
    Oct 2009
    Beans
    Hidden!
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: Help with interpreting SMART data

    All the other disks are fine. I'll be moving it back to the other room where it's a bit cooler.

    The funny thing is that the drive in the internal drive bays is running around 7C cooler than the ones in the dock. I'm guessing that is due to there being a 120mm fan on the front panel vs an 80mm fan on the drive cage.

    EDIT: I wonder if I should replace the 80mm fan with something with more CFM, but I dunno.
    Last edited by CharlesA; July 8th, 2013 at 06:29 PM.

  6. #26
    Join Date
    Jul 2010
    Location
    Michigan, USA
    Beans
    2,132
    Distro
    Ubuntu 14.04 Trusty Tahr

    Re: Help with interpreting SMART data

    Quote Originally Posted by CharlesA
    All the other disks are fine. I'll be moving it back to the other room where it's a bit cooler.

    The funny thing is that the drive in the internal drive bays is running around 7C cooler than the ones in the dock. I'm guessing that is due to there being a 120mm fan on the front panel vs an 80mm fan on the drive cage.

    EDIT: I wonder if I should replace the 80mm fan with something with more CFM, but I dunno.
    That's good to hear. If all other disks are not showing any errors, that does point at this disk. Also, I don't have that version of the Icy Dock enclosure (mine is older), but mine has kept my drives within the same temperature range as those inside the case.

    Here is one from inside the Icy Dock enclosure.
    Code:
    root@fileserver:~# smartctl -a /dev/sdb | grep Temp
    194 Temperature_Celsius     0x0022   025   040   000    Old_age   Always       -       25 (0 16 0 0)
    and one inside the case with a 120mm in front of it.
    Code:
    root@fileserver:~# smartctl -a /dev/sdf | grep Temp
    194 Temperature_Celsius     0x0002   230   230   000    Old_age   Always       -       26 (Min/Max 16/39)
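    If you want to compare the two readings in a script, the raw value is the column to use. As far as I know, the normalized VALUE column is scaled differently by each vendor (025 vs 230 above for nearly the same temperature). A rough sketch, assuming the usual ten-column attribute layout that smartctl -a prints; parse_attribute_line is a made-up helper name.

```python
def parse_attribute_line(line):
    """Parse one 'smartctl -a' attribute row into a small dict.

    Assumes whitespace-separated columns: ID, name, flag, value, worst,
    thresh, type, updated, when_failed, raw (plus optional trailing notes).
    """
    fields = line.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": int(fields[9]),  # first raw token, degrees C for attribute 194
    }


dock_line = "194 Temperature_Celsius     0x0022   025   040   000    Old_age   Always       -       25 (0 16 0 0)"
case_line = "194 Temperature_Celsius     0x0002   230   230   000    Old_age   Always       -       26 (Min/Max 16/39)"

dock = parse_attribute_line(dock_line)
case = parse_attribute_line(case_line)
print(f"dock: {dock['raw']}C (normalized {dock['value']})")
print(f"case: {case['raw']}C (normalized {case['value']})")
print(f"difference: {abs(dock['raw'] - case['raw'])}C")
```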

  7. #27
    Join Date
    Oct 2009
    Beans
    Hidden!
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: Help with interpreting SMART data

    Nice. What's the ambient temperature?

    The ambient temperature here is currently around 84F (about 29C). It should keep things around that temperature.

    I just ordered 6 of these drives. The intention being to run 5 of them in RAID6 with 1 as a hot spare.
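    For a sense of what that layout trades away, here is a minimal sketch of the usable-space arithmetic. RAID6 spends two drives' worth of space on parity, so usable space is (members - 2) x drive size. The drive size is not stated in this thread, so HYPOTHETICAL_DRIVE_TB is only a placeholder.

```python
HYPOTHETICAL_DRIVE_TB = 2.0  # placeholder: the new drives' size is not given


def raid6_usable_tb(member_count, drive_tb):
    """Usable capacity of a RAID6 set: two members' worth goes to parity."""
    if member_count < 4:
        raise ValueError("RAID6 is normally built from at least 4 drives")
    return (member_count - 2) * drive_tb


layouts = {
    "5-drive RAID6 + 1 hot spare": 5,
    "6-drive RAID6, no spare": 6,
}
for label, members in layouts.items():
    usable = raid6_usable_tb(members, HYPOTHETICAL_DRIVE_TB)
    print(f"{label}: {usable:.1f} TB usable")
```

    Either way the array tolerates two failed members; the spare costs one drive of capacity in exchange for a rebuild that starts without anyone touching the box.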

    I opened the case up and it looks like a SATA power cable might have been touching the fan... which would explain the heat buildup.

  8. #28
    Join Date
    Jul 2010
    Location
    Michigan, USA
    Beans
    2,132
    Distro
    Ubuntu 14.04 Trusty Tahr

    Re: Help with interpreting SMART data

    Quote Originally Posted by CharlesA
    Nice. What's the ambient temperature?

    The ambient temperature here is currently around 84F (about 29C). It should keep things around that temperature.

    I just ordered 6 of these drives. The intention being to run 5 of them in RAID6 with 1 as a hot spare.

    I opened the case up and it looks like a SATA power cable might have been touching the fan... which would explain the heat buildup.
    The ambient temperature in my basement is 70F or ~21.1C, so mine are warmer than ambient by 4-5C. I'm all for data security, but with only (6) spindles, I would feel fine running without a hot spare (unless space isn't an issue at all). That should be a nice, large, and safe storage volume. Now, I just need to have my own Congressional approval (wife) to be able to green light buying 6 disks at once.
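    The conversions and the "4-5C over ambient" figure are easy to double-check. A quick sketch using the standard Fahrenheit-to-Celsius formula and the two drive readings posted earlier in the thread (25C in the Icy Dock, 26C in the case):

```python
def fahrenheit_to_celsius(temp_f):
    """Standard conversion: C = (F - 32) * 5 / 9."""
    return (temp_f - 32.0) * 5.0 / 9.0


basement_c = fahrenheit_to_celsius(70)  # basement ambient from this post
room_c = fahrenheit_to_celsius(84)      # room ambient quoted above

print(f"70F = {basement_c:.1f}C, 84F = {room_c:.1f}C")

# Drive temperatures from the smartctl readings posted earlier.
drive_temps_c = {"/dev/sdb (Icy Dock)": 25, "/dev/sdf (in case)": 26}
for name, temp_c in drive_temps_c.items():
    print(f"{name}: {temp_c - basement_c:.1f}C above basement ambient")
```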

  9. #29
    Join Date
    Oct 2009
    Beans
    Hidden!
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: Help with interpreting SMART data

    Quote Originally Posted by rubylaser
    The ambient temperature in my basement is 70F or ~21.1C, so mine are warmer than ambient by 4-5C. I'm all for data security, but with only (6) spindles, I would feel fine running without a hot spare (unless space isn't an issue at all). That should be a nice, large, and safe storage volume. Now, I just need to have my own Congressional approval (wife) to be able to green light buying 6 disks at once.
    My credit card weeps.

    Thanks for the info about the hot spare. My logic on having one was to give the array the ability to rebuild should a drive fail and run off the spare drive while I RMA the bad one.

    I dunno if that logic is actually sound or not, but those were my thought processes.

    I just redid all the cable management and routed stuff away from fans, but now one drive doesn't want to come up. It's drive #1, which hadn't shown any errors at all.

    Guess I get to troubleshoot the drives and the new drive cage... *sigh*

  10. #30
    Join Date
    Jul 2010
    Location
    Michigan, USA
    Beans
    2,132
    Distro
    Ubuntu 14.04 Trusty Tahr

    Re: Help with interpreting SMART data

    Take a photo of the case and setup. I'd love to see how it all turned out. Sorry to hear about the drive not coming up, but I'm sure you'll figure it out. And 1 hot spare is a great idea from a data security standpoint, plus you can always still add (2) more drives down the line if you need extra space.
