Page 9 of 9
Results 81 to 85 of 85

Thread: 9.10 upgrade says I have failing hard drive

  1. #81
    Join Date
    May 2009
    Location
    North West England
    Beans
    2,676
    Distro
    Ubuntu Development Release

    Re: 9.10 upgrade says I have failing hard drive

    I am 100% with you there.

    But my manufacturer's diagnostics say it is okay.

    Now, once again, I am getting errors reported, and the Ubuntu 'test your disk' is not working. The short/quick test was supposed to take about 10 minutes, but 30 minutes later it had hung, so I had to cancel it.

    But, as I'm also running 10.04 alpha, I have plenty of backups -- You can never have too many backups.

    Phill.

  2. #82
    Join Date
    Apr 2008
    Beans
    5
    Distro
    Ubuntu 9.10 Karmic Koala

    Re: 9.10 upgrade says I have failing hard drive

    I recently upgraded from Intrepid -> Jaunty -> Karmic - as soon as Karmic booted, I got the warning that my drive was bad and that I should replace it.

    I have a Hitachi Deskstar 7K160 / HDS721616PLA380 / 160GB drive.

    Palimpsest overall assessment is 'DISK HAS MANY BAD SECTORS - back up data and replace disk'.

    I wasn't convinced, as my drive has been running fine for years, so I spent several hours investigating this problem.

    I've learnt quite a bit about SMART in the last 24 hours, so I'll post what I know here, so people can make an informed decision before replacing drives - I fear that some people have already replaced perfectly healthy drives because of this false error.


    The palimpsest (very stupid name) utility reports that my disk drive has 196,619 bad sectors - THIS IS NOT CORRECT, my drive ACTUALLY has *3* reallocated sectors, which is perfectly fine for a modern disk drive.

    In my case, SMART attribute 5 (Reallocated Sector Count) has a raw value** of 0x0B0003000000. Palimpsest assumes this is a single 48-bit integer and converts it to 196,619 (0x00000003000B once the byte sequence is reversed low-to-high). But the format and meaning of the raw value are entirely up to the manufacturer - they can put whatever they like in here and don't have to release the meaning of the value; some treat it as a 'trade secret'.
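    To make the two readings concrete, here is a small sketch (Python, purely illustrative - the split into 16-bit fields is a guess at one plausible vendor layout, not a documented Hitachi format):

```python
# Sketch: two ways to read the same 6 raw bytes of SMART attribute 5.
# Byte values are taken from the raw value quoted above
# (0x0B0003000000); the 16-bit-field split is an assumption.

raw = bytes([0x0B, 0x00, 0x03, 0x00, 0x00, 0x00])  # low byte first

# Palimpsest's reading: one little-endian 48-bit integer
as_uint48 = int.from_bytes(raw, "little")
print(as_uint48)  # 196619 -- the alarming "bad sector" count

# Alternative reading: three independent little-endian 16-bit fields
fields = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, 6, 2)]
print(fields)  # [11, 3, 0] -- one field matches the 3 reallocations
```

    Read the vendor-specific way, the middle field is 3 - which agrees with attribute 196 (Reallocated Event Count) in the table below.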

    So, as far as SMART monitoring is concerned, what is important are the normalised VALUE and THRESHOLD values.

    Here are my drive SMART stats:

    Code:
    > sudo smartctl -A /dev/sda
    
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000b   100   095   016    Pre-fail  Always       -       0
      2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       85
      3 Spin_Up_Time            0x0007   120   100   024    Pre-fail  Always       -       168 (Average 164)
      4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       1768
      5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       196619
      7 Seek_Error_Rate         0x000b   100   099   067    Pre-fail  Always       -       0
      8 Seek_Time_Performance   0x0005   136   100   020    Pre-fail  Offline      -       31
      9 Power_On_Hours          0x0012   098   098   000    Old_age   Always       -       16416
     10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1767
    192 Power-Off_Retract_Count 0x0032   099   099   000    Old_age   Always       -       1887
    193 Load_Cycle_Count        0x0012   099   099   000    Old_age   Always       -       1887
    194 Temperature_Celsius     0x0002   166   130   000    Old_age   Always       -       36 (Lifetime Min/Max 13/47)
    196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       3
    197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       294
    For most attributes the VALUE will start at either 200 or 100 when your drive is new - over time, some of these values will decrease towards the THRESHOLD value. If a VALUE reaches or drops below the THRESHOLD value, the attribute is flagged as FAILED and the health status of your drive may change.
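    The pass/fail rule described above can be sketched in a few lines (Python, illustrative only; the sample rows are taken from the smartctl output above, and treating a THRESHOLD of 0 as 'informational, never fails' is my reading of how these Old_age attributes behave):

```python
# Minimal sketch of the SMART pass/fail rule: an attribute is flagged
# as failed when its normalised VALUE reaches or drops below its
# THRESHOLD.  Rows: (id, name, VALUE, THRESH) from the table above.
attributes = [
    (1,   "Raw_Read_Error_Rate",    100, 16),
    (5,   "Reallocated_Sector_Ct",  100,  5),
    (197, "Current_Pending_Sector", 100,  0),
]

def failed(value: int, threshold: int) -> bool:
    # A threshold of 0 marks an informational attribute (assumption)
    return threshold > 0 and value <= threshold

failing = [name for _, name, value, thresh in attributes if failed(value, thresh)]
print(failing)  # [] -- no attribute has reached its threshold
```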

    Earlier in this thread, someone made the assumption that the threshold value had a direct relation to the attribute count - this isn't necessarily true. The VALUE and THRESHOLD are calculated and updated by the hard-drive's firmware - only the manufacturer knows what these normalised values really mean (they're more likely to be percentage values than actual counts).

    The WHEN_FAILED column shows the point in the drive's lifetime at which an attribute's VALUE reached its THRESHOLD - the drive keeps track of how many hours it has been in use (accumulated power-on hours) and records that value in the WHEN_FAILED column.
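    A sketch of how that bookkeeping might work (Python, illustrative only - the real logic lives in the drive firmware):

```python
# Illustrative model of the WHEN_FAILED bookkeeping described above:
# when an attribute's normalised VALUE first reaches its THRESHOLD,
# the drive stamps the current power-on hours against the attribute.
def update_when_failed(value, threshold, power_on_hours, when_failed=None):
    """Return the WHEN_FAILED stamp, setting it on first failure."""
    if when_failed is None and threshold > 0 and value <= threshold:
        return power_on_hours      # record the failure time
    return when_failed             # unchanged (None = never failed)

print(update_when_failed(100, 5, 16416))  # None -- healthy attribute
print(update_when_failed(4, 5, 16416))    # 16416 -- failed at this hour
```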

    As you can see from the smartctl output, NONE of my drive's attributes values have reached their threshold and therefore all the WHEN_FAILED values are blank.

    I have a healthy drive.

    Palimpsest should not be interpreting the raw values of some attributes and then making assumptions about them - it certainly should not be suggesting I replace my drive based on a value that has NO AGREED FORMAT. The hard-drive firmware is designed to indicate problems through the SMART attribute table - the important indicators are VALUE, THRESHOLD and WHEN_FAILED, and those are what I'll be paying careful attention to from now on.


    ** (to get a raw value from palimpsest, just hover the mouse pointer over the attribute)
    Last edited by Rick Deckard; March 8th, 2010 at 03:36 PM. Reason: Typos

  3. #83
    Join Date
    Apr 2010
    Beans
    1

    Re: 9.10 upgrade says I have failing hard drive

    My 2 cents:

    Assuming the 'bad sector' failure mode doesn't naturally account for nearly 100% of all failure probability, the scenario we are in is highly improbable - you'd expect at least one of us to find our drives failing for a different reason. I'm concluding it's a bug.

  4. #84
    Join Date
    Apr 2008
    Beans
    5
    Distro
    Ubuntu 9.10 Karmic Koala

    Re: 9.10 upgrade says I have failing hard drive

    Prompted by the last post, I decided to check on my drives again out of curiosity. Bizarrely, palimpsest is now reporting 'SMART unavailable' for my two installed hard-drives...

    ...oh well, I really wish I hadn't 'upgraded' from Jaunty to Karmic. Roll on 10.04...

  5. #85
    Join Date
    Aug 2009
    Beans
    20

    Re: 9.10 upgrade says I have failing hard drive

    !!!WARNING!!! THIS REALLY MAY NOT BE A BUG. I thought it was for the longest time. I'm a techie when it comes to hardware, though, and I tested it on over 5 drives - all in the same box, different boxes, different jumpers, configs, you name it. The report only occurred for me on the drives that eventually went bad, and every drive it said was failing has since failed - some a week later, some 3 months later.

    If you have this error, make sure to back up pretty much all the time. One minute your drive will work, the next it won't. Period. Until then, though, to my knowledge it will work perfectly, which is what makes it seem like a simple bug. I discounted this error over and over, and thank goodness I back up fairly religiously or I would have lost even more data.

    So, please, back your important stuff up and stop thinking of this as a bug before you lose some data like I did. If it so happens your hard drive doesn't fail for the next year and a half, you have permission to call me a horse's ****.

