Re: 9.10 upgrade says I have failing hard drive
I am 100% with you there.
But my manufacturer's diagnostics say it is okay.
Now, once again, I am getting errors reported, and the Ubuntu 'test your disk' feature is not working. I tried the short / quick test, which should take about 10 minutes, but 30 minutes later it had 'hung', so I had to cancel it.
But, as I'm also running 10.04 alpha, I have plenty of backups -- You can never have too many backups.
Phill.
Re: 9.10 upgrade says I have failing hard drive
I recently upgraded from Intrepid -> Jaunty -> Karmic - as soon as Karmic booted, I got the warning that my drive was bad and that I should replace it.
I have a Hitachi Deskstar 7K160 / HDS721616PLA380 / 160GB drive.
Palimpsest overall assessment is 'DISK HAS MANY BAD SECTORS - back up data and replace disk'.
I wasn't convinced, as my drive has been running fine for years, so I spent several hours investigating this problem.
I've learnt quite a bit about SMART in the last 24 hours, so I'll post what I know here to help people make an informed decision before replacing drives - I fear that some people have already replaced perfectly healthy drives because of this false error.
The palimpsest (very stupid name) utility reports that my disk drive has 196,619 bad sectors - THIS IS NOT CORRECT, my drive ACTUALLY has *3* reallocated sectors, which is perfectly fine for a modern disk drive.
In my case, SMART attribute 5 (Reallocated Sector Count) has a raw value** of 0x0B0003000000. Palimpsest assumes that this is a single 48-bit integer and converts it to 196,619 (0x00000003000B - the byte sequence is stored low-to-high). But the format and meaning of the raw value are entirely up to the manufacturer: they can put whatever they like in there and don't have to publish what it means - some treat it as a 'trade secret'.
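To show where the 196,619 figure comes from, here's a minimal Python sketch. The single-integer interpretation is what Palimpsest appears to do; the vendor layout shown afterwards is an assumption - only the manufacturer knows the real format of the raw bytes.

```python
# The six raw bytes of SMART attribute 5 on this drive, low byte first
# (0x0B0003000000 as reported by Palimpsest).
raw = bytes([0x0B, 0x00, 0x03, 0x00, 0x00, 0x00])

# Palimpsest's interpretation: one 48-bit little-endian integer.
as_single_int = int.from_bytes(raw, "little")
print(as_single_int)  # 196619 == 0x00000003000B

# A plausible vendor layout (an assumption, not documented anywhere):
# a 16-bit reallocated-sector count in bytes 2-3, other data elsewhere.
realloc_count = int.from_bytes(raw[2:4], "little")
print(realloc_count)  # 3 - consistent with attribute 196 (Reallocated_Event_Count)
```

The point is that the same six bytes yield either 196,619 or 3 depending on an interpretation the standard does not define - which is exactly why the raw value shouldn't drive a "replace your disk" warning.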
So, as far as SMART monitoring is concerned, what is important are the normalised VALUE and THRESHOLD values.
Here are my drive SMART stats:
Code:
> sudo smartctl -A /dev/sda
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 095 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 100 100 050 Pre-fail Offline - 85
3 Spin_Up_Time 0x0007 120 100 024 Pre-fail Always - 168 (Average 164)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 1768
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 196619
7 Seek_Error_Rate 0x000b 100 099 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 136 100 020 Pre-fail Offline - 31
9 Power_On_Hours 0x0012 098 098 000 Old_age Always - 16416
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 1767
192 Power-Off_Retract_Count 0x0032 099 099 000 Old_age Always - 1887
193 Load_Cycle_Count 0x0012 099 099 000 Old_age Always - 1887
194 Temperature_Celsius 0x0002 166 130 000 Old_age Always - 36 (Lifetime Min/Max 13/47)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 3
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always - 294
For most attributes the VALUE will start at either 200 or 100 when your drive is new - over time, some of these values will decrease towards the THRESHOLD value. If a VALUE reaches or drops below the THRESHOLD value, the attribute is flagged as FAILED and the health status of your drive may change.
Earlier in this thread, someone made the assumption that the threshold value had a direct relation to the attribute count - this isn't necessarily true. The VALUE and THRESHOLD are calculated and updated by the hard-drive's firmware - only the manufacturer knows what these normalised values really mean (they're more likely to be percentage values than actual counts).
The WHEN_FAILED column shows the point in the lifetime of the drive at which the attribute VALUE reached the THRESHOLD - the drive keeps track of how many hours it has been in use (accumulated power-on hours) and records that figure in the WHEN_FAILED column.
As you can see from the smartctl output, NONE of my drive's attributes values have reached their threshold and therefore all the WHEN_FAILED values are blank.
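The VALUE-vs-THRESHOLD check described above can be sketched in a few lines of Python. This is a hypothetical helper, not part of smartctl; the column positions are assumed from the `smartctl -A` output shown earlier.

```python
# Flag SMART attributes whose normalised VALUE has reached the THRESHOLD.
# Expects lines in the format printed by `smartctl -A` (columns assumed:
# ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, ...).
def failing_attributes(smartctl_output: str) -> list[str]:
    failed = []
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; skip headers and blanks.
        if len(fields) >= 6 and fields[0].isdigit():
            name, value, thresh = fields[1], int(fields[3]), int(fields[5])
            # A threshold of 0 means the attribute can never "fail".
            if thresh > 0 and value <= thresh:
                failed.append(name)
    return failed

sample = """  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       196619
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       294"""
print(failing_attributes(sample))  # [] - nothing at or below its threshold
```

Run against the full table above, it returns an empty list - which matches the blank WHEN_FAILED column: this drive has no failed attributes.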
I have a healthy drive.
Palimpsest should not be interpreting the raw values of some attributes and then making assumptions about them - it certainly should not be suggesting I replace my drive based on a value that has NO AGREED FORMAT. The hard drive firmware is designed to indicate problems through the SMART attributes table - the important indicators are VALUE, THRESHOLD and WHEN_FAILED, and that is what I'll be paying careful attention to from now on.
** (to get a raw value from palimpsest, just hover the mouse pointer over the attribute)
Re: 9.10 upgrade says I have failing hard drive
My 2 cents:
Assuming the "bad sector" failure mode doesn't naturally account for nearly 100% of all drive failures, the scenario we are in is highly improbable - you'd think at least one of us would have found our drive failing for a different reason. I'm concluding it's a bug.
Re: 9.10 upgrade says I have failing hard drive
Prompted by the last post, I decided to check on my drives again out of curiosity. Bizarrely, palimpsest is now reporting 'SMART unavailable' for my two installed hard-drives...
...oh well, I really wish I hadn't 'upgraded' from Jaunty to Karmic. Roll on 10.04...
Re: 9.10 upgrade says I have failing hard drive
!!!WARNING!!! THIS REALLY MAY NOT BE A BUG. I thought it was for the longest time. But I'm a techie when it comes to hardware, and after testing it on over 5 drives - all in the same box, different boxes, different jumpers, configs, you name it - the report only occurred for me on the drives that eventually went bad. All the drives it said were failing have since failed: some a week later, some 3 months later, and so on.
If you have this error, make sure to back up pretty much all the time. One minute your drive will work, the next it won't - period. Until then, to my knowledge, it will work perfectly, which is what makes it seem like a simple bug. I discounted this error over and over, and thank goodness I back up fairly religiously or I would have lost more data. So, please, back your important stuff up and stop thinking of this as a bug before you lose data like I did. If it so happens your hard drive doesn't fail for the next year and a half, you have permission to call me a horse's ****.