For more than a year I have occasionally been plagued by a hanging *buntu. Every time this has happened, the problem could be traced to "FPDMA QUEUED" errors in the kernel log, as in this bug report: https://bugs.launchpad.net/ubuntu/+s...ux/+bug/550559 Since this problem is really annoying (it makes the computer unreliable), I want to take a closer look and determine whether it's the drive or just a Linux kernel issue.

I have run multiple SMART tests over the past year, and every time the disk passed. The most recent extended test (more than 2 hours) does report read errors (a nonzero Raw_Read_Error_Rate raw value) but no pending sectors, and the test ran to completion. Furthermore, I booted the Ultimate Boot CD, and the Western Digital diagnostics tool found no errors on a full scan.

Another odd thing is how Windows behaves in this regard. I ran Vista on this machine for 2 years without any apparent HDD errors (at least no 'show stoppers' for the OS). Presently I have a Windows 7 / Kubuntu dual boot, and again there has never been a sign of hard drive trouble in W7.

This is (hard) driving me insane. Can anyone shed some light on this situation? What should I do?

Below I have provided some relevant lines from /var/log/kern.log and the results from a SMART test performed today.

This is one of the occasions I managed to capture where my *buntu install hung:
Code:
Sep 30 17:30:33 mycomputer kernel: [31175.824062] ata1.00: exception Emask 0x0 SAct 0x7 SErr 0x0 action 0x6 frozen
Sep 30 17:30:33 mycomputer kernel: [31175.824074] ata1.00: failed command: READ FPDMA QUEUED
Sep 30 17:30:33 mycomputer kernel: [31175.824086] ata1.00: cmd 60/40:00:00:15:7e/00:00:38:00:00/40 tag 0 ncq 32768 in
Sep 30 17:30:33 mycomputer kernel: [31175.824089]          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 30 17:30:33 mycomputer kernel: [31175.824095] ata1.00: status: { DRDY }
Sep 30 17:30:33 mycomputer kernel: [31175.824100] ata1.00: failed command: WRITE FPDMA QUEUED
Sep 30 17:30:33 mycomputer kernel: [31175.824110] ata1.00: cmd 61/20:08:18:30:f9/00:00:34:00:00/40 tag 1 ncq 16384 out
Sep 30 17:30:33 mycomputer kernel: [31175.824113]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 30 17:30:33 mycomputer kernel: [31175.824119] ata1.00: status: { DRDY }
Sep 30 17:30:33 mycomputer kernel: [31175.824124] ata1.00: failed command: WRITE FPDMA QUEUED
Sep 30 17:30:33 mycomputer kernel: [31175.824133] ata1.00: cmd 61/18:10:20:1b:0b/00:00:32:00:00/40 tag 2 ncq 12288 out
Sep 30 17:30:33 mycomputer kernel: [31175.824136]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 30 17:30:33 mycomputer kernel: [31175.824140] ata1.00: status: { DRDY }
Sep 30 17:30:33 mycomputer kernel: [31175.824149] ata1: hard resetting link
Sep 30 17:30:34 mycomputer kernel: [31176.316050] ata1: softreset failed (device not ready)
Sep 30 17:30:34 mycomputer kernel: [31176.316059] ata1: applying PMP SRST workaround and retrying
Sep 30 17:30:34 mycomputer kernel: [31176.488053] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 30 17:30:34 mycomputer kernel: [31176.511416] ata1.00: configured for UDMA/133
Sep 30 17:30:34 mycomputer kernel: [31176.524026] ata1.00: device reported invalid CHS sector 0
Sep 30 17:30:34 mycomputer kernel: [31176.524033] ata1.00: device reported invalid CHS sector 0
Sep 30 17:30:34 mycomputer kernel: [31176.524038] ata1.00: device reported invalid CHS sector 0
Sep 30 17:30:34 mycomputer kernel: [31176.524052] ata1: EH complete
Sep 30 17:31:04 mycomputer kernel: [31206.880065] ata1.00: exception Emask 0x0 SAct 0x4 SErr 0x0 action 0x6 frozen
Sep 30 17:31:04 mycomputer kernel: [31206.880185] ata1.00: failed command: READ FPDMA QUEUED
Sep 30 17:31:04 mycomputer kernel: [31206.880267] ata1.00: cmd 60/40:10:00:15:7e/00:00:38:00:00/40 tag 2 ncq 32768 in
Sep 30 17:31:04 mycomputer kernel: [31206.880269]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 30 17:31:04 mycomputer kernel: [31206.880460] ata1.00: status: { DRDY }
Sep 30 17:31:04 mycomputer kernel: [31206.880520] ata1: hard resetting link
Sep 30 17:31:05 mycomputer kernel: [31207.372037] ata1: softreset failed (device not ready)
Sep 30 17:31:05 mycomputer kernel: [31207.372135] ata1: applying PMP SRST workaround and retrying
Sep 30 17:31:05 mycomputer kernel: [31207.544059] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 30 17:31:05 mycomputer kernel: [31207.574052] ata1.00: configured for UDMA/133
Sep 30 17:31:05 mycomputer kernel: [31207.588038] ata1.00: device reported invalid CHS sector 0
Sep 30 17:31:05 mycomputer kernel: [31207.588053] ata1: EH complete
Sep 30 17:31:35 mycomputer kernel: [31237.856070] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Sep 30 17:31:35 mycomputer kernel: [31237.856191] ata1.00Sep 30 20:57:34 mycomputer kernel: imklog 5.8.6, log source = /proc/kmsg started.
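
To get an overview of how often each command type fails in an excerpt like the one above, the "failed command" lines can be tallied with a small script. This is only a rough sketch: the function name `count_failed_commands` and the regular expression are my own assumptions, written against the line format shown in this log, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Matches lines such as "ata1.00: failed command: READ FPDMA QUEUED".
# The pattern is an assumption based on the excerpt above.
FAILED_RE = re.compile(r"(ata\d+\.\d+): failed command: (.+)$")


def count_failed_commands(lines):
    """Count failed ATA commands per (device, command) pair."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.search(line.rstrip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# A few lines copied from the excerpt above.
sample = [
    "Sep 30 17:30:33 mycomputer kernel: [31175.824074] ata1.00: failed command: READ FPDMA QUEUED",
    "Sep 30 17:30:33 mycomputer kernel: [31175.824100] ata1.00: failed command: WRITE FPDMA QUEUED",
    "Sep 30 17:30:33 mycomputer kernel: [31175.824124] ata1.00: failed command: WRITE FPDMA QUEUED",
    "Sep 30 17:30:33 mycomputer kernel: [31175.824149] ata1: hard resetting link",
]

counts = count_failed_commands(sample)
for (device, command), n in sorted(counts.items()):
    print(device, command, n)
```

To run it over the whole log instead of the sample, an open file object for /var/log/kern.log can be passed in place of the list, since the function only iterates over lines.
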
Here are the results from
Code:
sudo smartctl -a /dev/sda
after an extended self-test:
Code:
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-31-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green
Device Model:     WDC WD5000AACS-00ZUB0
Serial Number:    WD-WCASU2847586
LU WWN Device Id: 5 0014ee 2014e7ec9
Firmware Version: 01.01B01
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Oct  6 14:59:14 2012 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (13980) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 163) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   198   197   051    Pre-fail  Always       -       31671
  3 Spin_Up_Time            0x0003   167   161   021    Pre-fail  Always       -       4633
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3304
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15556
 10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       3300
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       134
193 Load_Cycle_Count        0x0032   016   016   000    Old_age   Always       -       552064
194 Temperature_Celsius     0x0022   112   100   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   199   000    Old_age   Always       -       33
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     15555         -
# 2  Extended offline    Completed without error       00%     15455         -
# 3  Short offline       Completed without error       00%     15452         -
# 4  Conveyance offline  Completed without error       00%     15439         -
# 5  Extended offline    Completed without error       00%     15255         -
# 6  Short offline       Completed without error       00%     15247         -
# 7  Short offline       Completed without error       00%     14841         -
# 8  Extended offline    Completed without error       00%     12729         -
# 9  Short offline       Completed without error       00%     12649         -
#10  Extended offline    Completed without error       00%     12266         -
#11  Short offline       Completed without error       00%     12264         -
#12  Short offline       Aborted by host               10%     12263         -
#13  Conveyance offline  Completed without error       00%     11071         -
#14  Extended offline    Completed without error       00%     11070         -
#15  Short offline       Completed without error       00%     11016         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
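
For comparing SMART runs over time, the raw values mentioned above (read errors, pending sectors) can be pulled out of the attribute table with a short script. This is a minimal sketch, assuming the ten-column table layout printed by this smartctl version; the function name `parse_smart_attributes` is hypothetical, and raw values are kept as strings because some drives print extra text in that column.

```python
def parse_smart_attributes(text):
    """Map ATTRIBUTE_NAME to its raw value (as a string) from `smartctl -a` output."""
    attributes = {}
    in_table = False
    for line in text.splitlines():
        fields = line.split()
        if fields[:2] == ["ID#", "ATTRIBUTE_NAME"]:
            # Header row found; attribute rows follow.
            in_table = True
            continue
        if in_table:
            # The table ends at the first line that does not start with a numeric ID.
            if len(fields) < 10 or not fields[0].isdigit():
                break
            attributes[fields[1]] = fields[9]
    return attributes


# Rows copied from the smartctl output above.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   198   197   051    Pre-fail  Always       -       31671
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
"""

attrs = parse_smart_attributes(sample)
for name in ("Raw_Read_Error_Rate", "Reallocated_Sector_Ct", "Current_Pending_Sector"):
    print(name, attrs[name])
```

The same function should work on the full output of the smartctl command shown earlier if that output is saved to a file and read in as one string.
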