Use the drive manufacturer's diagnostic tools.
The only thing I can think of is that your partitions aren't aligned on 4K boundaries; that can make life hard for HDDs. Also, SMART isn't 100% accurate. I watch my drives by running short tests weekly and long tests monthly, then collect the reports and look for changes over time. That's really the only way I know to predict a failure. A single test run once a year isn't enough to see the trend.
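If you want to check alignment yourself, here's a rough sketch. The device name is an example, and is_4k_aligned is a helper I'm making up for illustration; the smartctl commands at the bottom are the real ones I use.

```shell
# Hypothetical device name; adjust to your drive.
DEV=/dev/sda

# With 512-byte logical sectors, a partition is 4K-aligned when its
# start sector is divisible by 8 (8 * 512 = 4096).
is_4k_aligned() {
  if [ $(( $1 % 8 )) -eq 0 ]; then echo aligned; else echo misaligned; fi
}

is_4k_aligned 2048   # the common modern default start: aligned
is_4k_aligned 63     # the old DOS-era default: misaligned

# Real partition start sectors: sudo fdisk -l "$DEV"
# Or let parted check for you:  sudo parted "$DEV" align-check optimal 1
#
# The tests themselves:
#   sudo smartctl -t short "$DEV"                      # weekly
#   sudo smartctl -t long  "$DEV"                      # monthly
#   sudo smartctl -a "$DEV" > "smart.$(date +%F).sda"  # save each report
```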
For example, the usual red flags in the SMART reports (pending and reallocated sectors) were all fine on the last HDD that failed here. But here's the final report I used to get the vendor to approve an RMA:
Code:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   001   001   051    Pre-fail  Always   FAILING_NOW 87106
  3 Spin_Up_Time            0x0027   161   115   021    Pre-fail  Always   -           10933
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always   -           20
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always   -           0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always   -           6133
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always   -           0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           20
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always   -           14
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always   -           7
194 Temperature_Celsius     0x0022   105   091   000    Old_age   Always   -           47
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x0008   198   198   000    Old_age   Offline  -           1819
Nothing in those numbers to make me worry about data corruption, at least not when this initially started. Over time, though:
Code:
$ egrep Raw_Read_Error_Rate smart.202*sda
smart.2023-10-10.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
smart.2023-10-17.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
smart.2023-10-24.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
smart.2023-10-31.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
smart.2023-11-07.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 1
smart.2023-11-14.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 1
smart.2023-11-21.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 1
smart.2023-11-28.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 1
smart.2023-12-05.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
smart.2023-12-12.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
smart.2023-12-19.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
smart.2023-12-26.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
smart.2024-01-02.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 5
smart.2024-01-09.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
smart.2024-01-16.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 3
smart.2024-01-23.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
smart.2024-01-30.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
smart.2024-02-06.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 5
smart.2024-02-13.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 2
smart.2024-02-20.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 2
smart.2024-02-27.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 2
smart.2024-03-05.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 7
smart.2024-03-12.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 10
smart.2024-03-19.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 20
smart.2024-03-26.sda: 1 Raw_Read_Error_Rate 0x002f 199 199 051 Pre-fail Always - 61
smart.2024-04-02.sda: 1 Raw_Read_Error_Rate 0x002f 117 117 051 Pre-fail Always - 3184
smart.2024-04-07.sda: 1 Raw_Read_Error_Rate 0x002f 001 001 051 Pre-fail Always FAILING_NOW 87106
smart.2024-04-09.sda: 1 Raw_Read_Error_Rate 0x002f 001 001 051 Pre-fail Always FAILING_NOW 87095
smart.2024-04-16.sda: 1 Raw_Read_Error_Rate 0x002f 200 001 051 Pre-fail Always In_the_past 5
smart.2024-04-23.sda: 1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
That last line is from a new HDD.
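Saving the weekly reports makes trend-spotting scriptable, too. A sketch (check_trend is just a name I'm making up here; the smart.YYYY-MM-DD.sda files are the saved weekly reports):

```shell
# Flag week-over-week increases in the raw Raw_Read_Error_Rate value
# across saved reports, passed in chronological order.
check_trend() {
  grep -h Raw_Read_Error_Rate "$@" | awk '
    { raw = $NF + 0
      if (NR > 1 && raw > prev) print "increase: " prev " -> " raw
      prev = raw }'
}
# Usage: check_trend smart.202*sda   (the glob sorts chronologically)
```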
The error rate kept getting worse and worse, until the drive failed outright. By the time it did, it had already been replaced and was sitting in a USB 2.0 dock being wiped with random data. It has since been shipped off for RMA.
Initially, writes became slow, very slow. Then reads from the files that had been slow to write were also REALLY slow, while other files were fine. That made me look at the SMART data more carefully, since the HDD was only 8 months old and came new with a 5-year warranty. I stopped buying HDDs with less than 5-year warranties about 3-4 years ago; dealing with data issues more than about once a decade is just too much hassle for me.

Anyway, since there weren't any reallocated or pending sectors, I made sure all the data was backed up to other disks, reformatted the drive with a fresh ext4, then moved everything back. Getting the data off in the first place was the hard part: a simple copy kept failing, so I used ddrescue on a file-by-file basis. Out of every 100 files, over 99 moved quickly, but that last 1% ran overnight.
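The file-by-file ddrescue pass can be scripted roughly like this. rescue_tree and the example paths are invented for the sketch; the real run was messier.

```shell
# Copy a directory tree file by file with ddrescue so one bad file
# can't stall the whole transfer. Each file gets its own map file,
# so a later rerun can retry just the bad spots.
rescue_tree() {  # usage: rescue_tree SRC DST
  local src=$1 dst=$2 f
  ( cd "$src" && find . -type f -print0 ) |
  while IFS= read -r -d '' f; do
    mkdir -p "$dst/$(dirname "$f")"
    # -n: skip the slow scraping phase on the first pass.
    ddrescue -n "$src/$f" "$dst/$f" "$dst/$f.map"
  done
}
# Usage: rescue_tree /mnt/failing /mnt/backup
```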
I also was monitoring the drive temperature. It was warm, but not hot.
BTW, I really do run those SMART tests weekly:
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       6013        -
# 2  Short offline       Completed without error       00%       5832        -
# 3  Short offline       Completed without error       00%       5664        -
# 4  Short offline       Completed without error       00%       5497        -
# 5  Extended offline    Completed without error       00%       5342        -
# 6  Short offline       Completed without error       00%       5162        -
# 7  Short offline       Completed without error       00%       4995        -
# 8  Short offline       Completed without error       00%       4827        -
# 9  Extended offline    Completed without error       00%       4670        -
#10  Short offline       Completed without error       00%       4491        -
#11  Short offline       Completed without error       00%       4323        -
#12  Short offline       Completed without error       00%       4155        -
#13  Short offline       Completed without error       00%       3988        -
#14  Extended offline    Completed without error       00%       3832        -
#15  Short offline       Completed without error       00%       3652        -
#16  Short offline       Completed without error       00%       3484        -
#17  Short offline       Completed without error       00%       3316        -
#18  Extended offline    Completed without error       00%       3160        -
#19  Short offline       Completed without error       00%       2981        -
#20  Short offline       Completed without error       00%       2813        -
#21  Short offline       Completed without error       00%       2645        -
That's a long test on the first Monday of every month and a short test on each of the remaining Mondays.
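Since plain cron can't express "first Monday of the month" directly, I use the day-of-month trick: every Monday, check whether the day is 7 or less. A sketch of a cron entry for that schedule (time of day and device name are examples):

```shell
# /etc/cron.d sketch: every Monday at 03:00 run a long SMART test if this
# is the first Monday of the month (day-of-month <= 7), else a short test.
# Note: cron treats a bare % specially, hence the backslash escape.
0 3 * * 1  root  [ "$(date +\%d)" -le 7 ] && smartctl -t long /dev/sda || smartctl -t short /dev/sda
```

smartd can also drive a similar schedule from smartd.conf with its `-s` regex option, if you'd rather not touch cron.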
See how looking at the data over time let me be proactive? In the end, I didn't lose any data, even though one new file became inaccessible when the problem first began.
I should also mention that the disk was for scratch use, not archival storage, so it didn't have the solid daily backups all my other data gets. Most of the data was being migrated from an old RAID setup to this drive, and I got bogged down. I hadn't deleted the RAID copy, which is why almost nothing outside the "scratch" area was lost.
A drive making noise is never a good sign. Start looking more closely at the SMART reports and test weekly. You won't know what the problem is until it's an emergency, but you need to be prepared. If it were me, I'd demote a noisy disk that still works to backup duty and put in a quiet new disk as the primary.