NTFS isn't native to Linux. Linux support for it was reverse engineered without help from its creator, MSFT. So it isn't strange that reverse-engineered, barely working access to a foreign file system has problems and cannot correct every possible issue.
We always say, use the OS native to the file system to correct any issues.
NTFS ---> MS-Windows
ext4 ---> Linux
After all, would you expect MS-Windows to be able to fix ext4, jfs, xfs, zfs, btrfs, f2fs, or the other 30 file systems available as native to Linux?
Now, if MSFT wants to release all their NTFS source code under a F/LOSS license, there could be hope, but you shouldn't hold your breath. I won't be holding mine.
NTFS is a modern, journaled file system, just like ext3, ext4, xfs, jfs, and all the other journaled native file systems on Linux. It actually takes effort + luck to corrupt data in any of those file systems. Journaling makes that nearly impossible, unless there are other, hardware issues happening too.
The way disk errors begin is typically with slower performance or clearly corrupt directory listings or data inside files. My troubleshooting starts with the easy stuff first.
- I unmounted the file system and ran an fsck. That didn't find logical issues in the file system.
- Then I moved all the files in that file system to some spare storage, formatted it fresh, then copied the files back. This was impacted by the slow/poor performance, which led to ....
To get the files off without corruption, I ended up using ddrescue on each file. Left it running overnight. It was really just stuck on 1 of the 250 files. In the morning, it was still stuck, so I moved the "stuck file" to the end of the ddrescue script and started it over. All but the last file finished in about 10 minutes. The last file, which was about 2.8GB, took a few hours, but eventually finished.
- Only then did I run the SMART tests and see the "FAILING NOW" line. I usually pull HDDs from service before they fail.
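The per-file ddrescue pass described above can be sketched roughly like this. It is a minimal Python illustration of the ordering trick, not the script actually used here. It only builds the command lines: one ddrescue call per file, each with its own map file as the third argument so an interrupted copy can resume, and any file already known to hang pushed to the end. The function name, the paths, and the ".map" naming are assumptions made up for the example.

```python
from pathlib import PurePosixPath


def build_rescue_commands(files, dest_dir, stuck=()):
    """Return one ddrescue argument list per file, known-stuck files last.

    files    -- source file paths, in the order they would normally be copied
    dest_dir -- directory on the spare storage (hypothetical path)
    stuck    -- paths that hung on an earlier pass
    """
    stuck = set(stuck)
    # sorted() is stable: files that are not stuck keep their original
    # order and come first (False sorts before True).
    ordered = sorted(files, key=lambda path: path in stuck)

    commands = []
    for src in ordered:
        out = str(PurePosixPath(dest_dir) / PurePosixPath(src).name)
        # ddrescue takes: infile outfile [mapfile]
        commands.append(["ddrescue", src, out, out + ".map"])
    return commands


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    todo = ["/mnt/work/a.bin", "/mnt/work/b.bin", "/mnt/work/c.bin"]
    for cmd in build_rescue_commands(todo, "/mnt/spare", stuck=["/mnt/work/a.bin"]):
        print(" ".join(cmd))
```

Printing the commands instead of running them makes it easy to check the order before committing to an overnight run.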
BTW, I have excellent backups for most stuff. I don't back up "working areas" - places where files that need more processing are left. This file system is one of those "working areas". SMART testing is good for a number of reasons.
About 2 weeks ago, I saw some slowdowns for a relatively new (8 months) WD-Black HDD with a 5 yr warranty. I've been using that type of HDD for decades and never had any of them fail. Technically, the one that is failing now hasn't failed, yet, but I already have an approved RMA to return it. Here's the SMART report:
Code:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.15.0-101-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: WDC WD8002FZWX-00BKUA0
Serial Number: xxxxxxxxxxxxxx
LU WWN Device Id: x xxxxxx xxxxxxxxxx
Firmware Version: 02.01A02
User Capacity: 8,001,563,222,016 bytes [8.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Apr 7 18:19:03 2024 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
...
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 001 001 051 Pre-fail Always FAILING_NOW 87106
3 Spin_Up_Time 0x0027 161 115 021 Pre-fail Always - 10933
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 20
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 092 092 000 Old_age Always - 6133
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 20
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 14
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 7
194 Temperature_Celsius 0x0022 105 091 000 Old_age Always - 47
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 198 198 000 Old_age Offline - 1819
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 6013 -
# 2 Short offline Completed without error 00% 5832 -
# 3 Short offline Completed without error 00% 5664 -
# 4 Short offline Completed without error 00% 5497 -
# 5 Extended offline Completed without error 00% 5342 -
# 6 Short offline Completed without error 00% 5162 -
# 7 Short offline Completed without error 00% 4995 -
# 8 Short offline Completed without error 00% 4827 -
# 9 Extended offline Completed without error 00% 4670 -
#10 Short offline Completed without error 00% 4491 -
#11 Short offline Completed without error 00% 4323 -
#12 Short offline Completed without error 00% 4155 -
#13 Short offline Completed without error 00% 3988 -
#14 Extended offline Completed without error 00% 3832 -
#15 Short offline Completed without error 00% 3652 -
#16 Short offline Completed without error 00% 3484 -
#17 Short offline Completed without error 00% 3316 -
#18 Extended offline Completed without error 00% 3160 -
#19 Short offline Completed without error 00% 2981 -
#20 Short offline Completed without error 00% 2813 -
#21 Short offline Completed without error 00% 2645 -
There are 3 parts to the report.
a) Drive type, model, S/N, ..... yadda, yadda, yadda.
b) Details for the current test run
c) Historical data for prior test runs.
Looking at the current test results, everything looks fine except IDs: 1, 2, and 200. I've never seen a drive fail in that way. Usually all spare sectors run out and the Reallocated_Sector_Ct, Reallocated_Event_Count and the Current_Pending_Sector numbers are non-zero. That didn't happen for this failure.
I run short tests weekly and long tests monthly (the 1st week of each month), as you can see. Also, I retain those test logs for many weeks so as things degrade, I can see those changes from week to week. When something bad is noticed, SMARTmon lets me know, daily. The email looks like this:
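As a rough illustration of how to read the attribute table, here is a small Python sketch that parses saved smartctl output and flags two things: attributes whose normalized VALUE has dropped to or below THRESH (which is what smartctl reports as failed), and the sector-related attributes mentioned above when their raw counts are non-zero. The function names and the choice of watched IDs (5, 196, 197, 198) are assumptions for the example, and the column layout is assumed to match the report shown here.

```python
def parse_attributes(report_text):
    """Extract attribute rows from the text of a smartctl report."""
    rows = []
    for line in report_text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have a hex FLAG column.
        if len(parts) >= 10 and parts[0].isdigit() and parts[2].startswith("0x"):
            rows.append({
                "id": int(parts[0]),
                "name": parts[1],
                "value": int(parts[3]),
                "worst": int(parts[4]),
                "thresh": int(parts[5]),
                "when_failed": parts[8],
                "raw": parts[9],
            })
    return rows


def suspicious(rows, watch_raw=(5, 196, 197, 198)):
    """Return (id, name, reason) for attributes worth a closer look."""
    flagged = []
    for row in rows:
        if row["thresh"] > 0 and row["value"] <= row["thresh"]:
            flagged.append((row["id"], row["name"], "value at or below threshold"))
        elif row["id"] in watch_raw and row["raw"].isdigit() and int(row["raw"]) > 0:
            flagged.append((row["id"], row["name"], "non-zero raw count"))
    return flagged
```

Fed the report above, a check like this would flag only ID 1, since its VALUE of 001 is below the THRESH of 051 while the reallocated and pending sector counts are all zero.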
Code:
This message was generated by the smartd daemon running on:
host name: hadar
DNS domain: [Empty]
The following warning/error was logged by the smartd daemon:
Device: /dev/sda [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Device info:
WDC WD8002FZWX-00BKUA0, S/N:xxxxxxxxxxxxxx, WWN:x-xxxxxx-xxxxxxxxxx, FW:02.01A02, 8.00 TB
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Wed Apr 3 14:23:58 2024 EDT
Another message will be sent in 24 hours if the problem persists.
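Since the weekly test logs are retained, one way to spot degradation between them is to compare two saved reports attribute by attribute. The Python below is only a sketch under assumptions: it expects the same smartctl table layout as the report above, the function names are made up, and it skips IDs 9 and 194 (power-on hours and temperature) because those change between any two runs.

```python
def attribute_values(report_text):
    """Map attribute ID -> (name, normalized value, raw value) from a report."""
    values = {}
    for line in report_text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have a hex FLAG column.
        if len(parts) >= 10 and parts[0].isdigit() and parts[2].startswith("0x"):
            values[int(parts[0])] = (parts[1], int(parts[3]), parts[9])
    return values


def changes(old_report, new_report, ignore=(9, 194)):
    """List attributes whose normalized or raw value differs between reports."""
    old = attribute_values(old_report)
    new = attribute_values(new_report)
    diffs = []
    for attr_id in sorted(new):
        if attr_id in ignore or attr_id not in old:
            continue
        name, new_value, new_raw = new[attr_id]
        _, old_value, old_raw = old[attr_id]
        if (old_value, old_raw) != (new_value, new_raw):
            diffs.append((attr_id, name, old_value, new_value, old_raw, new_raw))
    return diffs
```

Each returned tuple holds the ID, the name, the old and new normalized values, and the old and new raw values, which is enough to see whether an attribute is drifting toward its threshold from one week to the next.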