Debugging NVIDIA NVRM Xid Errors
Has anyone had any joy in trying to debug these? I'd also like to get an idea of how many others are suffering from NVIDIA's Xid errors.
Since I bought an Asus 9800GT, I've been having a series of (entirely) unpredictable hard crashes. With my previous 7600GS, I had no issues. The crashes can occur during anything, from switching between 3D applications and other video-intensive tasks to having almost no screen/CPU activity. They typically involve a loss of all keyboard control: the magic (SysRq) keys do not work, although you can usually still move the cursor (to no effect). Some syslog examples:
(driver: NVIDIA 177.82)
Code:
Mar 28 02:02:01 ace1 kernel: [177343.453159] NVRM: Xid (0001:00): 13, 0003 00000000 00008297 00001458 00000006 00000100
Mar 31 05:03:49 ace1 kernel: [262758.392403] NVRM: Xid (0001:00): 13, 0003 00000000 00008297 00001458 00000006 00000100
May 16 05:50:36 ace1 kernel: [2189363.352569] NVRM: Xid (0001:00): 8, Channel 00000001
May 16 05:51:46 ace1 kernel: [2189428.300501] BUG: soft lockup - CPU#1 stuck for 61s! [Xorg:6006]
May 16 05:51:46 ace1 kernel: [2189428.300501] Pid: 6006, comm: Xorg Tainted: P (2.6.27-11-generic #1)
May 16 05:51:46 ace1 kernel: [2189428.300501] EIP: 0060:[<f943e440>] EFLAGS: 00203293 CPU: 1
May 16 05:51:46 ace1 kernel: [2189428.300501] EIP is at _nv009108rm+0x197/0x1a0 [nvidia]
May 16 05:51:46 ace1 kernel: [2189428.300501] EAX: 5132d9f1 EBX: 00000000 ECX: f6215d94 EDX: 00046a00
May 16 05:51:46 ace1 kernel: [2189428.300501] ESI: f6215dc8 EDI: 00000000 EBP: f6215d90 ESP: f5437cb4
May 16 05:51:46 ace1 kernel: [2189428.300501] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
May 16 05:51:46 ace1 kernel: [2189428.300501] CR0: 8005003b CR2: a6760000 CR3: 35ff5000 CR4: 00000690
May 16 05:51:46 ace1 kernel: [2189428.300501] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
May 16 05:51:46 ace1 kernel: [2189428.300501] DR6: ffff0ff0 DR7: 00000400
[soft lockup message + reg dump repeats every 60s until reboot]
May 19 00:52:28 ace1 kernel: [ 598.580534] NVRM: Xid (0001:00): 8, Channel 00000003
May 19 00:52:41 ace1 kernel: [ 611.584531] NVRM: Xid (0001:00): 8, Channel 00000001
Given that I can apparently go for long periods (months) without any Xids, debugging these problems is terribly difficult. At the moment I'm back on my 7600GS following the 16th May hard crash, which I initially put down to HDD failure because of the severity of the crash and the messages from the mobo. Now, however, after some (frantic) coaxing, the HDD is back up and running, and reporting A-OK on everything from SMART status to surface scans and fs checks. I then found the original Xid in the log (listed above) which caused the crash. Why my motherboard continues to warn of SMART failure on the HDD in question every X boots, I'm not sure. Perhaps the Xid did more damage than I have been able to trace so far.
All NVIDIA seems to say, and all they ever ask for (and complain about people not doing), is to run nvidia-bug-report.sh, which is of no real help. It just takes some log entries and other hw info such as loaded modules, and sticks all the data in a file. The most useful thing in it is the Xid value(s), which are in syslog (and normally kern.log) anyway. If anyone can offer any advice that doesn't involve nvidia-bug-report, I'd be very grateful.
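For anyone who wants to pull the Xid lines straight out of a log for comparison, here is a minimal Python sketch. It assumes the messages look like the ones quoted above ("NVRM: Xid (device): number, details") and that the log lives at /var/log/kern.log, which is only a guess at a common default; the function names parse_xid_lines and count_by_xid are made up for illustration. It only tallies what is already in the log and does not decode what an Xid number means.

```python
import re
import sys
from collections import Counter

# Pattern based on the log lines quoted in this post, e.g.
#   NVRM: Xid (0001:00): 13, 0003 00000000 00008297 ...
# Other driver versions may format these lines differently (assumption).
XID_RE = re.compile(r"NVRM: Xid \(([0-9a-fA-F:.]+)\): (\d+),\s*(.*)")


def parse_xid_lines(lines):
    """Return (device, xid_number, details) for every line containing an Xid."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            device, number, details = match.groups()
            events.append((device, int(number), details.strip()))
    return events


def count_by_xid(events):
    """Count how many times each Xid number appears."""
    return Counter(number for _, number, _ in events)


if __name__ == "__main__":
    # Default path is an assumption; pass your own log file as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/kern.log"
    with open(path, errors="replace") as handle:
        found = parse_xid_lines(handle)
    for number, count in sorted(count_by_xid(found).items()):
        print(f"Xid {number}: {count} occurrence(s)")
```

Run it as "python xid_count.py /path/to/log" (the script name is arbitrary). Posting the resulting counts alongside your card and driver version would make it easier to see who is hitting the same Xid numbers.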
Otherwise, I'll be returning my 9800GT: I have no use for hardware+driver combinations that either a) wreck my system or b) fail to provide 3D acceleration.
Bottom line: not impressed with my move to NVIDIA. Will likely return to ATI shortly.
Ace1 FreeBSD/Gnome 2, i5 2300, 16GB, HX750W, 20TB ZFS pool, 60GB SSD, Fractal Design XL
Ace2 Ubuntu/Xubuntu, i7 2600, 16GB, HX850W, 4TB, Asus HD6970, Fractal Design R3
Ace3 Ubuntu/XFCE, E7200, 4GB, OCZ GameXStream 700W, 8TB