Page 1 of 5
Results 1 to 10 of 42

Thread: Debugging NVIDIA NVRM Xid Errors

  1. #1
    Join Date
    Apr 2008
    Location
    UK
    Beans
    496
    Distro
    Ubuntu 12.04 Precise Pangolin

    Debugging NVIDIA NVRM Xid Errors

    Has anyone had any joy debugging these? I'd also like to get an idea of how many others are suffering from NVIDIA's Xid errors.

    Since I bought an Asus 9800GT, I've been having a series of entirely unpredictable hard crashes. With my previous 7600GS, I had no issues. The crashes can occur while doing anything, from switching between 3D applications and other video-intensive tasks to having almost no screen/CPU activity. They typically involve a loss of all keyboard control: magic keys do not work, though you can usually still move the cursor (to no effect). Some syslog examples:

    (driver: NVIDIA 177.82)

    Code:
    Mar 28 02:02:01 ace1 kernel: [177343.453159] NVRM: Xid (0001:00): 13, 0003 00000000 00008297 00001458 00000006 00000100
    
    Mar 31 05:03:49 ace1 kernel: [262758.392403] NVRM: Xid (0001:00): 13, 0003 00000000 00008297 00001458 00000006 00000100
    
    May 16 05:50:36 ace1 kernel: [2189363.352569] NVRM: Xid (0001:00): 8, Channel 00000001
    May 16 05:51:46 ace1 kernel: [2189428.300501] BUG: soft lockup - CPU#1 stuck for 61s! [Xorg:6006]
    May 16 05:51:46 ace1 kernel: [2189428.300501] Pid: 6006, comm: Xorg Tainted: P          (2.6.27-11-generic #1)
    May 16 05:51:46 ace1 kernel: [2189428.300501] EIP: 0060:[<f943e440>] EFLAGS: 00203293 CPU: 1
    May 16 05:51:46 ace1 kernel: [2189428.300501] EIP is at _nv009108rm+0x197/0x1a0 [nvidia]
    May 16 05:51:46 ace1 kernel: [2189428.300501] EAX: 5132d9f1 EBX: 00000000 ECX: f6215d94 EDX: 00046a00
    May 16 05:51:46 ace1 kernel: [2189428.300501] ESI: f6215dc8 EDI: 00000000 EBP: f6215d90 ESP: f5437cb4
    May 16 05:51:46 ace1 kernel: [2189428.300501]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
    May 16 05:51:46 ace1 kernel: [2189428.300501] CR0: 8005003b CR2: a6760000 CR3: 35ff5000 CR4: 00000690
    May 16 05:51:46 ace1 kernel: [2189428.300501] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    May 16 05:51:46 ace1 kernel: [2189428.300501] DR6: ffff0ff0 DR7: 00000400
    [soft lockup message + reg dump repeats every 60s until reboot]
    
    May 19 00:52:28 ace1 kernel: [  598.580534] NVRM: Xid (0001:00): 8, Channel 00000003
    May 19 00:52:41 ace1 kernel: [  611.584531] NVRM: Xid (0001:00): 8, Channel 00000001
    Given that I can apparently go for long periods (months) without any Xids, debugging these problems is terribly difficult. At the moment I'm back on my 7600GS following the 16th May hard crash, which I initially put down to HDD failure because of the extent of the crash and the messages from the mobo. Now, however, after some (frantic) coaxing, the HDD is back up and running, and reporting A-OK on everything from SMART status to surface scans and fs checks. Then I found the original Xid in the log (listed above) which caused the crash. Why my motherboard continues to warn of SMART failure on the HDD in question every X boots, I'm not sure. Perhaps the Xid did more damage than I can trace as yet.

    All NVIDIA seems to say, and all they ever ask for and complain about people not doing, is to run nvidia-bug-report.sh, which is of no real help. It just takes some log entries and other hardware info such as loaded modules, and sticks all the data in a file. The most useful thing in it is the Xid value(s), which are in syslog (and normally kern.log) anyway. If anyone can offer any advice that doesn't involve nvidia-bug-report, I'd be very grateful.
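
Since the Xid values are already sitting in syslog/kern.log, they can be pulled out directly without the bug-report script. What follows is only a minimal sketch, assuming the line layout seen in the excerpts above; the names `XID_RE`, `XidEvent` and `parse_xid_lines` are made up for illustration and are not part of any NVIDIA tool. The fields after the Xid number are kept as raw text because their meaning isn't documented in this thread.

```python
import re
from typing import NamedTuple

# Matches e.g. "[177343.453159] NVRM: Xid (0001:00): 13, 0003 00000000 ..."
XID_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+NVRM: Xid \((?P<dev>[0-9a-fA-F:]+)\): "
    r"(?P<code>\d+), (?P<detail>.*)$"
)

class XidEvent(NamedTuple):
    uptime: float   # seconds since boot, from the kernel timestamp
    device: str     # the "(0001:00)" part
    code: int       # Xid number, e.g. 8 or 13
    detail: str     # everything after the comma, left uninterpreted

def parse_xid_lines(lines):
    """Return an XidEvent for every line that contains an NVRM Xid report."""
    events = []
    for line in lines:
        m = XID_RE.search(line.rstrip("\n"))
        if m:
            events.append(
                XidEvent(float(m["uptime"]), m["dev"], int(m["code"]), m["detail"])
            )
    return events

sample = [
    "Mar 28 02:02:01 ace1 kernel: [177343.453159] NVRM: Xid (0001:00): 13, 0003 00000000 00008297 00001458 00000006 00000100",
    "May 16 05:50:36 ace1 kernel: [2189363.352569] NVRM: Xid (0001:00): 8, Channel 00000001",
    "May 16 05:51:46 ace1 kernel: [2189428.300501] BUG: soft lockup - CPU#1 stuck for 61s! [Xorg:6006]",
]
for ev in parse_xid_lines(sample):
    print(ev.code, ev.uptime, ev.detail)
```

To run it against a real log, pass an open file instead of `sample`, e.g. `parse_xid_lines(open("/var/log/kern.log"))`.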

    Otherwise, I'll be returning my 9800GT: I have no use for hardware+driver combinations that either a) wreck my system or b) fail to provide 3D acceleration.

    Bottom line: not impressed with my move to NVIDIA. Will likely return to ATI shortly.
    Ace1 FreeBSD/Gnome 2, i5 2300, 16GB, HX750W, 20TB ZFS pool, 60GB SSD, Fractal Design XL
    Ace2 Ubuntu/Xubuntu, i7 2600, 16GB, HX850W, 4TB, Asus HD6970, Fractal Design R3
    Ace3 Ubuntu/XFCE, E7200, 4GB, OCZ GameXStream 700W, 8TB

  2. #2
    Join Date
    Jan 2009
    Beans
    3
    Distro
    Kubuntu 8.10 Intrepid Ibex

    Re: Debugging NVIDIA NVRM Xid Errors

    Damn. Similar problems on similar channels. I'm running a BFG 6200oc PCI card on a desktop.

    dmesg | grep -i nv returns

    Code:
    [    0.000000]  BIOS-e820: 000000007fe70000 - 000000007fe72000 (ACPI NVS)
    [   17.531820] Simple Boot Flag value 0x87 read from CMOS RAM was invalid
    [   32.742260] nvidia: module license 'NVIDIA' taints kernel.
    [   33.665918] NVRM: loading NVIDIA UNIX x86 Kernel Module  169.12  Thu Feb 14 17:53:07 PST 2008
    [  877.468540] NVRM: Xid (0001:00): 8, Channel 00000003
    [  983.579839] NVRM: Xid (0001:00): 8, Channel 00000003
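
A case-insensitive match on "nv" also catches unrelated lines such as "ACPI NVS" and "invalid", as the output above shows. A rough sketch of a tighter filter, assuming the driver's own messages start with "NVRM:" or "nvidia:" as they do in this output; `driver_lines` is a hypothetical helper name.

```python
def driver_lines(dmesg_lines):
    """Keep only lines whose message starts with an NVIDIA driver prefix."""
    keep = []
    for line in dmesg_lines:
        # Drop the leading "[   33.665918] " timestamp, if present.
        msg = line.strip().split("] ", 1)[-1]
        if msg.startswith(("NVRM:", "nvidia:")):
            keep.append(line)
    return keep

dmesg_sample = [
    "[    0.000000]  BIOS-e820: 000000007fe70000 - 000000007fe72000 (ACPI NVS)",
    "[   17.531820] Simple Boot Flag value 0x87 read from CMOS RAM was invalid",
    "[   32.742260] nvidia: module license 'NVIDIA' taints kernel.",
    "[   33.665918] NVRM: loading NVIDIA UNIX x86 Kernel Module  169.12  Thu Feb 14 17:53:07 PST 2008",
    "[  877.468540] NVRM: Xid (0001:00): 8, Channel 00000003",
    "[  983.579839] NVRM: Xid (0001:00): 8, Channel 00000003",
]
for line in driver_lines(dmesg_sample):
    print(line)
```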

  3. #3
    Join Date
    Aug 2007
    Beans
    4

    Re: Debugging NVIDIA NVRM Xid Errors

    I have an Asus P5N7A-VM with an NVIDIA 9300.

    I get the same: it freezes and I must reboot.

    Jun 15 20:07:20 NightFlyer NetworkManager: <info> Activation (wlan0) successful, device activated.
    Jun 15 20:07:20 NightFlyer NetworkManager: <info> Activation (wlan0) Stage 5 of 5 (IP Configure Commit) complete.
    Jun 15 20:07:20 NightFlyer ntpdate[4114]: adjust time server 91.189.94.4 offset 0.433523 sec
    Jun 15 20:07:26 NightFlyer kernel: [ 54.536036] wlan0: no IPv6 routers present
    Jun 15 20:08:36 NightFlyer kernel: [ 123.868071] NVRM: Xid (0003:00): 13, 0001 00000000 0000502d 00000104 00000000 00000100
    Jun 15 20:08:36 NightFlyer kernel: [ 123.884637] NVRM: Xid (0003:00): 13, 0003 00000000 00008397 00001408 00000001 00000040
    Jun 15 20:08:37 NightFlyer kernel: [ 124.945726] NVRM: Xid (0003:00): 13, 0003 00000000 00008397 000015e0 00000000 00000040
    Jun 15 20:08:37 NightFlyer kernel: [ 124.960615] NVRM: Xid (0003:00): 13, 0001 00000000 0000502d 00000104 00000000 00000100
    Jun 15 20:08:37 NightFlyer kernel: [ 125.106931] NVRM: Xid (0003:00): 13, 0001 00000000 0000502d 00000104 00000000 00000100
    Jun 15 20:09:53 NightFlyer syslogd 1.5.0#5ubuntu3: restart.
    Jun 15 20:09:53 NightFlyer kernel: Inspecting /boot/System.map-2.6.29-020629-generic
    Jun 15 20:09:53 NightFlyer kernel: Cannot find map file.
    Jun 15 20:09:53 NightFlyer kernel: Loaded 82768 symbols from 52 modules.
    Jun 15 20:09:53 NightFlyer kernel: [ 0.000000] Initializing cgroup subs

  4. #4
    Join Date
    Jul 2009
    Beans
    1

    Re: Debugging NVIDIA NVRM Xid Errors

    I'm also experiencing this, on a Dell Precision M4400 Laptop.
    Graphics adapter: nVidia Corporation Quadro FX 770M

    Sometimes I get the complete kernel panic, with keyboard LEDs blinking, and then nothing in the logs. Sometimes I get this kind of soft problem, as below, where the graphics driver just seems to crash and then recover after a while (no reboot involved in the below):



    Code:
    Jul  6 12:09:24 nibbler Synergy 1.3.1: NOTE: CClientProxy1_0.cpp,221: client "kempe" is dead
    Jul  6 12:09:27 nibbler kernel: [45533.372670] NVRM: Xid (0001:00): 8, Channel 00000003
    Jul  6 12:09:27 nibbler Synergy 1.3.1: NOTE: CClientListener.cpp,127: accepted client connection
    Jul  6 12:09:27 nibbler Synergy 1.3.1: NOTE: CServer.cpp,278: client "kempe" has connected
    Jul  6 12:10:01 nibbler /USR/SBIN/CRON[22437]: (root) CMD ([ -x /usr/sbin/update-motd ] && /usr/sbin/update-motd 2>/dev/null)
    Jul  6 12:10:59 nibbler kernel: [45625.388711] NVRM: Xid (0001:00): 8, Channel 00000001
    Jul  6 12:10:59 nibbler kernel: [45625.427060] NVRM: Xid (0001:00): 13, 0003 00000000 00008297 00000f04 00000000 00000040
    Jul  6 12:10:59 nibbler kernel: [45625.428235] NVRM: Xid (0001:00): 13, 0003 00000000 00008297 00000f04 00000000 00000040
    Jul  6 12:10:59 nibbler kernel: [45625.650573] NVRM: Xid (0001:00): 13, 0003 00000000 00008297 000015e0 00000000 00000040
    Jul  6 12:11:15 nibbler kernel: [45641.988185] NVRM: Xid (0001:00): 13, 0003 00000000 00008297 000015e0 00000000 00000040
    Jul  6 12:11:16 nibbler kernel: [45642.269826] NVRM: Xid (0001:00): 13, 0003 00000000 00008297 000015e0 00000000 00000040
    Jul  6 12:11:36 nibbler kernel: [45662.649087] NVRM: Xid (0001:00): 13, 0003 00000000 00008297 00001310 00000000 00000040
    Jul  6 12:13:02 nibbler Synergy 1.3.1: NOTE: CClientProxy1_0.cpp,221: client "kempe" is dead
    Jul  6 12:13:06 nibbler Synergy 1.3.1: NOTE: CClientListener.cpp,127: accepted client connection
    Jul  6 12:13:06 nibbler Synergy 1.3.1: NOTE: CServer.cpp,278: client "kempe" has connected
    Jul  6 12:13:58 nibbler acpid: client 3186[0:0] has disconnected 
    Jul  6 12:13:58 nibbler acpid: client 3186[0:0] has disconnected 
    Jul  6 12:16:00 nibbler acpid: client connected from 3186[0:0] 
    Jul  6 12:16:01 nibbler acpid: client connected from 3186[0:0] 
    Jul  6 12:17:01 nibbler /USR/SBIN/CRON[22996]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
    Jul  6 12:20:01 nibbler /USR/SBIN/CRON[23197]: (root) CMD ([ -x /usr/sbin/update-motd ] && /usr/sbin/update-motd 2>/dev/null)

  5. #5
    Join Date
    May 2009
    Beans
    6
    Distro
    Ubuntu 9.04 Jaunty Jackalope

    Re: Debugging NVIDIA NVRM Xid Errors

    I'm running Ubuntu 9.04 Jaunty Jackalope on an EVGA nForce 680i motherboard with 2x2 GB SLI-ready Corsair Dominator RAM, an Intel Q9400 Core 2 Quad @ 2.66 GHz and a single EVGA 8800GTS nvidia card, and this is my story.

    I had a stable Ubuntu 9.04 system running until sometime around mid July. I'm not sure if the problems were due to an automatic update to the Ubuntu kernels, but I didn't actively change the configuration in any way that would cause my system to begin crashing. I am experiencing similar symptoms to the previous posters; specifically, my screen would corrupt, crash, and require a reboot. I have tried a few things in an attempt to fix the situation, including:

    - reinstalling the nvidia 185.18.29 drivers
    - installing the nvidia 185.18.31 drivers
    - installing the nvidia 190.10 drivers
    - reinstalling xorg
    - reverting nvidia drivers back to 185.18.29
    - reverting the nvidia drivers back to 180.60
    - adding 'Option "NvAGP" "1"' to my xorg.conf file
    - changing 'Option "NvAGP" "1"' to 'Option "NvAGP" "0"' (system won't boot)

    Currently, my system is meta-stable, in that it will run for extended periods of time with only the occasional screen flicker (which might be a re-upload of the kernel to the graphics card?). The split-second black screens are usually accompanied by an entry of the form NVRM: Xid 6 / 13 in the kern.log file. Does anyone have any leads on how to pinpoint the cause of these failures? I've been troubleshooting this for days now with seemingly no progress. For completeness' sake, below are some useful outputs, in case anyone can understand the error codes contained within, or happens to notice a problematic statement in the configuration. Any leads or help would be greatly appreciated. Again, the system was stable, with no active changes on my part, until "something happened". I did install the CUDA libraries around that time and ran some examples, but I can't imagine that any of that could have actually changed the configuration of the GPU kernel.

    xorg.conf device section:

    Section "Device"
    Identifier "Device0"
    Driver "nvidia"
    VendorName "NVIDIA Corporation"
    Option "NvAGP" "1"
    EndSection

    cat /var/log/Xorg.0.log | grep -i agp:

    (**) NVIDIA(0): Option "NvAGP" "1"
    (**) Aug 18 10:42:52 NVIDIA(0): Use of NVIDIA internal AGP requested

    cat /var/log/Xorg.0.log | grep -i gart:

    (II) Aug 18 10:42:54 NVIDIA(0): Initialized GPU GART.

    cat /var/log/kern.log | grep -i nv: (multiple instances similar to the following):

    [ 0.000000] BIOS-e820: 000000007fef0000 - 000000007fef3000 (ACPI NVS)
    [ 0.000000] modified: 000000007fef0000 - 000000007fef3000 (ACPI NVS)
    [ 0.000000] ACPI: RSDP 000F7FA0, 0014 (r0 Nvidia)
    [ 0.000000] ACPI: RSDT 7FEF3040, 0038 (r1 Nvidia NVDAACPI 42302E31 NVDA 0)
    [ 0.000000] ACPI: FACP 7FEF30C0, 0074 (r1 Nvidia NVDAACPI 42302E31 NVDA 0)
    [ 0.000000] ACPI: DSDT 7FEF3180, 541D (r1 NVIDIA NVDAACPI 1000 MSFT 3000000)
    [ 0.000000] ACPI: HPET 7FEF8700, 0038 (r1 Nvidia NVDAACPI 42302E31 NVDA 98)
    [ 0.000000] ACPI: WDRT 7FEF8780, 0047 (r1 Nvidia NVDAACPI 42302E31 NVDA 0)
    [ 0.000000] ACPI: MCFG 7FEF8840, 003C (r1 Nvidia NVDAACPI 42302E31 NVDA 0)
    [ 0.000000] ACPI: APIC 7FEF8600, 0098 (r1 Nvidia NVDAACPI 42302E31 NVDA 0)
    [ 1.561597] sata_nv 0000:00:0e.0: version 3.5
    [ 1.561838] sata_nv 0000:00:0e.0: PCI INT A -> Link[ASA0] -> GSI 21 (level, low) -> IRQ 21
    [ 1.561840] sata_nv 0000:00:0e.0: Using SWNCQ mode
    [ 1.562044] sata_nv 0000:00:0e.0: setting latency timer to 64
    [ 1.562163] scsi0 : sata_nv
    [ 1.562249] scsi1 : sata_nv
    [ 3.026587] sata_nv 0000:00:0e.1: PCI INT B -> Link[ASA1] -> GSI 20 (level, low) -> IRQ 20
    [ 3.026590] sata_nv 0000:00:0e.1: Using SWNCQ mode
    [ 3.026614] sata_nv 0000:00:0e.1: setting latency timer to 64
    [ 3.026696] scsi2 : sata_nv
    [ 3.026734] scsi3 : sata_nv
    [ 4.758723] sata_nv 0000:00:0e.2: PCI INT C -> Link[ASA2] -> GSI 21 (level, low) -> IRQ 21
    [ 4.758725] sata_nv 0000:00:0e.2: Using SWNCQ mode
    [ 4.758749] sata_nv 0000:00:0e.2: setting latency timer to 64
    [ 4.758832] scsi4 : sata_nv
    [ 4.758902] scsi5 : sata_nv
    [ 6.668360] ata7: nv_mode_filter: 0x739f&0x701f->0x701f, BIOS=0x7000 (0xc0c00000) ACPI=0x701f (60:60:0x1f)
    [ 6.668363] ata7: nv_mode_filter: 0x739f&0x701f->0x701f, BIOS=0x7000 (0xc0c00000) ACPI=0x701f (60:60:0x1f)
    [ 6.800136] rtc0: alarms up to one year, y3k, 114 bytes nvram, hpet irqs
    [ 6.958988] nv_probe: set workaround bit for reversed mac addr
    [ 7.478772] nv_probe: set workaround bit for reversed mac addr
    [ 9.961987] generic-usb 0003:051D:0002.0003: hiddev96,hidraw2: USB HID v1.10 Device [American Power Conversion Back-UPS XS 1500 LCD FW:837.H5 .D USB FW:H5 ] on usb-0000:00:0b.0-3/input0
    [ 24.831822] nvidia: module license 'NVIDIA' taints kernel.
    [ 25.084172] nvidia 0000:01:00.0: PCI INT A -> Link[AXV5] -> GSI 16 (level, low) -> IRQ 16
    [ 25.084177] nvidia 0000:01:00.0: setting latency timer to 64

    [ 25.084304] NVRM: loading NVIDIA UNIX x86 Kernel Module 190.18 Wed Jul 22 18:30:32 PDT 2009
    [ 2754.441266] NVRM: Xid (0001:00): 6, PE0001
    [ 2893.097535] NVRM: Xid (0001:00): 13, 0003 00000000 00005097 000019c4 00040000 00000005
    [ 2893.101190] NVRM: Xid (0001:00): 13, 0003 00000000 00005097 000019c4 00040000 00000005
    [ 2893.899461] NVRM: Xid (0001:00): 13, 0001 00000000 00005097 00000200 000c0000 0000000c
    [ 2893.903119] NVRM: Xid (0001:00): 13, 0001 00000000 00005097 00000200 000c0000 0000000c
    [ 3129.872050] NVRM: Xid (0001:00): 13, 0001 00000000 00005097 00001458 00ff0001 00000003
    [ 3129.875691] NVRM: Xid (0001:00): 13, 0001 00000000 00005097 00001458 00ff0001 00000003
    [ 3301.118756] NVRM: Xid (0001:00): 6, PE0003
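
To see at a glance which Xid numbers dominate a listing like the one above, the lines can be tallied by code. This is just an illustrative sketch: `tally_xid_codes` is a made-up helper, and it only counts occurrences without interpreting what each Xid means.

```python
import re
from collections import Counter

XID_CODE_RE = re.compile(r"NVRM: Xid \([^)]*\): (\d+),")

def tally_xid_codes(lines):
    """Count how many times each Xid number appears in the given log lines."""
    counts = Counter()
    for line in lines:
        m = XID_CODE_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts

kern_log = """\
[ 2754.441266] NVRM: Xid (0001:00): 6, PE0001
[ 2893.097535] NVRM: Xid (0001:00): 13, 0003 00000000 00005097 000019c4 00040000 00000005
[ 2893.101190] NVRM: Xid (0001:00): 13, 0003 00000000 00005097 000019c4 00040000 00000005
[ 2893.899461] NVRM: Xid (0001:00): 13, 0001 00000000 00005097 00000200 000c0000 0000000c
[ 2893.903119] NVRM: Xid (0001:00): 13, 0001 00000000 00005097 00000200 000c0000 0000000c
[ 3129.872050] NVRM: Xid (0001:00): 13, 0001 00000000 00005097 00001458 00ff0001 00000003
[ 3129.875691] NVRM: Xid (0001:00): 13, 0001 00000000 00005097 00001458 00ff0001 00000003
[ 3301.118756] NVRM: Xid (0001:00): 6, PE0003
""".splitlines()

for code, n in sorted(tally_xid_codes(kern_log).items()):
    print(f"Xid {code}: {n} occurrence(s)")
```
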
    Last edited by Ostomizer; August 18th, 2009 at 08:02 PM.

  6. #6
    Join Date
    May 2009
    Beans
    6
    Distro
    Ubuntu 9.04 Jaunty Jackalope

    Re: Debugging NVIDIA NVRM Xid Errors

    In case anyone is having this problem, I suspect it may be related to voltage supplied to the video card. To recap, the computer experiencing the display corruption / "nvrm xid" errors is composed of:

    • 7200 RPM 320 GB and 1.5 TB Seagate Barracuda drives

    I hadn't experienced any display corruption on this computer until after I had 1) installed Ubuntu and 2) upgraded my core 2 duo to a core 2 quad. I've recently noticed that the large amount of noise coming from my power supply is actually a periodic high-pitched whining / beating noise, and that it only begins to make this noise after the computer has been on for at least a short while, after having been off for a long time. If I soft reboot the computer, the noise does not go away. Based on the Newegg reviews of the case & power supply combo I bought about 2 years ago, the power supply seems to be bad quality. I have just ordered a CORSAIR CMPSU-620HX 620W ATX12V v2.2, and will report back on whether or not this solves my display corruption issue. I have never known much about, or discriminated much among available power supplies, but after having read some threads elsewhere about GPU undervoltage causing graphical display corruption, "bad power supply" seems like a plausible root cause. For anyone actually stumbling across this thread trying to solve the same problem, I'll report back here after the new supply arrives.

  7. #7
    Join Date
    May 2009
    Beans
    6
    Distro
    Ubuntu 9.04 Jaunty Jackalope

    Re: Debugging NVIDIA NVRM Xid Errors

    Installed the new supply today. No more beeping, runs smooth as butter.

  8. #8
    Join Date
    May 2009
    Beans
    6
    Distro
    Ubuntu 9.04 Jaunty Jackalope

    Re: Debugging NVIDIA NVRM Xid Errors

    Unfortunately,

    Sep 3 08:42:42 desktop kernel: [231637.717161] NVRM: Xid (0001:00): 2, CCMDs 00000003 00005072 00000174 04700438 00000006
    Sep 3 08:44:23 desktop kernel: [231739.444257] NVRM: Xid (0001:00): 6, PE0003
    Sep 3 08:44:23 desktop kernel: [231739.452618] NVRM: Xid (0001:00): 6, PE0001
    Sep 3 08:53:35 desktop kernel: [232291.393932] NVRM: Xid (0001:00): 6, PE0001
    Sep 3 08:53:40 desktop kernel: [232295.678599] NVRM: Xid (0001:00): 6, PE0003
    Sep 3 09:34:19 desktop kernel: [234734.771551] NVRM: Xid (0001:00): 6, PE0003
    Sep 3 09:35:26 desktop kernel: [234802.323266] NVRM: Xid (0001:00): 13, 0003 00000000 00005097 0000192c 00d80001 0000000c
    Sep 3 09:35:26 desktop kernel: [234802.326910] NVRM: Xid (0001:00): 13, 0003 00000000 00005097 0000192c 00d80001 0000000c

    new power supply didn't fix the problem. It only occurs after the computer's been up and running for a while (1+ days), so maybe it's a heating issue. Any suggestions would be mighty welcome.
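
One way to check the "only after 1+ days" idea against the logs is to convert the bracketed kernel timestamp, which counts seconds since boot, into days and hours. A small sketch; `uptime_to_dhm` is a hypothetical helper, and the two sample values are taken from kernel timestamps quoted earlier in this thread.

```python
def uptime_to_dhm(seconds):
    """Convert seconds since boot to a (days, hours, minutes) tuple."""
    total_minutes = int(seconds) // 60
    days, rem = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rem, 60)
    return days, hours, minutes

for stamp in (231637.717161, 2754.441266):
    d, h, m = uptime_to_dhm(stamp)
    print(f"[{stamp}] -> {d}d {h}h {m}m after boot")
```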

  9. #9
    Join Date
    Feb 2008
    Location
    Italy
    Beans
    34
    Distro
    Ubuntu 10.10 Maverick Meerkat

    Re: Debugging NVIDIA NVRM Xid Errors

    I've had the same problem for a few days. The error mainly comes up when I'm running sdlmame and try to go fullscreen, so it could be linked to the SDL layer.
    I tried NvAGP = 1, 2 or 3, but in /proc/driver/nvidia/registry the NvAGP option is always set to 3, which means the system tries to open the kernel AGP GART and, if that fails, opens the NVIDIA GART.
    I'm stuck
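
For checking which NvAGP value is actually in effect, here is a rough sketch that reads it out of text shaped like /proc/driver/nvidia/registry. It assumes that file uses "Key: value" lines, which is an assumption and may not hold for every driver version; `read_nvagp` and the sample text are hypothetical. The meanings for 1 and 3 come from this thread, while 0 and 2 are from memory of the NVIDIA driver README and should be double-checked.

```python
# Meanings of 1 and 3 are taken from this thread; 0 and 2 are recalled from
# the NVIDIA driver README and are an unverified assumption here.
NVAGP_MODES = {
    0: "AGP disabled",
    1: "NVIDIA internal AGP",
    2: "kernel AGPGART",
    3: "try kernel AGPGART, fall back to NVIDIA internal AGP",
}

def read_nvagp(registry_text):
    """Return the NvAGP value from 'Key: value' style text, or None if absent."""
    for line in registry_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "NvAGP":
            return int(value.strip())
    return None

sample_registry = "SomeOtherKey: 0\nNvAGP: 3\n"  # hypothetical excerpt
mode = read_nvagp(sample_registry)
print(mode, "->", NVAGP_MODES.get(mode, "unknown"))
```

On a live system the text would come from `open("/proc/driver/nvidia/registry").read()`.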

    Gianluca
    Be free! Use your mind!

  10. #10
    Join Date
    May 2009
    Beans
    6
    Distro
    Ubuntu 9.04 Jaunty Jackalope

    Re: Debugging NVIDIA NVRM Xid Errors

    Googling turned up more information in other forums that pointed to dual-channel RAM causing this type of problem. I went into my BIOS, disabled the "SLI-ENABLED" setting, and I haven't had the problem since. Do you have an SLI memory / dual-channel setup?
