Has the OP posted the system logs showing problems? For the last 3 weeks or so, my 18.04 VM host has had some stability issues. I've narrowed it down to either the kernel series (it happens with more than 1 kernel) or bad RAM.
On my system, after a crash, I can see the stack traces from earlier boots in the logs using journalctl -b -N, for example journalctl -b -2 for the boot two back. The failures look like this:
Code:
Jul 26 23:43:05 hadar kernel: Hardware name: System manufacturer System Product Name/ROG STRIX B450-F
Jul 26 23:43:05 hadar kernel: Call Trace:
Jul 26 23:43:05 hadar kernel: dump_stack+0x6d/0x8b
Jul 26 23:43:05 hadar kernel: bad_page+0xcb/0x120
Jul 26 23:43:05 hadar kernel: free_pages_check_bad+0x5f/0x70
Jul 26 23:43:05 hadar kernel: free_pcppages_bulk+0x472/0x6d0
Jul 26 23:43:05 hadar kernel: ? page_counter_cancel+0x23/0x30
Jul 26 23:43:05 hadar kernel: free_unref_page_commit+0xb9/0xe0
Jul 26 23:43:05 hadar kernel: free_unref_page_list+0x108/0x190
Jul 26 23:43:05 hadar kernel: shrink_page_list+0x379/0xbb0
Jul 26 23:43:05 hadar kernel: shrink_inactive_list+0x204/0x3d0
Jul 26 23:43:05 hadar kernel: shrink_node_memcg+0x3b4/0x820
Jul 26 23:43:05 hadar kernel: shrink_node+0xb5/0x410
Jul 26 23:43:05 hadar kernel: ? shrink_node+0xb5/0x410
Jul 26 23:43:05 hadar kernel: balance_pgdat+0x293/0x5f0
Jul 26 23:43:05 hadar kernel: kswapd+0x156/0x3c0
Jul 26 23:43:05 hadar kernel: ? wait_woken+0x80/0x80
Jul 26 23:43:05 hadar kernel: kthread+0x121/0x140
Jul 26 23:43:05 hadar kernel: ? balance_pgdat+0x5f0/0x5f0
Jul 26 23:43:05 hadar kernel: ? kthread_park+0x90/0x90
Jul 26 23:43:05 hadar kernel: ret_from_fork+0x22/0x40
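To pull just the traces out of a long journal, something like the following may help. It is only a sketch: show_traces is a made-up helper name, the 20-line window is arbitrary, and it assumes each trace starts with a "Call Trace:" line as in the log above.

```shell
# Sketch: print every "Call Trace:" line plus the lines that follow it.
# show_traces is a hypothetical helper; $1 is the number of trailing lines (default 20).
show_traces() {
    grep -A "${1:-20}" 'Call Trace:'
}

# Intended use (needs journalctl; -k limits the output to kernel messages):
#   journalctl -k -b -2 | show_traces
```

The helper reads from stdin, so it works the same way on a saved log file.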
So, I wrote a tiny script to see how often the issue happens on each boot:
Code:
$ ~/bin/crashing
Boot -0 : 1
Boot -1 : 5
Boot -2 : 8
Boot -3 : 1
Boot -4 : 93
Boot -5 : 223
Boot -6 : 596
Boot -7 : 258
Boot -8 : 383
Boot -9 : 0
0 or 1 means no problem. Higher numbers show stack problems. Here's the script, but I doubt it will work for others as is:
Code:
$ more ~/bin/crashing
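#!/bin/bash
# Count log lines containing 'dump_stack' for each of the last 10 boots.
for i in {0..9}; do
    # -b -N selects the Nth boot before the current one (-0 is the current boot)
    RC=$(journalctl -b "-$i" | grep -c 'dump_stack')
    echo "Boot -$i : $RC"
done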
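A variant that might travel better to other machines is sketched below. Instead of assuming ten boots exist, it asks journalctl which boots are still in the journal. The function names (boot_offsets, count_dumps, crash_report) are made up for illustration, and it assumes the first column of journalctl --list-boots is the boot offset.

```shell
# Sketch only: boot_offsets, count_dumps and crash_report are hypothetical names.

# Read `journalctl --list-boots` text on stdin and print the boot offsets (first column).
# The numeric filter skips the header row that newer systemd versions print.
boot_offsets() {
    awk '$1 ~ /^-?[0-9]+$/ { print $1 }'
}

# Read log text on stdin and print how many lines mention dump_stack.
count_dumps() {
    grep -c 'dump_stack'
}

# Print one line per boot that the journal still holds.
crash_report() {
    journalctl --list-boots | boot_offsets | while read -r off; do
        echo "Boot $off : $(journalctl -b "$off" </dev/null | count_dumps)"
    done
}
```

Running crash_report prints the same kind of table as the script above, but only for boots that actually exist in the journal.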
Three boots ago I lowered the DDR4 RAM speed from 2800 to 2733, and that ran without issue for about 10 days. Then a few days ago, even at 2733MHz, the system began crashing with a full lockup. Switching to a different TTY or connecting over ssh was not possible, so I had to use the reset button. Boots -2 and -1 were at the same 2733MHz, but on the last reboot I slowed the RAM again to 2666MHz (I think).
Code:
$ sudo inxi -m
Memory: Used/Total: 15719.5/32108.7MB
Array-1 capacity: 128 GB devices: 4 EC: None
Device-1: DIMM_A1 size: 8 GB speed: 2666 MT/s type: DDR4
Device-2: DIMM_A2 size: 8 GB speed: 2666 MT/s type: DDR4
Device-3: DIMM_B1 size: 8 GB speed: 2666 MT/s type: DDR4
Device-4: DIMM_B2 size: 8 GB speed: 2666 MT/s type: DDR4
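To double-check that every DIMM reports the same speed after a BIOS change, the inxi output can be summarized. This is a rough sketch: dimm_speeds is a made-up helper, and it assumes the output has "speed: NNNN" pairs as in the lines above. Other inxi versions may format this differently.

```shell
# Sketch: count how many DIMMs report each speed.
# dimm_speeds is a hypothetical helper; it reads `inxi -m` style text on stdin
# and prints "count speed" pairs, one line per distinct speed.
dimm_speeds() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "speed:") print $(i + 1) }' | sort -n | uniq -c
}

# Intended use:
#   sudo inxi -m | dimm_speeds
```

A single output line means all sticks agree. More than one line means the DIMMs are running at mixed speeds.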
The box has been very busy the last few days. It has been up about 22 hrs now, which is good. On Saturday, during a maintenance window, I'll reseat the RAM, swap some paired DIMMs around in the slots, put the speed back up to 2933MHz and see if that helps. If not, I'll take 2 of the sticks out, since the machine has 2 pairs which weren't bought as a matched set. 16G is a little tight for this system, but if I'm careful on RAM use, it should not be an issue.
Code:
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          32108       15222        6836          41       10049       16382
Swap:          4355         119        4236
My current RAM use is just on the 16G cusp. I can halve the RAM allocated to one of the VMs and not power on another, which should keep it around 12G used. I can move one or two VMs to a different system too.
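The headroom arithmetic can be checked straight from the free -m output. The sketch below uses a made-up helper, headroom_mb, and treats 16G as a nominal 16384 MB. The usable total with two sticks would be somewhat lower, so take the result as an upper bound.

```shell
# Sketch: how many MB of the "used" figure would still fit under a RAM limit.
# headroom_mb is a hypothetical helper; it reads `free -m` output on stdin.
# $1 is the limit in MB (default 16384, a nominal 16G).
headroom_mb() {
    awk -v limit="${1:-16384}" '$1 == "Mem:" { print limit - $3 }'
}

# Intended use:
#   free -m | headroom_mb 16384
```

With the numbers above (15222 MB used) this prints 1162, so there is only about 1.1G to spare before any VMs are shrunk or moved.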
Should say, this VM host has been fairly stable for 2.5 yrs. SMART data for the connected storage is all fine. No bad blocks or any reallocated blocks at all. That was the first thing I checked. Weekly SMART tests run automatically and get logged on all my storage.
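For anyone wanting to do the same storage check, here is a minimal sketch. It assumes ATA-style smartctl -A output where the raw value is the 10th column. NVMe drives report differently. smart_bad_counts is a made-up helper name and /dev/sda is only an example device.

```shell
# Sketch: print the raw values of the SMART attributes that indicate bad blocks.
# smart_bad_counts is a hypothetical helper; it reads `smartctl -A` output on stdin
# and assumes the ATA attribute table layout (name in column 2, raw value in column 10).
smart_bad_counts() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ { print $2, $10 }'
}

# Intended use (example device name):
#   sudo smartctl -A /dev/sda | smart_bad_counts
```

All zeros here is what "no bad or reallocated blocks" looks like on a typical SATA disk.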