Hi,
We made a huge mistake and ordered Dell R410s a year ago.
We used Debian Lenny with Xen on our production servers. The Dell R410 is not supported by the Xen kernel (no network driver). I spent weeks trying to get Xen to work on it.
I gave up and installed Ubuntu 9.04 with KVM, which was later upgraded to Lucid.
These servers are plagued by stability issues.
For example:
Guests will disappear randomly (segfault? Nothing is left in the logs).
Network will drop out for a moment (creating split-brain with heartbeat).
Network will drop out completely.
The setup is a default Lucid 64-bit server install, which uses KVM in bridge mode. The problem was present with Jaunty as well.
All these problems occur under moderate disk usage (rsync activity, for example).
Guests are Debian Lenny and Ubuntu Jaunty.
Hosts are Dell PowerEdge R410s with 16GB RAM, running software RAID1 with LVM on top. The disks are WD 500GB drives, a couple of months old.
The errors on the guests are as follows:
Code:
[423878.950832] INFO: task pdflush:23 blocked for more than 120 seconds.
[423878.954209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[423878.957855] pdflush D ffff88006acd3ca0 0 23 2
[423878.960959] ffff88006acd3c90 0000000000000046 ffffffff80a65900 ffffffff80a65900
[423878.964347] ffffffff80a65900 ffffffff80a65900 ffffffff80a65900 ffffffff80a65900
[423878.968174] ffffffff80a65900 ffffffff80a65900 ffff88006d2b2cc0 ffff88003001c320
[423878.971697] Call Trace:
[423878.973375] [<ffffffffa0144d4d>] xfs_buf_wait_unpin+0xbd/0x130 [xfs]
[423878.976220] [<ffffffff8024a6f0>] ? default_wake_function+0x0/0x10
[423878.979104] [<ffffffffa0144b57>] ? xfs_buf_rele+0x37/0xd0 [xfs]
[423878.981940] [<ffffffffa0144e2d>] xfs_buf_iorequest+0x6d/0x90 [xfs]
[423878.984793] [<ffffffffa0149ee5>] xfs_bdstrat_cb+0x35/0x60 [xfs]
[423878.987609] [<ffffffffa01409cf>] xfs_bwrite+0x6f/0xf0 [xfs]
[423878.989205] [<ffffffffa013b10f>] xfs_syncsub+0x13f/0x300 [xfs]
[423878.991434] [<ffffffffa013b317>] xfs_sync+0x47/0x70 [xfs]
[423878.993643] [<ffffffffa014b8b3>] xfs_fs_write_super+0x23/0x30 [xfs]
[423878.996378] [<ffffffff802e99fc>] sync_supers+0x8c/0xe0
[423878.998578] [<ffffffff802b80c2>] wb_kupdate+0x32/0x120
[423879.000772] [<ffffffff802b99f6>] __pdflush+0x136/0x220
[423879.002789] [<ffffffff802b9b36>] pdflush+0x56/0x60
[423879.005014] [<ffffffff802b8090>] ? wb_kupdate+0x0/0x120
[423879.007964] [<ffffffff802b9ae0>] ? pdflush+0x0/0x60
[423879.009693] [<ffffffff80268689>] kthread+0x49/0x90
[423879.011313] [<ffffffff80213979>] child_rip+0xa/0x11
[423879.012788] [<ffffffff80268640>] ? kthread+0x0/0x90
[423879.014347] [<ffffffff8021396f>] ? child_rip+0x0/0x11
[423887.233817] ata1: device not ready (errno=-16), forcing hardreset
[423887.236241] ata1: soft resetting link
[423887.467229] ata1.00: configured for MWDMA2
[423887.468727] ata1: EH complete
[423887.645648] sd 0:0:0:0: [sda] 12582912 512-byte hardware sectors: (6.44 GB/6.00 GiB)
[423887.726022] sd 0:0:0:0: [sda] Write Protect is off
[423887.729317] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[423887.913874] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[429349.139745] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[429349.141514] ata1.00: cmd ca/00:08:67:c9:2f/00:00:00:00:00/e0 tag 0 dma 4096 out
[429349.141516] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[429349.145699] ata1.00: status: { DRDY }
And so on.
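To compare guests, it may help to count how often each kind of event shows up in a guest's kernel log. This is only a rough sketch: the function name `summarize_kernel_log` and the three patterns are my own assumptions, modelled on the messages quoted above, and they will miss anything worded differently.

```python
import re
from collections import Counter

# Patterns are assumptions derived from the log excerpt above,
# not an exhaustive list of kernel messages.
PATTERNS = {
    "hung_task": re.compile(r"INFO: task \S+ blocked for more than \d+ seconds"),
    "ata_reset": re.compile(r"ata\d+: (?:soft|hard) resetting link"),
    "ata_exception": re.compile(r"ata\d+\.\d+: exception Emask"),
}


def summarize_kernel_log(log_text):
    """Count lines in log_text that match each known pattern."""
    counts = Counter()
    for line in log_text.splitlines():
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[kind] += 1
    return counts
```

Feeding it the saved output of `dmesg` from each guest gives a per-guest tally that can be lined up against the times rsync was running.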
We never had problems like that with Xen, even when the guests had a load average of 100+. With KVM, an rsync from one guest to another causes this to happen on every guest on that machine.
Here is an example of a guest definition:
Code:
<domain type='kvm'>
  <name>myguest</name>
  <memory>131072</memory>
  <currentMemory>131072</currentMemory>
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64' machine='pc-0.12'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='block' device='disk'>
      <source dev='/dev/vg00/myguest'/>
      <target dev='hda' bus='ide'/>
    </disk>
    <interface type='bridge'>
      <source bridge='br0'/>
      <model type='virtio'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target port='0'/>
    </console>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes' listen='127.0.0.1'/>
    <video>
      <model type='cirrus' vram='9216' heads='1'/>
    </video>
  </devices>
</domain>
As you can see, we are using LVM for the guest (it would be very silly to use a file in production).
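For checking many guests at once, the disk backing, disk bus and NIC model can be pulled out of each definition, for example from the output of `virsh dumpxml`. This is a minimal sketch: `describe_guest_devices` is a made-up helper, not part of libvirt, and it only looks at the elements used in the definition above.

```python
import xml.etree.ElementTree as ET


def describe_guest_devices(domain_xml):
    """Return the disks and NICs declared in a libvirt domain definition."""
    root = ET.fromstring(domain_xml)

    disks = []
    for disk in root.findall("./devices/disk"):
        source = disk.find("source")
        target = disk.find("target")
        backing = None
        if source is not None:
            # Block-backed disks use 'dev', file-backed disks use 'file'.
            backing = source.get("dev") or source.get("file")
        bus = target.get("bus") if target is not None else None
        disks.append((disk.get("type"), backing, bus))

    nics = []
    for iface in root.findall("./devices/interface"):
        source = iface.find("source")
        model = iface.find("model")
        bridge = source.get("bridge") if source is not None else None
        nic_model = model.get("type") if model is not None else None
        nics.append((bridge, nic_model))

    return {"disks": disks, "nics": nics}
```

For the guest above this reports a block device on the `ide` bus and a `virtio` NIC on `br0`.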
We also use Xen-style LVM volumes without a partition table or boot loader and with an external kernel; the load issue affects both styles.
Right now we will decommission these otherwise brand-new servers and try to resolve this issue (most likely by buying Intel network cards and installing Xen).
I can't believe other people use KVM in production; perhaps they don't care if their guests crash in the worst possible way (semi-disappearing).
I am at a loss here...
EDIT: Another extremely bad behaviour is that KVM guests are treated exactly like any other process, which means they can be swapped out. Even though I did not overcommit memory, on many occasions I saw guests being swapped out, e.g.:
Code:
Mem: 16455628k total, 15651600k used, 804028k free, 3770560k buffers
Swap: 3903480k total, 1704884k used, 2198596k free, 22344k cached
In this case, if I add up all the guests it works out to 14GB (which leaves 2GB for the host, far more than sufficient).
I have vm.swappiness set to 0 ...
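To put numbers on the `top` summary above, here is a small sketch that parses the two lines and separates memory held by processes from buffers and page cache. The helper names are made up for illustration, and it assumes the older `top` layout shown above, where the "cached" figure on the Swap line is page cache rather than swap.

```python
import re


def parse_top_summary(text):
    """Parse top's 'Mem:' and 'Swap:' summary lines into a dict of KiB values."""
    result = {}
    for section in re.finditer(r"(Mem|Swap):(.*?)(?=Mem:|Swap:|$)", text, re.S):
        label = section.group(1).lower()
        for value, name in re.findall(r"(\d+)k (\w+)", section.group(2)):
            result[f"{label}_{name}"] = int(value)
    return result


def process_memory_kib(summary):
    """Memory used by processes: used minus buffers and page cache.

    Assumes 'swap_cached' is the page cache figure that this top
    layout prints on the Swap line.
    """
    return summary["mem_used"] - summary["mem_buffers"] - summary["swap_cached"]


top_output = (
    "Mem: 16455628k total, 15651600k used, 804028k free, 3770560k buffers\n"
    "Swap: 3903480k total, 1704884k used, 2198596k free, 22344k cached\n"
)
summary = parse_top_summary(top_output)
in_ram = process_memory_kib(summary)
in_swap = summary["swap_used"]
print(f"processes in RAM: {in_ram / 1048576:.2f} GiB")
print(f"swapped out:      {in_swap / 1048576:.2f} GiB")
print(f"buffers:          {summary['mem_buffers'] / 1048576:.2f} GiB")
```

On these figures the buffers alone are larger than the amount swapped out, which is the part that bothers me with swappiness at 0.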