Virtio memory balloon driver
When we build a Centos 7 VM with 1.4T RAM it fails with "[ 17.797177] BUG: unable to handle kernel paging request at ffff988b19478000"

I asked in #centos and they asked me to show a list of devices from a working VM (if I use 720G RAM it works). This is the list:

[root@alberttest1 ~]# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:04.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:05.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:06.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:07.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
[root@alberttest1 ~]# lsusb
Bus 001 Device 002: ID 0627:0001 Adomax Technology Co., Ltd
Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub

They suspect that the "Virtio memory balloon" driver is causing the problem, and that we should disable it. I googled around and found this:

http://www.linux-kvm.org/page/Projects/auto-ballooning

It looks like memory ballooning is deprecated. How can I get rid of the driver?

Also they complained about my host bridge device; they say that we should have a newer one:

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)

Where can I specify the host bridge?

<bugs_> ok ozzzo one of the devices is called "virtio memory balloon"
[13:18:12] <bugs_> do you see that?
[13:18:21] <ozzzo> yes
[13:18:47] <bugs_> i suggest you google that and read about what it does - i think it would
[13:19:02] <bugs_> be worth trying to disable that device on your larger vm to see what happens
[13:19:18] <ozzzo> ok I will try that, thank you
[13:19:30] * Altiare (~Altiare@unaffiliated/altiare) has quit IRC (Quit: Leaving)
[13:21:45] * Sheogorath[m] (sheogora1@gateway/shell/matrix.org/x-uiiwpoddodtgrwwz) joins #centos
[13:22:06] <@TrevorH> I also notice that the VM seems to be using the very old 440FX and there's a newer model of hardware available that might be worth checking
[13:22:21] <@TrevorH> 440FX chipset is the old old pentium Pro chipset!
[13:22:32] <@TrevorH> I had one of those in about 1996
On Mon, Feb 3, 2020, at 1:36 PM, Albert Braden wrote:
When we build a Centos 7 VM with 1.4T RAM it fails with “[ 17.797177] BUG: unable to handle kernel paging request at ffff988b19478000”
I asked in #centos and they asked me to show a list of devices from a working VM (if I use 720G RAM it works). This is the list:
[root@alberttest1 ~]# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:04.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:05.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:06.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:07.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
[root@alberttest1 ~]# lsusb
Bus 001 Device 002: ID 0627:0001 Adomax Technology Co., Ltd
Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
They suspect that the “Virtio memory balloon” driver is causing the problem, and that we should disable it. I googled around and found this:
http://www.linux-kvm.org/page/Projects/auto-ballooning
It looks like memory ballooning is deprecated. How can I get rid of the driver?
Looking at Nova's code [0], the memballoon device is only set if mem_stats_period_seconds has a value greater than 0. The default [1] is 10, so you get it by default. I would try setting this config option to 0 and recreating the instance. Note that I think this will apply to all VMs; the option was originally added so that tools could get memory usage statistics.
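For example, a minimal sketch of that change, assuming the option belongs in the [libvirt] section of nova.conf on your compute nodes:

    # nova.conf on each compute node (illustrative placement; check your deployment tooling)
    [libvirt]
    # 0 disables memory-usage stats collection, which should omit the memballoon device
    mem_stats_period_seconds = 0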
Also they complained about my host bridge device; they say that we should have a newer one:
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
Where can I specify the host bridge?
For this I think you need to set hw_machine_type [2]. Looking at this bug [3] I think the value you may want is q35.
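As a hedged example, that would look something like this in nova.conf (the arch=machine-type syntax matches what is shown later in this thread):

    [libvirt]
    hw_machine_type = x86_64=q35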
<bugs_> ok ozzzo one of the devices is called "virtio memory balloon"
[13:18:12] <bugs_> do you see that?
[13:18:21] <ozzzo> yes
[13:18:47] <bugs_> i suggest you google that and read about what it does - i think it would
[13:19:02] <bugs_> be worth trying to disable that device on your larger vm to see what happens
[13:19:18] <ozzzo> ok I will try that, thank you
[13:19:30] * Altiare (~Altiare@unaffiliated/altiare) has quit IRC (Quit: Leaving)
[13:21:45] * Sheogorath[m] (sheogora1@gateway/shell/matrix.org/x-uiiwpoddodtgrwwz) joins #centos
[13:22:06] <@TrevorH> I also notice that the VM seems to be using the very old 440FX and there's a newer model of hardware available that might be worth checking
[13:22:21] <@TrevorH> 440FX chipset is the old old pentium Pro chipset!
[13:22:32] <@TrevorH> I had one of those in about 1996
[0] https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/driver.py#L5840-L5852
[1] https://docs.openstack.org/nova/train/configuration/config.html#libvirt.mem_stats_period_seconds
[2] https://docs.openstack.org/nova/train/configuration/config.html#libvirt.hw_machine_type
[3] https://bugs.launchpad.net/nova/+bug/1780138
On Mon, Feb 3, 2020, at 1:36 PM, Albert Braden wrote:
When we build a Centos 7 VM with 1.4T RAM it fails with “[ 17.797177] BUG: unable to handle kernel paging request at ffff988b19478000”
I asked in #centos and they asked me to show a list of devices from a working VM (if I use 720G RAM it works). This is the list:
[root@alberttest1 ~]# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:04.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:05.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:06.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:07.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
[root@alberttest1 ~]# lsusb
Bus 001 Device 002: ID 0627:0001 Adomax Technology Co., Ltd
Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
They suspect that the “Virtio memory balloon” driver is causing the problem, and that we should disable it. I googled around and found this:
http://www.linux-kvm.org/page/Projects/auto-ballooning
It looks like memory ballooning is deprecated. How can I get rid of the driver?
Looking at Nova's code [0], the memballoon device is only set if mem_stats_period_seconds has a value greater than 0. The default [1] is 10, so you get it by default. I would try setting this config option to 0 and recreating the instance. Note that I think this will apply to all VMs; the option was originally added so that tools could get memory usage statistics.

I forgot about that option. We had talked about disabling the stats by default at one point. Downstream I think we do, at least via config on realtime hosts, as we found the stats collection causes latency spikes.
Also they complained about my host bridge device; they say that we should have a newer one:
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
Where can I specify the host bridge?
For this I think you need to set hw_machine_type [2]. Looking at this bug [3] I think the value you may want is q35.

Yes, but if you enable q35 you also need to be aware that, unlike the pc (i440FX) machine type, only one additional PCI slot will be allocated, so if you want to allow attaching more than one volume or NIC after the VM is booted you need to adjust https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.num_pcie_ports.
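A hedged sketch of that adjustment (the right count depends on how many devices you expect to hot-plug):

    [libvirt]
    hw_machine_type = x86_64=q35
    # pre-allocate extra PCIe root ports for hotplug under q35; 16 is an illustrative value
    num_pcie_ports = 16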
On Mon, 2020-02-03 at 14:11 -0800, Clark Boylan wrote:

The more additional PCIe ports you enable, the more memory QEMU requires, regardless of whether you use them, and by default, even without allocating more PCIe ports, QEMU uses more memory in q35 mode than with the pc machine type. You should also be aware that the IDE bus is not supported by default with q35, which causes issues for some older operating systems if you use config drives. With all that said, we do want to eventually make q35 the default in Nova, but changing that has lots of other side effects, which is why we have not done it yet. q35 is required for many new features and is supported; it's just not the default.
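If the config-drive/IDE issue bites, one hedged workaround (assuming the standard hw_cdrom_bus image property applies here; verify against your release's docs) is to request a different CD-ROM bus on the image:

    # hypothetical image name; scsi avoids the missing IDE bus on q35
    openstack image set --property hw_cdrom_bus=scsi centos7-image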
<bugs_> ok ozzzo one of the devices is called "virtio memory balloon"
[13:18:12] <bugs_> do you see that?
[13:18:21] <ozzzo> yes
[13:18:47] <bugs_> i suggest you google that and read about what it does - i think it would
[13:19:02] <bugs_> be worth trying to disable that device on your larger vm to see what happens
[13:19:18] <ozzzo> ok I will try that, thank you
[13:19:30] * Altiare (~Altiare@unaffiliated/altiare) has quit IRC (Quit: Leaving)
[13:21:45] * Sheogorath[m] (sheogora1@gateway/shell/matrix.org/x-uiiwpoddodtgrwwz) joins #centos
[13:22:06] <@TrevorH> I also notice that the VM seems to be using the very old 440FX and there's a newer model of hardware available that might be worth checking
[13:22:21] <@TrevorH> 440FX chipset is the old old pentium Pro chipset!
[13:22:32] <@TrevorH> I had one of those in about 1996
[0] https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/driver.py#L5840-L5852
[1] https://docs.openstack.org/nova/train/configuration/config.html#libvirt.mem_stats_period_seconds
[2] https://docs.openstack.org/nova/train/configuration/config.html#libvirt.hw_machine_type
[3] https://bugs.launchpad.net/nova/+bug/1780138
On Mon, Feb 3, 2020, at 1:36 PM, Albert Braden wrote:
When we build a Centos 7 VM with 1.4T RAM it fails with “[ 17.797177] BUG: unable to handle kernel paging request at ffff988b19478000”
I asked in #centos and they asked me to show a list of devices from a working VM (if I use 720G RAM it works). This is the list:
[root@alberttest1 ~]# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:04.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:05.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:06.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:07.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
[root@alberttest1 ~]# lsusb
Bus 001 Device 002: ID 0627:0001 Adomax Technology Co., Ltd
Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
They suspect that the “Virtio memory balloon” driver is causing the problem, and that we should disable it. I googled around and found this:
http://www.linux-kvm.org/page/Projects/auto-ballooning
It looks like memory ballooning is deprecated. How can I get rid of the driver?
Looking at Nova's code [0], the memballoon device is only set if mem_stats_period_seconds has a value greater than 0. The default [1] is 10, so you get it by default. I would try setting this config option to 0 and recreating the instance. Note that I think this will apply to all VMs; the option was originally added so that tools could get memory usage statistics.

I forgot about that option. We had talked about disabling the stats by default at one point. Downstream I think we do, at least via config on realtime hosts, as we found the stats collection causes latency spikes.
Also they complained about my host bridge device; they say that we should have a newer one:
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
Where can I specify the host bridge?
For this I think you need to set hw_machine_type [2]. Looking at this bug [3] I think the value you may want is q35.

Yes, but if you enable q35 you also need to be aware that, unlike the pc (i440FX) machine type, only one additional PCI slot will be allocated, so if you want to allow attaching more than one volume or NIC after the VM is booted you need to adjust https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.num_pcie_ports.
I set mem_stats_period_seconds = 0 in nova.conf on controllers and hypervisors, and restarted nova services, and then built another VM, but it still has the balloon device:

albertb@alberttest4:~ $ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:04.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:05.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:06.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:07.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon

I'll try the q35 setting now. (A host-side check for the device is sketched after the quoted message below.)

-----Original Message-----
From: Sean Mooney <smooney@redhat.com>
Sent: Monday, February 3, 2020 2:56 PM
To: Clark Boylan <cboylan@sapwetik.org>; openstack-discuss@lists.openstack.org
Subject: Re: Virtio memory balloon driver

On Mon, 2020-02-03 at 14:11 -0800, Clark Boylan wrote:

The more additional PCIe ports you enable, the more memory QEMU requires, regardless of whether you use them, and by default, even without allocating more PCIe ports, QEMU uses more memory in q35 mode than with the pc machine type. You should also be aware that the IDE bus is not supported by default with q35, which causes issues for some older operating systems if you use config drives. With all that said, we do want to eventually make q35 the default in Nova, but changing that has lots of other side effects, which is why we have not done it yet. q35 is required for many new features and is supported; it's just not the default.
<bugs_> ok ozzzo one of the devices is called "virtio memory balloon"
[13:18:12] <bugs_> do you see that?
[13:18:21] <ozzzo> yes
[13:18:47] <bugs_> i suggest you google that and read about what it does - i think it would
[13:19:02] <bugs_> be worth trying to disable that device on your larger vm to see what happens
[13:19:18] <ozzzo> ok I will try that, thank you
[13:19:30] * Altiare (~Altiare@unaffiliated/altiare) has quit IRC (Quit: Leaving)
[13:21:45] * Sheogorath[m] (sheogora1@gateway/shell/matrix.org/x-uiiwpoddodtgrwwz) joins #centos
[13:22:06] <@TrevorH> I also notice that the VM seems to be using the very old 440FX and there's a newer model of hardware available that might be worth checking
[13:22:21] <@TrevorH> 440FX chipset is the old old pentium Pro chipset!
[13:22:32] <@TrevorH> I had one of those in about 1996
[0] https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/driver.py#L5840-L5852
[1] https://docs.openstack.org/nova/train/configuration/config.html#libvirt.mem_stats_period_seconds
[2] https://docs.openstack.org/nova/train/configuration/config.html#libvirt.hw_machine_type
[3] https://bugs.launchpad.net/nova/+bug/1780138
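A quick way to check whether the balloon device made it into the generated libvirt domain (a hedged suggestion; instance-00002164 is just an example domain name taken from the logs later in this thread):

    # run on the hypervisor hosting the guest
    virsh dumpxml instance-00002164 | grep -A 2 memballoon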
On Mon, 2020-02-03 at 21:36 +0000, Albert Braden wrote:
When we build a Centos 7 VM with 1.4T RAM it fails with "[ 17.797177] BUG: unable to handle kernel paging request at ffff988b19478000"
I asked in #centos and they asked me to show a list of devices from a working VM (if I use 720G RAM it works). This is the list:
[root@alberttest1 ~]# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:04.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:05.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:06.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:07.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
[root@alberttest1 ~]# lsusb
Bus 001 Device 002: ID 0627:0001 Adomax Technology Co., Ltd
Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
They suspect that the "Virtio memory balloon" driver is causing the problem, and that we should disable it. I googled around and found this:
http://www.linux-kvm.org/page/Projects/auto-ballooning
It looks like memory ballooning is deprecated. How can I get rid of the driver?

http://www.linux-kvm.org/page/Projects/auto-ballooning states that no QEMU that exists today implements that feature, but the fact that you see it in lspci seems to be in conflict with that. There are several references to the feature in later releases of QEMU, and it is documented in libvirt: https://libvirt.org/formatdomain.html#elementsMemBalloon
There is no way to turn it off specifically at the moment, and I'm not aware of it being deprecated. The guest will not interact with the virtio memory balloon by default. It is there to allow the guest to free memory and return it to the host, enabling cooperation between the guests and the host for memory oversubscription. I believe this normally needs the QEMU guest agent to be deployed to work fully. With a 1.4TB VM, how much memory have you reserved on the host? QEMU needs memory to implement the VM emulation, and this tends to increase as the guest uses more resources. My first inclination would be to check whether the VM was killed as a result of an OOM event on the host.
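For reference, at the libvirt level the device can be suppressed explicitly in the domain XML (a sketch based on the formatdomain docs linked above; Nova does not expose this directly):

    <devices>
      <!-- model="none" tells libvirt not to attach a balloon device at all -->
      <memballoon model="none"/>
    </devices>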
Also they complained about my host bridge device; they say that we should have a newer one:
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
Where can I specify the host bridge?
You change this by specifying the machine type; you can use the q35 machine type instead. q35 is the replacement for i440FX, but when you enable it, it will change a lot of other parameters. I don't know whether it will disable the virtio memory balloon or not, but if you are using a large amount of memory you should also be using hugepages to reduce the overhead and improve performance.

You can either set the machine type in the config (https://docs.openstack.org/nova/latest/configuration/config.html#libvirt.hw_machine_type):

[libvirt]
hw_machine_type = x86_64=q35

or in the guest image (https://github.com/openstack/glance/blob/master/etc/metadefs/compute-libvirt...), e.g. hw_machine_type=q35. Note that in the image property you don't include the arch.
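Two hedged examples of applying those suggestions with the OpenStack client (the image and flavor names are illustrative):

    # request the q35 machine type via an image property (no arch prefix here)
    openstack image set --property hw_machine_type=q35 centos7-image

    # back very large guests with 1G hugepages to cut overhead
    openstack flavor set --property hw:mem_page_size=1GB giant-flavor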
<bugs_> ok ozzzo one of the devices is called "virtio memory balloon"
[13:18:12] <bugs_> do you see that?
[13:18:21] <ozzzo> yes
[13:18:47] <bugs_> i suggest you google that and read about what it does - i think it would
[13:19:02] <bugs_> be worth trying to disable that device on your larger vm to see what happens
[13:19:18] <ozzzo> ok I will try that, thank you
[13:19:30] * Altiare (~Altiare@unaffiliated/altiare) has quit IRC (Quit: Leaving)
[13:21:45] * Sheogorath[m] (sheogora1@gateway/shell/matrix.org/x-uiiwpoddodtgrwwz) joins #centos
[13:22:06] <@TrevorH> I also notice that the VM seems to be using the very old 440FX and there's a newer model of hardware available that might be worth checking
[13:22:21] <@TrevorH> 440FX chipset is the old old pentium Pro chipset!
[13:22:32] <@TrevorH> I had one of those in about 1996
Yes, it is an old chipset from the 90s, but it is the default that OpenStack has used since it was created. We will likely change that in a cycle or two, but really, don't be surprised that we are using 440FX by default. It's not really emulating a platform from 1996; it started that way, but it has been updated with the same name kept. With that said, it does not support PCIe or many other features, which is why we want to move to q35. q35, however, while much more modern and secure, uses more memory and does not support older operating systems, so there are trade-offs. If you need to run CentOS 5 or 6, I would not be surprised if you have issues with q35.
We are reserving 2 CPU and 16G RAM for the hypervisor. I haven't seen any OOM errors. Where should I look for those?

-----Original Message-----
From: Sean Mooney <smooney@redhat.com>
Sent: Monday, February 3, 2020 2:47 PM
To: Albert Braden <albertb@synopsys.com>; OpenStack Discuss ML <openstack-discuss@lists.openstack.org>
Subject: Re: Virtio memory balloon driver

On Mon, 2020-02-03 at 21:36 +0000, Albert Braden wrote:
When we build a Centos 7 VM with 1.4T RAM it fails with "[ 17.797177] BUG: unable to handle kernel paging request at ffff988b19478000"
I asked in #centos and they asked me to show a list of devices from a working VM (if I use 720G RAM it works). This is the list:
[root@alberttest1 ~]# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:04.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:05.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:06.0 SCSI storage controller: Red Hat, Inc. Virtio block device
00:07.0 Unclassified device [00ff]: Red Hat, Inc. Virtio memory balloon
[root@alberttest1 ~]# lsusb
Bus 001 Device 002: ID 0627:0001 Adomax Technology Co., Ltd
Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
They suspect that the "Virtio memory balloon" driver is causing the problem, and that we should disable it. I googled around and found this:
It looks like memory ballooning is deprecated. How can I get rid of the driver?

http://www.linux-kvm.org/page/Projects/auto-ballooning states that no QEMU that exists today implements that feature, but the fact that you see it in lspci seems to be in conflict with that. There are several references to the feature in later releases of QEMU, and it is documented in libvirt: https://libvirt.org/formatdomain.html#elementsMemBalloon
There is no way to turn it off specifically at the moment, and I'm not aware of it being deprecated. The guest will not interact with the virtio memory balloon by default. It is there to allow the guest to free memory and return it to the host, enabling cooperation between the guests and the host for memory oversubscription. I believe this normally needs the QEMU guest agent to be deployed to work fully. With a 1.4TB VM, how much memory have you reserved on the host? QEMU needs memory to implement the VM emulation, and this tends to increase as the guest uses more resources. My first inclination would be to check whether the VM was killed as a result of an OOM event on the host.
On 2020-02-03 23:57:28 +0000 (+0000), Albert Braden wrote:
We are reserving 2 CPU and 16G RAM for the hypervisor. I haven't seen any OOM errors. Where should I look for those? [...]
The `dmesg` utility on the hypervisor host should show you the kernel's log ring buffer contents (the -T flag is useful to translate its timestamps into something more readable than seconds since boot too). If the ring buffer has overwritten the relevant timeframe then look for signs of kernel OOM killer invocation in your syslog or persistent journald storage.

--
Jeremy Stanley
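A hedged sketch of that check (the exact match patterns vary by kernel version):

    # human-readable kernel ring buffer on the hypervisor
    dmesg -T | grep -i -E 'out of memory|oom'

    # if the ring buffer has wrapped, search the persistent journal instead
    journalctl -k | grep -i oom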
When I start and stop the giant VM I don't see any evidence of OOM errors. I suspect that the #centos guys may be correct when they say that the "Virtio memory balloon" device is not capable of addressing that much memory, and that I must disable it if I want to create VMs with 1.4T RAM. Setting "mem_stats_period_seconds = 0" doesn't seem to disable it.

How are others working around this? Is anyone else creating Centos 6 VMs with 1.4T or more RAM?

Console log: https://f.perl.bot/p/njvgbm

The error is at line 404: [ 18.736435] BUG: unable to handle kernel paging request at ffff9ca8d9980000

Dmesg:

[Tue Feb 4 17:50:42 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:42 2020] device tap039191ba-25 left promiscuous mode
[Tue Feb 4 17:50:42 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:47 2020] device tap039191ba-25 entered promiscuous mode
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered forwarding state

Syslog:

Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751339] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751342] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751450] device tap039191ba-25 entered promiscuous mode
Feb 4 17:50:51 us01odc-p01-hv214 systemd-networkd[781]: tap039191ba-25: Gained carrier
Feb 4 17:50:51 us01odc-p01-hv214 libvirtd[37317]: 2020-02-05 01:50:51.386+0000: 37321: warning : qemuDomainObjTaint:5602 : Domain id=15 name='instance-00002164' uuid=33611060-887a-44c1-a3b8-1c36cb8f9984 is tainted: host-cpu
Feb 4 17:50:51 us01odc-p01-hv214 systemd-udevd[238052]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 4 17:50:51 us01odc-p01-hv214 networkd-dispatcher[1214]: WARNING:Unknown index 32 seen, reloading interface list
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751683] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751685] brq49cbe55d-51: port 1(tap039191ba-25) entered forwarding state
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
Feb 4 17:50:52 us01odc-p01-hv214 systemd-networkd[781]: tap039191ba-25: Gained IPv6LL
Feb 4 17:50:52 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:52 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53

-----Original Message-----
From: Jeremy Stanley <fungi@yuggoth.org>
Sent: Tuesday, February 4, 2020 4:01 AM
To: openstack-discuss@lists.openstack.org
Subject: Re: Virtio memory balloon driver

On 2020-02-03 23:57:28 +0000 (+0000), Albert Braden wrote:
We are reserving 2 CPU and 16G RAM for the hypervisor. I haven't seen any OOM errors. Where should I look for those? [...]
The `dmesg` utility on the hypervisor host should show you the kernel's log ring buffer contents (the -T flag is useful to translate its timestamps into something more readable than seconds since boot too). If the ring buffer has overwritten the relevant timeframe then look for signs of kernel OOM killer invocation in your syslog or persistent journald storage.

--
Jeremy Stanley
On Wed, 2020-02-05 at 17:33 +0000, Albert Braden wrote:
When I start and stop the giant VM I don't see any evidence of OOM errors. I suspect that the #centos guys may be correct when they say that the "Virtio memory balloon" device is not capable of addressing that much memory, and that I must disable it if I want to create VMs with 1.4T RAM. Setting "mem_stats_period_seconds = 0" doesn't seem to disable it.
How are others working around this? Is anyone else creating Centos 6 VMs with 1.4T or more RAM?

I suspect not. Spawning one giant VM that uses all the resources on the host is not a typical use case; in general people move to Ironic when they need a VM that large. I unfortunately don't have time to look into this right now, but we can likely add a way to disable the balloon device, and if you remind me in a day or two I can try and see why mem_stats_period_seconds = 0 is not working for you. Looking at https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/driver.py#L5842-L5852 it should work, but libvirt adds extra elements to the XML after we generate it and fills in some fields. It's possible that libvirt is adding it, and when we don't want the device we need to explicitly disable it in some way. If that is the case we could track this as a bug and potentially backport a fix.
Console log: https://f.perl.bot/p/njvgbm
The error is at line 404: [ 18.736435] BUG: unable to handle kernel paging request at ffff9ca8d9980000
Dmesg:

[Tue Feb 4 17:50:42 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:42 2020] device tap039191ba-25 left promiscuous mode
[Tue Feb 4 17:50:42 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:47 2020] device tap039191ba-25 entered promiscuous mode
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered forwarding state
Syslog:
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751339] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751342] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751450] device tap039191ba-25 entered promiscuous mode
Feb 4 17:50:51 us01odc-p01-hv214 systemd-networkd[781]: tap039191ba-25: Gained carrier
Feb 4 17:50:51 us01odc-p01-hv214 libvirtd[37317]: 2020-02-05 01:50:51.386+0000: 37321: warning : qemuDomainObjTaint:5602 : Domain id=15 name='instance-00002164' uuid=33611060-887a-44c1-a3b8-1c36cb8f9984 is tainted: host-cpu
Feb 4 17:50:51 us01odc-p01-hv214 systemd-udevd[238052]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 4 17:50:51 us01odc-p01-hv214 networkd-dispatcher[1214]: WARNING:Unknown index 32 seen, reloading interface list
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751683] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751685] brq49cbe55d-51: port 1(tap039191ba-25) entered forwarding state
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
Feb 4 17:50:52 us01odc-p01-hv214 systemd-networkd[781]: tap039191ba-25: Gained IPv6LL
Feb 4 17:50:52 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:52 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
-----Original Message-----
From: Jeremy Stanley <fungi@yuggoth.org>
Sent: Tuesday, February 4, 2020 4:01 AM
To: openstack-discuss@lists.openstack.org
Subject: Re: Virtio memory balloon driver
On 2020-02-03 23:57:28 +0000 (+0000), Albert Braden wrote:
We are reserving 2 CPU and 16G RAM for the hypervisor. I haven't seen any OOM errors. Where should I look for those?
[...]
The `dmesg` utility on the hypervisor host should show you the kernel's log ring buffer contents (the -T flag is useful to translate its timestamps into something more readable than seconds since boot too). If the ring buffer has overwritten the relevant timeframe then look for signs of kernel OOM killer invocation in your syslog or persistent journald storage.
Thanks Sean! Should I start a bug report for this?

-----Original Message-----
From: Sean Mooney <smooney@redhat.com>
Sent: Wednesday, February 5, 2020 10:25 AM
To: Albert Braden <albertb@synopsys.com>; openstack-discuss@lists.openstack.org
Subject: Re: Virtio memory balloon driver

On Wed, 2020-02-05 at 17:33 +0000, Albert Braden wrote:
When I start and stop the giant VM I don't see any evidence of OOM errors. I suspect that the #centos guys may be correct when they say that the "Virtio memory balloon" device is not capable of addressing that much memory, and that I must disable it if I want to create VMs with 1.4T RAM. Setting "mem_stats_period_seconds = 0" doesn't seem to disable it.
How are others working around this? Is anyone else creating Centos 6 VMs with 1.4T or more RAM?

I suspect not. Spawning one giant VM that uses all the resources on the host is not a typical use case; in general people move to Ironic when they need a VM that large. I unfortunately don't have time to look into this right now, but we can likely add a way to disable the balloon device, and if you remind me in a day or two I can try and see why mem_stats_period_seconds = 0 is not working for you. Looking at https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/driver.py#L5842-L5852 it should work, but libvirt adds extra elements to the XML after we generate it and fills in some fields. It's possible that libvirt is adding it, and when we don't want the device we need to explicitly disable it in some way. If that is the case we could track this as a bug and potentially backport a fix.
The error is at line 404: [ 18.736435] BUG: unable to handle kernel paging request at ffff9ca8d9980000
Dmesg:

[Tue Feb 4 17:50:42 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:42 2020] device tap039191ba-25 left promiscuous mode
[Tue Feb 4 17:50:42 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:47 2020] device tap039191ba-25 entered promiscuous mode
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered forwarding state
Syslog:
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751339] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751342] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751450] device tap039191ba-25 entered promiscuous mode
Feb 4 17:50:51 us01odc-p01-hv214 systemd-networkd[781]: tap039191ba-25: Gained carrier
Feb 4 17:50:51 us01odc-p01-hv214 libvirtd[37317]: 2020-02-05 01:50:51.386+0000: 37321: warning : qemuDomainObjTaint:5602 : Domain id=15 name='instance-00002164' uuid=33611060-887a-44c1-a3b8-1c36cb8f9984 is tainted: host-cpu
Feb 4 17:50:51 us01odc-p01-hv214 systemd-udevd[238052]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 4 17:50:51 us01odc-p01-hv214 networkd-dispatcher[1214]: WARNING:Unknown index 32 seen, reloading interface list
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751683] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751685] brq49cbe55d-51: port 1(tap039191ba-25) entered forwarding state
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
Feb 4 17:50:52 us01odc-p01-hv214 systemd-networkd[781]: tap039191ba-25: Gained IPv6LL
Feb 4 17:50:52 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:52 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
-----Original Message-----
From: Jeremy Stanley <fungi@yuggoth.org>
Sent: Tuesday, February 4, 2020 4:01 AM
To: openstack-discuss@lists.openstack.org
Subject: Re: Virtio memory balloon driver
On 2020-02-03 23:57:28 +0000 (+0000), Albert Braden wrote:
We are reserving 2 CPU and 16G RAM for the hypervisor. I haven't seen any OOM errors. Where should I look for those?
[...]
The `dmesg` utility on the hypervisor host should show you the kernel's log ring buffer contents (the -T flag is useful to translate its timestamps into something more readable than seconds since boot too). If the ring buffer has overwritten the relevant timeframe then look for signs of kernel OOM killer invocation in your syslog or persistent journald storage.
I opened a bug: https://bugs.launchpad.net/nova/+bug/1862425

-----Original Message-----
From: Sean Mooney <smooney@redhat.com>
Sent: Wednesday, February 5, 2020 10:25 AM
To: Albert Braden <albertb@synopsys.com>; openstack-discuss@lists.openstack.org
Subject: Re: Virtio memory balloon driver

On Wed, 2020-02-05 at 17:33 +0000, Albert Braden wrote:
When I start and stop the giant VM I don't see any evidence of OOM errors. I suspect that the #centos guys may be correct when they say that the "Virtio memory balloon" device is not capable of addressing that much memory, and that I must disable it if I want to create VMs with 1.4T RAM. Setting "mem_stats_period_seconds = 0" doesn't seem to disable it.
How are others working around this? Is anyone else creating Centos 6 VMs with 1.4T or more RAM?

I suspect not. Spawning one giant VM that uses all the resources on the host is not a typical use case; in general people move to Ironic when they need a VM that large. I unfortunately don't have time to look into this right now, but we can likely add a way to disable the balloon device, and if you remind me in a day or two I can try and see why mem_stats_period_seconds = 0 is not working for you. Looking at https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/driver.py#L5842-L5852 it should work, but libvirt adds extra elements to the XML after we generate it and fills in some fields. It's possible that libvirt is adding it, and when we don't want the device we need to explicitly disable it in some way. If that is the case we could track this as a bug and potentially backport a fix.
The error is at line 404: [ 18.736435] BUG: unable to handle kernel paging request at ffff9ca8d9980000
Dmesg:

[Tue Feb 4 17:50:42 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:42 2020] device tap039191ba-25 left promiscuous mode
[Tue Feb 4 17:50:42 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:47 2020] device tap039191ba-25 entered promiscuous mode
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered forwarding state
Syslog:
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751339] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751342] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751450] device tap039191ba-25 entered promiscuous mode
Feb 4 17:50:51 us01odc-p01-hv214 systemd-networkd[781]: tap039191ba-25: Gained carrier
Feb 4 17:50:51 us01odc-p01-hv214 libvirtd[37317]: 2020-02-05 01:50:51.386+0000: 37321: warning : qemuDomainObjTaint:5602 : Domain id=15 name='instance-00002164' uuid=33611060-887a-44c1-a3b8-1c36cb8f9984 is tainted: host-cpu
Feb 4 17:50:51 us01odc-p01-hv214 systemd-udevd[238052]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 4 17:50:51 us01odc-p01-hv214 networkd-dispatcher[1214]: WARNING:Unknown index 32 seen, reloading interface list
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751683] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751685] brq49cbe55d-51: port 1(tap039191ba-25) entered forwarding state
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
Feb 4 17:50:52 us01odc-p01-hv214 systemd-networkd[781]: tap039191ba-25: Gained IPv6LL
Feb 4 17:50:52 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:52 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
-----Original Message-----
From: Jeremy Stanley <fungi@yuggoth.org>
Sent: Tuesday, February 4, 2020 4:01 AM
To: openstack-discuss@lists.openstack.org
Subject: Re: Virtio memory balloon driver
On 2020-02-03 23:57:28 +0000 (+0000), Albert Braden wrote:
We are reserving 2 CPU and 16G RAM for the hypervisor. I haven't seen any OOM errors. Where should I look for those?
[...]
The `dmesg` utility on the hypervisor host should show you the kernel's log ring buffer contents (the -T flag is useful to translate its timestamps into something more readable than seconds since boot too). If the ring buffer has overwritten the relevant timeframe then look for signs of kernel OOM killer invocation in your syslog or persistent journald storage.
Hi Sean,

Do you have time to look at the mem_stats_period_seconds / virtio memory balloon issue this week?

-----Original Message-----
From: Albert Braden <Albert.Braden@synopsys.com>
Sent: Friday, February 7, 2020 2:26 PM
To: Sean Mooney <smooney@redhat.com>; openstack-discuss@lists.openstack.org
Subject: RE: Virtio memory balloon driver

I opened a bug: https://bugs.launchpad.net/nova/+bug/1862425

-----Original Message-----
From: Sean Mooney <smooney@redhat.com>
Sent: Wednesday, February 5, 2020 10:25 AM
To: Albert Braden <albertb@synopsys.com>; openstack-discuss@lists.openstack.org
Subject: Re: Virtio memory balloon driver

On Wed, 2020-02-05 at 17:33 +0000, Albert Braden wrote:
When I start and stop the giant VM I don't see any evidence of OOM errors. I suspect that the #centos guys may be correct when they say that the "Virtio memory balloon" device is not capable of addressing that much memory, and that I must disable it if I want to create VMs with 1.4T RAM. Setting "mem_stats_period_seconds = 0" doesn't seem to disable it.
How are others working around this? Is anyone else creating Centos 6 VMs with 1.4T or more RAM?

I suspect not. Spawning one giant VM that uses all the resources on the host is not a typical use case; in general people move to Ironic when they need a VM that large. I unfortunately don't have time to look into this right now, but we can likely add a way to disable the balloon device, and if you remind me in a day or two I can try and see why mem_stats_period_seconds = 0 is not working for you. Looking at https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/driver.py#L5842-L5852 it should work, but libvirt adds extra elements to the XML after we generate it and fills in some fields. It's possible that libvirt is adding it, and when we don't want the device we need to explicitly disable it in some way. If that is the case we could track this as a bug and potentially backport a fix.
The error is at line 404: [ 18.736435] BUG: unable to handle kernel paging request at ffff9ca8d9980000
Dmesg:

[Tue Feb 4 17:50:42 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:42 2020] device tap039191ba-25 left promiscuous mode
[Tue Feb 4 17:50:42 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:47 2020] device tap039191ba-25 entered promiscuous mode
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered forwarding state
Syslog:
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751339] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751342] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751450] device tap039191ba-25 entered promiscuous mode
Feb 4 17:50:51 us01odc-p01-hv214 systemd-networkd[781]: tap039191ba-25: Gained carrier
Feb 4 17:50:51 us01odc-p01-hv214 libvirtd[37317]: 2020-02-05 01:50:51.386+0000: 37321: warning : qemuDomainObjTaint:5602 : Domain id=15 name='instance-00002164' uuid=33611060-887a-44c1-a3b8-1c36cb8f9984 is tainted: host-cpu
Feb 4 17:50:51 us01odc-p01-hv214 systemd-udevd[238052]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 4 17:50:51 us01odc-p01-hv214 networkd-dispatcher[1214]: WARNING:Unknown index 32 seen, reloading interface list
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751683] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751685] brq49cbe55d-51: port 1(tap039191ba-25) entered forwarding state
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
Feb 4 17:50:52 us01odc-p01-hv214 systemd-networkd[781]: tap039191ba-25: Gained IPv6LL
Feb 4 17:50:52 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:52 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
-----Original Message-----
From: Jeremy Stanley <fungi@yuggoth.org>
Sent: Tuesday, February 4, 2020 4:01 AM
To: openstack-discuss@lists.openstack.org
Subject: Re: Virtio memory balloon driver
On 2020-02-03 23:57:28 +0000 (+0000), Albert Braden wrote:
We are reserving 2 CPU and 16G RAM for the hypervisor. I haven't seen any OOM errors. Where should I look for those?
[...]
The `dmesg` utility on the hypervisor host should show you the kernel's log ring buffer contents (the -T flag is useful to translate its timestamps into something more readable than seconds since boot too). If the ring buffer has overwritten the relevant timeframe then look for signs of kernel OOM killer invocation in your syslog or persistent journald storage.
Participants (4): Albert Braden, Clark Boylan, Jeremy Stanley, Sean Mooney