Hi Sean,

Do you have time to look at the mem_stats_period_seconds / virtio memory balloon issue this week?

-----Original Message-----
From: Albert Braden <Albert.Braden@synopsys.com>
Sent: Friday, February 7, 2020 2:26 PM
To: Sean Mooney <smooney@redhat.com>; openstack-discuss@lists.openstack.org
Subject: RE: Virtio memory balloon driver

I opened a bug: https://bugs.launchpad.net/nova/+bug/1862425

-----Original Message-----
From: Sean Mooney <smooney@redhat.com>
Sent: Wednesday, February 5, 2020 10:25 AM
To: Albert Braden <albertb@synopsys.com>; openstack-discuss@lists.openstack.org
Subject: Re: Virtio memory balloon driver

On Wed, 2020-02-05 at 17:33 +0000, Albert Braden wrote:
When I start and stop the giant VM I don't see any evidence of OOM errors. I suspect that the #centos guys may be correct when they say that the "Virtio memory balloon" device is not capable of addressing that much memory, and that I must disable it if I want to create VMs with 1.4T RAM. Setting "mem_stats_period_seconds = 0" doesn't seem to disable it.
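For clarity, this is the stanza in question; a sketch assuming the option lives in the [libvirt] section of nova.conf, as in current Nova:

    [libvirt]
    # Documented to disable memory usage statistics collection when set
    # to zero or a negative value, but the memballoon device still shows
    # up in the generated guest XML
    mem_stats_period_seconds = 0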
How are others working around this? Is anyone else creating CentOS 6 VMs with 1.4T or more RAM?

I suspect not. Spawning one giant VM that uses all the resources on the host is not a typical use case; in general people move to Ironic when they need a VM that large. I unfortunately don't have time to look into this right now, but we can likely add a way to disable the balloon device, and if you remind me in a day or two I can try to see why mem_stats_period_seconds = 0 is not working for you. Looking at https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/driver.py#L5842-L5852 it should work, but libvirt adds extra elements to the XML after we generate it and fills in some fields. It's possible that libvirt is adding the device, and that when we don't want it we need to explicitly disable it in some way. If that is the case we could track this as a bug and potentially backport a fix.
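For anyone following along, a sketch of the explicit disablement being described: libvirt fills in a default virtio memballoon device when the domain XML omits the element, and it accepts model='none' to suppress the device entirely. The elements below are illustrative of the generated XML, not output from this host:

    <!-- What libvirt fills in by default when the element is omitted;
         the period matches Nova's default mem_stats_period_seconds of 10 -->
    <memballoon model='virtio'>
      <stats period='10'/>
    </memballoon>

    <!-- Explicitly disabling the device -->
    <memballoon model='none'/>

You can check what actually ended up in the running domain with something like "virsh dumpxml instance-00002164 | grep -A 2 memballoon" (instance name taken from the syslog below).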
The error is at line 404:

[   18.736435] BUG: unable to handle kernel paging request at ffff9ca8d9980000
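Assuming that line number refers to the guest console log (an assumption; the source of the log isn't stated here), the line can be pulled out with something like:

    # "giant-vm" is a hypothetical server name; prints line 404 of the console log
    openstack console log show giant-vm | sed -n '404p'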
Dmesg:

[Tue Feb 4 17:50:42 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:42 2020] device tap039191ba-25 left promiscuous mode
[Tue Feb 4 17:50:42 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
[Tue Feb 4 17:50:47 2020] device tap039191ba-25 entered promiscuous mode
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
[Tue Feb 4 17:50:47 2020] brq49cbe55d-51: port 1(tap039191ba-25) entered forwarding state
Syslog:
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751339] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751342] brq49cbe55d-51: port 1(tap039191ba-25) entered disabled state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751450] device tap039191ba-25 entered promiscuous mode
Feb 4 17:50:51 us01odc-p01-hv214 systemd-networkd[781]: tap039191ba-25: Gained carrier
Feb 4 17:50:51 us01odc-p01-hv214 libvirtd[37317]: 2020-02-05 01:50:51.386+0000: 37321: warning : qemuDomainObjTaint:5602 : Domain id=15 name='instance-00002164' uuid=33611060-887a-44c1-a3b8-1c36cb8f9984 is tainted: host-cpu
Feb 4 17:50:51 us01odc-p01-hv214 systemd-udevd[238052]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 4 17:50:51 us01odc-p01-hv214 networkd-dispatcher[1214]: WARNING:Unknown index 32 seen, reloading interface list
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751683] brq49cbe55d-51: port 1(tap039191ba-25) entered blocking state
Feb 4 17:50:51 us01odc-p01-hv214 kernel: [2859840.751685] brq49cbe55d-51: port 1(tap039191ba-25) entered forwarding state
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:51 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
Feb 4 17:50:52 us01odc-p01-hv214 systemd-networkd[781]: tap039191ba-25: Gained IPv6LL
Feb 4 17:50:52 us01odc-p01-hv214 dnsmasq[28739]: reading /etc/resolv.conf
Feb 4 17:50:52 us01odc-p01-hv214 dnsmasq[28739]: using nameserver 127.0.0.53#53
-----Original Message-----
From: Jeremy Stanley <fungi@yuggoth.org>
Sent: Tuesday, February 4, 2020 4:01 AM
To: openstack-discuss@lists.openstack.org
Subject: Re: Virtio memory balloon driver
On 2020-02-03 23:57:28 +0000 (+0000), Albert Braden wrote:
We are reserving 2 CPU and 16G RAM for the hypervisor. I haven't seen any OOM errors. Where should I look for those?
[...]
The `dmesg` utility on the hypervisor host should show you the kernel's log ring buffer contents (the -T flag is useful for translating its timestamps into something more readable than seconds since boot). If the ring buffer has already overwritten the relevant timeframe, then look for signs of kernel OOM killer invocation in your syslog or persistent journald storage.
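For example (a sketch; the date and the grep patterns are illustrative, not exhaustive):

    # Human-readable timestamps in the kernel ring buffer
    dmesg -T | grep -iE 'out of memory|oom-killer|killed process'

    # If the ring buffer has already wrapped, fall back to the persistent journal
    journalctl -k --since "2020-02-04" | grep -iE 'out of memory|oom'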