On Thu, 2023-03-30 at 12:10 +0200, Sylvain Bauza wrote:
On Thu, Mar 30, 2023 at 06:16, Takashi Kajinami <tkajinam@redhat.com> wrote:
Hello,
Since we migrated our jobs from Ubuntu Focal to Ubuntu Jammy, heat gate jobs have become very flaky. Further investigation revealed that the issue is related to something in the libvirt shipped with Ubuntu Jammy that prevents detaching devices from instances[1].
For what it's worth, this is not a problem that is new in Jammy; it also affects the libvirt/qemu versions in Focal and in CentOS 9 Stream. The detach issue was introduced in qemu as a side effect of fixing a security issue. We mostly mitigated the impact on Focal with some tempest changes, but not entirely.
The same problem appears in different jobs[2], and we worked around it by disabling some affected jobs. In heat we also disabled some flaky tests, but because of this we no longer run the basic scenario test which deploys an instance/volume/network in a single stack, which means we lost quite basic test coverage.
My question is, is there anyone in the Nova team working on "fixing" this problem?
Yes and no. We cannot fix this in Nova, as it is not a Nova issue; it is an issue with qemu/libvirt and possibly cirros.
One possible "fix" is to stop using cirros, so I did a few things last night. First, I tried using the ubuntu-minimal cloud image; this is a stripped-down image that is smaller and uses less memory. While it could boot with the normal cirros flavor with 128MB of RAM, cloud-init OOMed. Fortunately that happened after SSH was set up, so I could log in, but it is too close to the memory limit to use. My second attempt was to revive my Alpine disk image builder series https://review.opendev.org/c/openstack/diskimage-builder/+/755410 which now works to generate a really lightweight image (it uses about 30MB of RAM while idle). I am going to try creating a job that will use that instead of cirros; for now I'm just going to use a pre-playbook to build the image in the job and make devstack use that instead.
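To make the pre-playbook idea concrete, here is a rough sketch of how such a playbook might assemble the diskimage-builder invocation. The "alpine" element name is taken from the review linked above; the output name and the exact flag set are assumptions, not the final job definition.

```python
import shlex

# Sketch: build the disk-image-create command line the pre-playbook would run.
# "alpine" is the element from the diskimage-builder review above; the other
# values are placeholders for whatever the job ultimately configures.
def build_image_cmd(output="alpine-minimal", image_type="qcow2",
                    elements=("alpine", "vm")):
    # disk-image-create -t <type> -o <output> <element> [<element> ...]
    return ["disk-image-create", "-t", image_type, "-o", output, *elements]

cmd = build_image_cmd()
print(shlex.join(cmd))
# The playbook would then execute it, e.g. subprocess.run(cmd, check=True),
# and point devstack's DEFAULT_IMAGE_NAME/IMAGE_URLS at the result.
```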
We might be able to implement some workaround (like checking the status of the instance before attempting to delete it), but this should be fixed on the libvirt side IMO, as this looks like a "regression" in Ubuntu Jammy.
This is not new in Jammy, and it should affect RHEL 9 too.
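The "check status before delete" workaround mentioned above could look roughly like the following. This is only a sketch: `get_status` and `delete_server` stand in for whatever client the orchestration code actually uses (e.g. openstacksdk calls), and the status set and timeouts are assumptions.

```python
import time

def wait_then_delete(get_status, delete_server, server_id,
                     stable=frozenset({"ACTIVE", "SHUTOFF", "ERROR"}),
                     timeout=120, interval=5):
    """Poll the server until it reaches a stable state, then delete it.

    Avoids issuing the delete while a device detach may still be in
    flight. get_status/delete_server are hypothetical client callables.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(server_id)
        if status in stable:
            delete_server(server_id)
            return status
        time.sleep(interval)
    raise TimeoutError(f"server {server_id} never reached a stable state")
```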
I am very, very surprised this is not causing us a lot of internal pain in our downstream CI, as it was breaking CentOS 9 before it started affecting Ubuntu. We have seen downstream detach issues, but the SSHable changes in tempest mostly helped. So this is not just an Ubuntu issue; it is affecting all distros, including RHEL. The upstream libvirt bug for the current problem is https://gitlab.com/libvirt/libvirt/-/issues/309 and https://bugzilla.redhat.com/show_bug.cgi?id=2087047 is the downstream tracker for the libvirt team to actually fix this. I have left a comment there to see if I can move that along.
We should probably report a bug against the libvirt package in Ubuntu, but I'd like to hear some thoughts from the Nova team because they are more directly affected by this problem.
FWIW, we discussed it yesterday at our vPTG: https://etherpad.opendev.org/p/nova-bobcat-ptg#L289
Most of the problems come from the volume detach issue. We merged some Tempest changes to not try to clean up some volumes if the test was OK (thanks Dan for this). We also added more verifications to make SSH wait a bit before calling the instance. Eventually, as you can see in the etherpad, we didn't find any solution, but we'll try to add a canary job that tests volume attach/detach multiple times.
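The canary idea from the etherpad could be sketched as a simple attach/detach loop that fails fast on the first detach that never completes. `attach`, `detach`, and `get_volume_status` are hypothetical callables standing in for the real Nova/Cinder client calls; the round count and timeouts are assumptions.

```python
import time

def _wait(get_volume_status, volume_id, wanted, timeout, interval):
    # Poll until the volume reaches the wanted status or time runs out.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_volume_status(volume_id) == wanted:
            return
        time.sleep(interval)
    raise TimeoutError(f"volume {volume_id} never reached {wanted!r}")

def attach_detach_canary(attach, detach, get_volume_status,
                         server_id, volume_id,
                         rounds=10, timeout=60, interval=2):
    """Attach and detach a volume repeatedly; raise on the first hang."""
    for _ in range(rounds):
        attach(server_id, volume_id)
        _wait(get_volume_status, volume_id, "in-use", timeout, interval)
        detach(server_id, volume_id)
        _wait(get_volume_status, volume_id, "available", timeout, interval)
    return rounds
```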
We'll also continue to discuss the CI failures during every Nova weekly meeting (Tuesdays @ 1600 UTC on #openstack-nova), and I want to ask for a cross-project session at the Vancouver PTG for Tempest/Cinder/Nova and others. I'll leave other SMEs to reply to your other points, like for c9s.
c9s hit this before Ubuntu did; it will not help.
I'm now trying to set up a CentOS Stream 9 job in the Heat repo to see whether this can be reproduced if we use CentOS Stream 9. I've been running that specific scenario test in CentOS Stream 9 jobs in the puppet repos and I've never seen this issue, so I suspect the issue is really specific to libvirt in Jammy.
Well, maybe I'm wrong, but no, we also have a CentOS 9 Stream issue for volume detach: https://bugs.launchpad.net/nova/+bug/1960346
[1] https://bugs.launchpad.net/nova/+bug/1998274
[2] https://bugs.launchpad.net/nova/+bug/1998148
Thank you, Takashi