On Thu, 2023-03-30 at 12:10 +0200, Sylvain Bauza wrote:
On Thu, Mar 30, 2023 at 06:16, Takashi Kajinami <tkajinam@redhat.com> wrote:
Hello,
Since we migrated our jobs from Ubuntu Focal to Ubuntu Jammy, heat gate jobs have become very flaky. Further investigation revealed that the issue is related to something in the libvirt shipped with Ubuntu Jammy that prevents detaching devices from instances[1].
For what it's worth, this is not a problem that is new in Jammy; it also affects the libvirt/qemu versions in Focal and in CentOS 9 Stream. The detach issue was introduced in qemu as a side effect of fixing a security issue. We mostly mitigated the impact on Focal with some tempest changes, but not entirely.
The same problem appears in different jobs[2], and we worked around it by disabling some affected jobs. In heat we also disabled some flaky tests, but because of this we no longer run the basic scenario test which deploys an instance/volume/network in a single stack, which means we lost quite basic test coverage.
My question is, is there anyone in the Nova team working on "fixing" this problem?
Yes and no. We cannot fix this in Nova, as it is not a Nova issue; it is an issue with qemu/libvirt and possibly cirros.
One possible "fix" is to stop using cirros, so I did a few things last night. First, I tried using the ubuntu-minimal cloud image; this is a stripped-down image that is smaller and uses less memory. While it could boot with the normal cirros flavor with 128MB of RAM, cloud-init OOMed. Fortunately that happened after SSH was set up, so I could log in, but it is too close to the memory limit to use. My second attempt was to revive my Alpine disk image builder series https://review.opendev.org/c/openstack/diskimage-builder/+/755410 which now works to generate a really lightweight image (it uses about 30MB of RAM while idle). I am going to try creating a job that will use that instead of cirros; for now I'm just going to use a pre-playbook to build the image in the job and make devstack use that instead.
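To make the pre-playbook idea concrete, here is a rough sketch of how such a playbook might assemble the diskimage-builder invocation. The "alpine" element name is taken from the review linked above; the output name and the exact flag set are assumptions, not the final job definition.

```python
import shlex

# Sketch: build the disk-image-create command line the pre-playbook would run.
# "alpine" is the element from the diskimage-builder review above; the other
# values are placeholders for whatever the job ultimately configures.
def build_image_cmd(output="alpine-minimal", image_type="qcow2",
                    elements=("alpine", "vm")):
    # disk-image-create -t <type> -o <output> <element> [<element> ...]
    return ["disk-image-create", "-t", image_type, "-o", output, *elements]

cmd = build_image_cmd()
print(shlex.join(cmd))
# The playbook would then execute it, e.g. subprocess.run(cmd, check=True),
# and point devstack's DEFAULT_IMAGE_NAME/IMAGE_URLS at the result.
```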
We might be able to implement some workaround (like checking the status of the instance before attempting to delete it), but this should be fixed on the libvirt side IMO, as this looks like a "regression" in Ubuntu Jammy.
This is not new in Jammy, and it should affect RHEL 9 too.
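The "check status before delete" workaround mentioned above could look roughly like the following. This is only a sketch: `get_status` and `delete_server` stand in for whatever client the orchestration code actually uses (e.g. openstacksdk calls), and the status set and timeouts are assumptions.

```python
import time

def wait_then_delete(get_status, delete_server, server_id,
                     stable=frozenset({"ACTIVE", "SHUTOFF", "ERROR"}),
                     timeout=120, interval=5):
    """Poll the server until it reaches a stable state, then delete it.

    Avoids issuing the delete while a device detach may still be in
    flight. get_status/delete_server are hypothetical client callables.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(server_id)
        if status in stable:
            delete_server(server_id)
            return status
        time.sleep(interval)
    raise TimeoutError(f"server {server_id} never reached a stable state")
```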
I am very, very surprised this is not causing us a lot of internal pain in our downstream CI, as it was breaking CentOS 9 before it started affecting Ubuntu. We have seen downstream detach issues, but the SSHable changes in tempest mostly helped. So this is not just an Ubuntu issue; it is affecting all distros, including RHEL. The upstream libvirt bug for the current problem is https://gitlab.com/libvirt/libvirt/-/issues/309 and https://bugzilla.redhat.com/show_bug.cgi?id=2087047 is the downstream tracker for the libvirt team to actually fix this. I have left a comment there to see if I can move that along.
We should probably report a bug against the libvirt package in Ubuntu, but I'd like to hear some thoughts from the Nova team because they are more directly affected by this problem.
FWIW, we discussed it yesterday at our vPTG: https://etherpad.opendev.org/p/nova-bobcat-ptg#L289
Most of the problems come from the volume detach issue. We merged some Tempest changes to not try to clean up some volumes if the test was OK (thanks Dan for this). We also added more verifications to make SSH wait a bit before calling the instance. Eventually, as you can see in the etherpad, we didn't find any solution, but we'll try to add a canary job that tests volume attach/detach multiple times.
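The canary idea from the etherpad could be sketched as a simple attach/detach loop that fails fast on the first detach that never completes. `attach`, `detach`, and `get_volume_status` are hypothetical callables standing in for the real Nova/Cinder client calls; the round count and timeouts are assumptions.

```python
import time

def _wait(get_volume_status, volume_id, wanted, timeout, interval):
    # Poll until the volume reaches the wanted status or time runs out.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_volume_status(volume_id) == wanted:
            return
        time.sleep(interval)
    raise TimeoutError(f"volume {volume_id} never reached {wanted!r}")

def attach_detach_canary(attach, detach, get_volume_status,
                         server_id, volume_id,
                         rounds=10, timeout=60, interval=2):
    """Attach and detach a volume repeatedly; raise on the first hang."""
    for _ in range(rounds):
        attach(server_id, volume_id)
        _wait(get_volume_status, volume_id, "in-use", timeout, interval)
        detach(server_id, volume_id)
        _wait(get_volume_status, volume_id, "available", timeout, interval)
    return rounds
```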
We'll also continue to discuss the CI failures during every Nova weekly meeting (Tuesdays @ 1600 UTC on #openstack-nova), and I want to ask for a cross-project session at the Vancouver PTG for Tempest/Cinder/Nova and others. I'll leave other SMEs to reply to your other points, like for c9s.
c9s hit this before Ubuntu did; it will not help.
I'm now trying to set up a CentOS Stream 9 job in the Heat repo to see whether this can be reproduced if we use CentOS Stream 9. I've been running that specific scenario test in CentOS Stream 9 jobs in the puppet repos and I've never seen this issue, so I suspect the issue is really specific to libvirt in Jammy.
Well, maybe I'm wrong, but no, we also have a CentOS 9 Stream issue for volume detach: https://bugs.launchpad.net/nova/+bug/1960346
[1] https://bugs.launchpad.net/nova/+bug/1998274
[2] https://bugs.launchpad.net/nova/+bug/1998148
Thank you, Takashi