[nova][heat] The next steps to "fix" libvirt problems in Ubuntu Jammy

Sean Mooney smooney at redhat.com
Thu Mar 30 11:16:09 UTC 2023


On Thu, 2023-03-30 at 12:10 +0200, Sylvain Bauza wrote:
> On Thu, 30 Mar 2023 at 06:16, Takashi Kajinami <tkajinam at redhat.com>
> wrote:
> 
> > Hello,
> > 
> > 
> > Since we migrated our jobs from Ubuntu Focal to Ubuntu Jammy, Heat gate
> > jobs have become very flaky. Further investigation revealed that the issue
> > is related to something in libvirt from Ubuntu Jammy that prevents
> > detaching devices from instances[1].

For what it's worth, this is not a problem that is new in Jammy; it also affects
the libvirt/QEMU version in Focal and in CentOS 9 Stream.

This detach issue was introduced in QEMU as a side effect of fixing a security issue.
We mostly mitigated the impact on Focal with some Tempest changes, but not entirely.

> > 
> > The same problem appears in different jobs[2], and we worked around the
> > problem by disabling some affected jobs. In Heat we also disabled some
> > flaky tests, but because of this we no longer run the basic scenario tests
> > which deploy an instance/volume/network in a single stack, which means we
> > lost quite basic test coverage.
> > 
> > My question is: is anyone on the Nova team working on "fixing" this
> > problem?
Yes and no. We cannot fix this in Nova, as it is not a Nova issue; it's an
issue with QEMU/libvirt and possibly CirrOS.

One possible "fix" is to stop using CirrOS, so I tried a few things last night.
First I tried using the ubuntu-minimal-cloud-image, a stripped-down image that
is smaller and uses less memory.

While it could boot with the normal CirrOS flavor with 128 MB of RAM, cloud-init
got OOM-killed. Fortunately that happened after SSH was set up, so I could log in,
but it's too close to the memory limit to use.

My second attempt was to revive my Alpine diskimage-builder series:
https://review.opendev.org/c/openstack/diskimage-builder/+/755410

That now works and generates a really lightweight image (it uses about 30 MB of RAM while idle).

I am going to try creating a job that uses that image instead of CirrOS.
For now I'm just going to use a pre playbook to build the image in the job and
make devstack use that instead.


> > We might be able to implement some workaround (like checking the status of
> > the instances before attempting to delete them), but this should be fixed
> > on the libvirt side IMO, as this looks like a "regression" in Ubuntu Jammy.
This is not new in Jammy, and it should affect RHEL 9 as well.

I am very, very surprised this is not causing us a lot of internal pain in our
downstream CI, as it was breaking CentOS 9 before it started affecting Ubuntu.

We have seen downstream detach issues, but the SSHable changes in Tempest mostly
helped. So this is not just an Ubuntu issue; it's affecting all distros, including RHEL.
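For illustration only (this is not the actual Tempest code): both the status check suggested above and the SSHable waits come down to polling a condition against a deadline before touching the guest. A minimal Python sketch, where `wait_for` and `ssh_port_open` are hypothetical helper names and a successful TCP connect to port 22 is assumed to be a good enough proxy for "SSHable". It does not fix the underlying QEMU/libvirt detach bug; it only avoids acting on a guest that is not ready yet.

```python
import socket
import time
from typing import Callable


def wait_for(condition: Callable[[], bool], timeout: float = 300.0,
             interval: float = 5.0) -> bool:
    """Poll condition() until it returns True or `timeout` seconds elapse.

    Hypothetical helper. Returns True if the condition was met and
    False if the deadline passed first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def ssh_port_open(host: str, port: int = 22,
                  connect_timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper. This only shows that something is listening on
    the port; it does not log in, so it is a weaker check than a real
    SSH session and says nothing about the guest's hotplug readiness.
    """
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        return False
```

For example, a test could call `wait_for(lambda: ssh_port_open(guest_ip))` before attaching or detaching a volume, where `guest_ip` is a placeholder for the guest's fixed or floating IP.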

This is the upstream libvirt bug for the current problem: https://gitlab.com/libvirt/libvirt/-/issues/309
https://bugzilla.redhat.com/show_bug.cgi?id=2087047 is the downstream tracker for the libvirt team to
actually fix this; I have left a comment there to see if I can move that along.

> > We should probably report a bug against the libvirt package in Ubuntu, but
> > I'd like to hear some thoughts from the Nova team because they are more
> > directly affected by this problem.
> > 
> > 
> 
> FWIW, we discussed it yesterday at our vPTG:
> https://etherpad.opendev.org/p/nova-bobcat-ptg#L289
> 
> Most of the problems come from volume detach. We also merged some
> Tempest changes to avoid trying to clean up some volumes if the test passed
> (thanks Dan for this). We also added more checks so that tests wait for
> the instance to be reachable over SSH before calling it.
> In the end, as you can see in the etherpad, we didn't find any solutions,
> but we'll try to add a canary job that tests volume attach/detach multiple
> times.
> 
> We'll also continue to discuss the CI failures during every Nova weekly
> meeting (Tuesdays at 1600 UTC on #openstack-nova), and I'd like to request a
> cross-project session for Tempest/Cinder/Nova and others at the Vancouver
> pPTG.
> I'll leave it to other SMEs to reply to your other points, like c9s.
c9s hit this before Ubuntu did, so it will not help.
> 
> 
> > I'm now trying to set up a CentOS Stream 9 job in the Heat repo to see
> > whether this can be reproduced on CentOS Stream 9. I've been running that
> > specific scenario test in CentOS Stream 9 jobs in the puppet repos but have
> > never seen this issue, so I suspect the issue is really specific to libvirt
> > in Jammy.
> > 
> 
> 
> Well, maybe I'm wrong, but no, we also have a CentOS 9 Stream issue for
> volume detaches:
> https://bugs.launchpad.net/nova/+bug/1960346
> 
> 
> 
> > [1] https://bugs.launchpad.net/nova/+bug/1998274
> > [2] https://bugs.launchpad.net/nova/+bug/1998148
> > 
> > Thank you,
> > Takashi
> > 




More information about the openstack-discuss mailing list