[nova][heat] The next steps to "fix" libvirt problems in Ubuntu Jammy

Takashi Kajinami tkajinam at redhat.com
Tue Apr 4 03:35:08 UTC 2023


Thanks Sean for these replies. These make sense to me.

As I mentioned in my earlier reply, I run c9s jobs several times and I did
confirm the issue
can be reproduced in c9s.
# The attempts can be found here:
https://review.opendev.org/c/openstack/heat/+/879014/1

The interesting finding was that the issue appears in c9s much less
frequently than Ubuntu.
(The issue is reproduced in c9s once but I didn't hit it during recheck
while ubuntu jobs were
 consistently blocked by the libvirt problem.)

I don't know what is causing that difference but sharing my observation
just in case that sounds
also interesting to the other people.


On Thu, Mar 30, 2023 at 8:18 PM Sean Mooney <smooney at redhat.com> wrote:

> On Thu, 2023-03-30 at 19:54 +0900, Takashi Kajinami wrote:
> > Thank you, Sylvain, for all these inputs !
> >
> > On Thu, Mar 30, 2023 at 7:10 PM Sylvain Bauza <sbauza at redhat.com> wrote:
> >
> > >
> > >
> > > Le jeu. 30 mars 2023 à 06:16, Takashi Kajinami <tkajinam at redhat.com> a
> > > écrit :
> > >
> > > > Hello,
> > > >
> > > >
> > > > Since we migrated our jobs from Ubuntu Focal to Ubuntu Jammy, heat
> gate
> > > > jobs have
> > > > become very flaky. Further investigation revealed that the issue is
> > > > related to something
> > > > in libvirt from Ubuntu Jammy and that prevents detaching devices from
> > > > instances[1].
> > > >
> > > > The same problem appears in different jobs[2] and we workaround the
> > > > problem by disabling
> > > > some affected jobs. In heat we also disabled some flaky tests but
> because
> > > > of this we no longer
> > > > run basic scenario tests which deploys instance/volume/network in a
> > > > single stack, which means
> > > > we lost the quite basic test coverage.
> > > >
> > > > My question is, is there anyone in the Nova team working on "fixing"
> this
> > > > problem ?
> > > > We might be able to implement some workaround (like checking status
> of
> > > > the instances before
> > > > attempting to delete it) but this should be fixed in libvirt side
> IMO, as
> > > > this looks like a "regression"
> > > > in Ubuntu Jammy.
> > > > Probably we should report a bug against the libvirt package in
> Ubuntu but
> > > > I'd like to hear some
> > > > thoughts from the nova team because they are more directly affected
> by
> > > > this problem.
> > > >
> > > >
> > >
> > > FWIW, we discussed about it yesterday on our vPTG :
> > > https://etherpad.opendev.org/p/nova-bobcat-ptg#L289
> > >
> > > Most of the problems come from the volume detach thing. We also merged
> > > some Tempest changes for not trying to cleanup some volumes if the
> test was
> > > OK (thanks Dan for this). We also added more verifications to ask SSH
> to
> > > wait for a bit of time before calling the instance.
> > > Eventually, as you see in the etherpad, we didn't found any solutions
> but
> > > we'll try to add some canary job for testing multiple times volume
> > > attachs/detachs.
> > >
> >
> > > We'll also continue to discuss on the CI failures during every Nova
> weekly
> > > meetings (Tuesdays at 1600UTC on #openstack-nova) and I'll want to ask a
> > > cross-project session for the Vancouver pPTG for Tempest/Cinder/Nova
> and
> > > others.
> > > I leave other SMEs to reply on your other points, like for c9s.
> > >
> >
> > It's good to hear that the issue is still getting attention. I'll catch
> up
> > the discussion by reading the etherpad
> > and will try to attend follow-up discussions if possible, especially if I
> > can attend Vancouver vPTG.
> >
> > I know some changes have been proposed to check ssh-ability to workaround
> > the problem (though
> > the comment in the vPTG session indicates  that does not fully solve the
> > problem) but it's still annoying
> > because we don't really block resource deletions based on instance status
> > (especially its internal status)
> > so we eventually need some solutions here to avoid this problem, IMHO.
> >
> >
> > >
> > > > I'm now trying to set up a centos stream 9 job in Heat repo to see
> > > > whether this can be reproduced
> > > > if we use centos stream 9. I've been running that specific scenario
> test
> > > > in centos stream 9 jobs
> > > > in puppet repos but I've never seen this issue, so I suspect the
> issue is
> > > > really specific to libvirt
> > > > in Jammy.
> > > >
> > >
> > >
> > > Well, maybe I'm wrong, but no, we also have a centos9stream issue for
> > > volume detachs :
> > > https://bugs.launchpad.net/nova/+bug/1960346
> > >
> > >
> > I just managed to launch a c9s job in heat but it seems the issue is
> > reproducible in c9s as well[1].
>
> ya i replied in paralle in my other reply i noted that we saw this issue
> first in c9s then in ubuntu and we also see this in our internal downstram
> ci.
>
> changing the distro we use for the devstack jobs wont help unless we
> downgrade libvirt and qemu to before the
> orginal change in lbvirt was done. which would break other things.
> > I'll rerun the job a few more times to see how frequent the issue appears
> > in c9s compared to
> > ubuntu.
> > We do not run many tests in puppet jobs so that might be the reason I've
> > never hit it in
> > puppet jobs.
> >
> > [1] https://review.opendev.org/c/openstack/heat/+/879014
> >
> >
> > >
> > >
> > > > [1] https://bugs.launchpad.net/nova/+bug/1998274
> > > > [2] https://bugs.launchpad.net/nova/+bug/1998148
> > > >
> > > > Thank you,
> > > > Takashi
> > > >
> > >
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.openstack.org/pipermail/openstack-discuss/attachments/20230404/bb4fa67a/attachment.htm>


More information about the openstack-discuss mailing list