Thanks Sean for these replies. These make sense to me.

As I mentioned in my earlier reply, I ran the c9s jobs several times and confirmed that the issue
can be reproduced in c9s.
# The attempts can be found here: https://review.opendev.org/c/openstack/heat/+/879014/1

The interesting finding was that the issue appears in c9s much less frequently than in Ubuntu.
(The issue was reproduced in c9s once, but I didn't hit it during rechecks, while the Ubuntu jobs were
 consistently blocked by the libvirt problem.)

I don't know what is causing that difference, but I'm sharing my observation in case it is
also interesting to other people.


On Thu, Mar 30, 2023 at 8:18 PM Sean Mooney <smooney@redhat.com> wrote:
On Thu, 2023-03-30 at 19:54 +0900, Takashi Kajinami wrote:
> Thank you, Sylvain, for all these inputs !
>
> On Thu, Mar 30, 2023 at 7:10 PM Sylvain Bauza <sbauza@redhat.com> wrote:
>
> >
> >
> > On Thu, Mar 30, 2023 at 06:16, Takashi Kajinami <tkajinam@redhat.com> wrote:
> >
> > > Hello,
> > >
> > >
> > > Since we migrated our jobs from Ubuntu Focal to Ubuntu Jammy, heat gate
> > > jobs have
> > > become very flaky. Further investigation revealed that the issue is
> > > related to something
> > > in libvirt from Ubuntu Jammy that prevents detaching devices from
> > > instances[1].
> > >
> > > The same problem appears in different jobs[2], and we worked around the
> > > problem by disabling some affected jobs. In heat we also disabled some
> > > flaky tests, but because of this we no longer run the basic scenario
> > > tests that deploy instance/volume/network in a single stack, which
> > > means we lost quite basic test coverage.
> > >
> > > My question is, is there anyone in the Nova team working on "fixing" this
> > > problem?
> > > We might be able to implement some workaround (like checking the status
> > > of the instance before attempting to delete it), but this should be fixed
> > > on the libvirt side IMO, as this looks like a "regression" in Ubuntu Jammy.
> > > We should probably report a bug against the libvirt package in Ubuntu, but
> > > I'd like to hear some thoughts from the nova team because they are more
> > > directly affected by this problem.
> > >
> > >
> >
> > FWIW, we discussed it yesterday at our vPTG:
> > https://etherpad.opendev.org/p/nova-bobcat-ptg#L289
> >
> > Most of the problems come from the volume detach thing. We also merged
> > some Tempest changes to avoid trying to clean up some volumes if the test was
> > OK (thanks Dan for this). We also added more verifications that wait on SSH
> > for a bit of time before calling the instance.
> > In the end, as you can see in the etherpad, we didn't find any solutions, but
> > we'll try to add a canary job that tests volume attaches/detaches multiple
> > times.
> >
> >
> > We'll also continue to discuss the CI failures during every Nova weekly
> > meeting (Tuesdays@1600UTC on #openstack-nova), and I'd like to ask for a
> > cross-project session at the Vancouver pPTG for Tempest/Cinder/Nova and
> > others.
> > I'll leave it to other SMEs to reply to your other points, like c9s.
> >
>
> It's good to hear that the issue is still getting attention. I'll catch up
> on the discussion by reading the etherpad
> and will try to attend follow-up discussions if possible, especially if I
> can attend the Vancouver vPTG.
>
> I know some changes have been proposed to check ssh-ability to work around
> the problem (though the comment in the vPTG session indicates that this does
> not fully solve the problem), but it's still annoying because we don't really
> block resource deletions based on instance status (especially its internal
> status), so we eventually need some solutions here to avoid this problem, IMHO.
>
>
> >
> > > I'm now trying to set up a CentOS Stream 9 job in the Heat repo to see
> > > whether this can be reproduced with CentOS Stream 9. I've been running
> > > that specific scenario test in CentOS Stream 9 jobs in the puppet repos
> > > but I've never seen this issue, so I suspect the issue is really
> > > specific to libvirt in Jammy.
> > >
> >
> >
> > Well, maybe I'm wrong, but no, we also have a centos9stream issue for
> > volume detaches:
> > https://bugs.launchpad.net/nova/+bug/1960346
> >
> >
> I just managed to launch a c9s job in heat but it seems the issue is
> reproducible in c9s as well[1].

ya, I replied in parallel; in my other reply I noted that we saw this issue
first in c9s, then in Ubuntu, and we also see this in our internal downstream
CI.

Changing the distro we use for the devstack jobs won't help unless we downgrade libvirt and qemu to before the
original change in libvirt was done, which would break other things.
> I'll rerun the job a few more times to see how frequently the issue appears
> in c9s compared to
> ubuntu.
> We do not run many tests in puppet jobs so that might be the reason I've
> never hit it in
> puppet jobs.
>
> [1] https://review.opendev.org/c/openstack/heat/+/879014
>
>
> >
> >
> > > [1] https://bugs.launchpad.net/nova/+bug/1998274
> > > [2] https://bugs.launchpad.net/nova/+bug/1998148
> > >
> > > Thank you,
> > > Takashi
> > >
> >