[nova][heat] The next steps to "fix" libvirt problems in Ubuntu Jammy
Hello,

Since we migrated our jobs from Ubuntu Focal to Ubuntu Jammy, heat gate jobs have become very flaky. Further investigation revealed that the issue is related to something in libvirt in Ubuntu Jammy that prevents detaching devices from instances[1].

The same problem appears in different jobs[2], and we have worked around it by disabling some affected jobs. In heat we also disabled some flaky tests, but because of this we no longer run the basic scenario tests which deploy an instance/volume/network in a single stack, which means we lost quite basic test coverage.

My question is: is anyone in the Nova team working on "fixing" this problem? We might be able to implement some workaround (like checking the status of an instance before attempting to delete it), but this should be fixed on the libvirt side IMO, as this looks like a "regression" in Ubuntu Jammy. We should probably report a bug against the libvirt package in Ubuntu, but I'd like to hear some thoughts from the nova team first, because they are more directly affected by this problem.

I'm now trying to set up a CentOS Stream 9 job in the Heat repo to see whether this can be reproduced if we use CentOS Stream 9. I've been running that specific scenario test in CentOS Stream 9 jobs in the puppet repos but have never seen this issue, so I suspect the issue is really specific to libvirt in Jammy.

[1] https://bugs.launchpad.net/nova/+bug/1998274
[2] https://bugs.launchpad.net/nova/+bug/1998148

Thank you,
Takashi
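As a rough sketch, the workaround mentioned above (checking an instance's state before attempting deletion) could look like this with openstacksdk; the cloud name, server UUID and timeouts are placeholders, and exact SDK signatures may vary between releases:

    import time

    import openstack  # openstacksdk

    def delete_server_when_settled(conn, server_id, timeout=300, interval=5):
        """Wait until nova reports no pending task for the server, then delete it."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            server = conn.compute.get_server(server_id)
            # task_state (OS-EXT-STS:task_state) is None once nova has no
            # in-flight operation such as a device detach.
            if server.task_state is None:
                break
            time.sleep(interval)
        conn.compute.delete_server(server_id)

    conn = openstack.connect(cloud="devstack")
    delete_server_when_settled(conn, "11111111-2222-3333-4444-555555555555")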
On Thu, Mar 30, 2023 at 06:16, Takashi Kajinami <tkajinam@redhat.com> wrote:
> My question is: is anyone in the Nova team working on "fixing" this
> problem? [...]
FWIW, we discussed this yesterday at our vPTG: https://etherpad.opendev.org/p/nova-bobcat-ptg#L289

Most of the problems come from volume detach. We also merged some Tempest changes to stop trying to clean up some volumes if the test passed (thanks Dan for this). We also added more verifications to wait a bit for SSH before calling the instance. Eventually, as you can see in the etherpad, we didn't find any solution, but we'll try to add a canary job that tests volume attach/detach multiple times.

We'll also continue to discuss the CI failures during every Nova weekly meeting (Tuesdays @ 1600 UTC on #openstack-nova), and I want to ask for a cross-project session at the Vancouver PTG for Tempest/Cinder/Nova and others. I'll leave other SMEs to reply to your other points, like for c9s.
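As a rough sketch, such a canary could loop attach/detach with openstacksdk and fail on the first stuck detach. The cloud name, UUIDs and waits below are placeholders, this is not the actual job definition, and the exact SDK proxy signatures may differ between releases:

    import openstack  # openstacksdk

    conn = openstack.connect(cloud="devstack")
    server = conn.compute.get_server("SERVER_UUID")
    volume = conn.block_storage.get_volume("VOLUME_UUID")

    for i in range(20):
        attachment = conn.compute.create_volume_attachment(
            server, volume_id=volume.id)
        conn.block_storage.wait_for_status(volume, status="in-use", wait=120)

        conn.compute.delete_volume_attachment(attachment, server)
        # A detach stuck in libvirt/qemu surfaces here as a timeout:
        # the volume never goes back to "available".
        conn.block_storage.wait_for_status(volume, status="available",
                                           wait=120)
        print(f"attach/detach cycle {i} OK")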
> I've been running that specific scenario test in centos stream 9 jobs
> in puppet repos but I've never seen this issue, so I suspect the issue
> is really specific to libvirt in Jammy.
Well, maybe I'm wrong, but no: we also have a CentOS 9 Stream issue with volume detach: https://bugs.launchpad.net/nova/+bug/1960346
Thank you, Sylvain, for all these inputs!

On Thu, Mar 30, 2023 at 7:10 PM Sylvain Bauza <sbauza@redhat.com> wrote:
> FWIW, we discussed this yesterday at our vPTG:
> https://etherpad.opendev.org/p/nova-bobcat-ptg#L289 [...] we'll try to
> add a canary job that tests volume attach/detach multiple times.
It's good to hear that the issue is still getting attention. I'll catch up on the discussion by reading the etherpad and will try to attend follow-up discussions if possible, especially if I can attend the Vancouver PTG.

I know some changes have been proposed to check ssh-ability to work around the problem (though the comments in the vPTG session indicate that does not fully solve it), but it's still annoying, because we don't really block resource deletions based on instance status (especially its internal task state), so we eventually need a real solution here to avoid this problem, IMHO.
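For reference, the core of such an ssh-ability check can be as small as waiting for the guest to accept TCP connections on port 22 before any detach or delete. A minimal sketch follows; the host and timeouts are placeholders, and the real Tempest validation actually logs in and checks the guest rather than just probing the port:

    import socket
    import time

    def wait_for_ssh(host, port=22, timeout=300, interval=5):
        """Return True once something accepts connections on the SSH port."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                with socket.create_connection((host, port), timeout=interval):
                    return True
            except OSError:
                time.sleep(interval)
        return False

    assert wait_for_ssh("192.0.2.10"), "guest never became reachable"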
> Well, maybe I'm wrong, but no: we also have a CentOS 9 Stream issue
> with volume detach: https://bugs.launchpad.net/nova/+bug/1960346
I just managed to launch a c9s job in heat, and it seems the issue is reproducible on c9s as well[1]. I'll rerun the job a few more times to see how frequently the issue appears on c9s compared to Ubuntu. We do not run many tests in the puppet jobs, so that might be why I've never hit it there.

[1] https://review.opendev.org/c/openstack/heat/+/879014
On Thu, 2023-03-30 at 19:54 +0900, Takashi Kajinami wrote:
> I just managed to launch a c9s job in heat, and it seems the issue is
> reproducible on c9s as well[1].
> [1] https://review.opendev.org/c/openstack/heat/+/879014
Yeah, I replied in parallel; in my other reply I noted that we saw this issue first on c9s, then on Ubuntu, and we also see it in our internal downstream CI. Changing the distro we use for the devstack jobs won't help unless we downgrade libvirt and qemu to before the original change in libvirt was made, which would break other things.
Thanks, Sean, for these replies. They make sense to me.

As I mentioned in my earlier reply, I ran the c9s jobs several times and did confirm the issue can be reproduced on c9s. (The attempts can be found here: https://review.opendev.org/c/openstack/heat/+/879014/1 )

The interesting finding was that the issue appears much less frequently on c9s than on Ubuntu: it was reproduced on c9s once, but I didn't hit it again during rechecks, while the Ubuntu jobs were consistently blocked by the libvirt problem. I don't know what is causing that difference, but I'm sharing my observation in case it is also interesting to other people.

On Thu, Mar 30, 2023 at 8:18 PM Sean Mooney <smooney@redhat.com> wrote:
> Yeah, I replied in parallel; in my other reply I noted that we saw this
> issue first on c9s, then on Ubuntu, and we also see it in our internal
> downstream CI. [...]
On Thu, 2023-03-30 at 12:10 +0200, Sylvain Bauza wrote:
> Since we migrated our jobs from Ubuntu Focal to Ubuntu Jammy, heat gate
> jobs have become very flaky. Further investigation revealed that the
> issue is related to something in libvirt in Ubuntu Jammy that prevents
> detaching devices from instances[1].
For what it's worth, this is not a problem that is new in Jammy; it also affects the libvirt/qemu versions in Focal and in CentOS 9 Stream. This detach issue was introduced in qemu as a side effect of fixing a security issue. We mostly mitigated the impact on Focal with some tempest changes, but not entirely.
> My question is: is anyone in the Nova team working on "fixing" this
> problem?
Yes and no. We cannot fix this in nova, as it's not a nova issue; it's an issue with qemu/libvirt and possibly cirros.
One possible "fix" is to stop using cirros, so I did a few things last night. First I tried using the ubuntu minimal cloud image. This is a stripped-down image that is smaller and uses less memory. While it could boot with the normal cirros flavor with 128MB of RAM, cloud-init OOMed; fortunately that was after ssh was set up, so I could log in, but it's too close to the memory limit to use.

My second attempt was to revive my alpine diskimage-builder series: https://review.opendev.org/c/openstack/diskimage-builder/+/755410
That now works and generates a really lightweight image (it uses about 30MB of RAM while idle). I am going to try creating a job that will use that instead of cirros. For now I'm just going to use a pre playbook to build the image in the job and make devstack use it instead.
> We might be able to implement some workaround (like checking the status
> of an instance before attempting to delete it), but this should be
> fixed on the libvirt side IMO, as this looks like a "regression" in
> Ubuntu Jammy.

This is not new in Jammy, and it also affects RHEL 9.
I am very, very surprised this is not causing us a lot of internal pain in our downstream CI, as it was breaking CentOS 9 before it started affecting Ubuntu. We have seen downstream detach issues, but the sshable changes in tempest mostly helped. So this is not just an Ubuntu issue; it's affecting all distros, including RHEL.

This is the upstream libvirt bug for the current problem: https://gitlab.com/libvirt/libvirt/-/issues/309
The downstream tracker for the libvirt team to actually fix this is https://bugzilla.redhat.com/show_bug.cgi?id=2087047 ; I have left a comment there to see if I can move that along.
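To illustrate the failure mode tracked in those bugs, a standalone check could issue a detach through virsh and watch whether the device actually leaves the domain. The domain and target names below are placeholders, and this is only a sketch, not the official reproducer from the bug reports:

    import subprocess
    import time

    DOMAIN = "instance-00000001"  # placeholder libvirt domain name
    TARGET = "vdb"                # placeholder disk target to detach

    subprocess.run(["virsh", "detach-disk", DOMAIN, TARGET], check=True)

    # With the affected qemu/libvirt combination, the detach request is
    # accepted but the device-deleted event never arrives, so the disk
    # stays in the domain's block device list indefinitely.
    for _ in range(60):
        out = subprocess.run(["virsh", "domblklist", DOMAIN],
                             capture_output=True, text=True,
                             check=True).stdout
        if TARGET not in out:
            print("detach completed")
            break
        time.sleep(2)
    else:
        print("detach appears stuck: device still attached after 120s")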
> We'll also continue to discuss the CI failures during every Nova weekly
> meeting (Tuesdays @ 1600 UTC on #openstack-nova) [...] I'll leave other
> SMEs to reply to your other points, like for c9s.

c9s hit this before Ubuntu did; it will not help.