[gate] many jobs failing for nova-compute libvirt driver error
Hey all,
This is just a FYI that we've got another gate failure where nova-compute is failing with the following error raised from the libvirt driver [1]:
"TypeError: Parameterized generics cannot be used with class"
which was raised from the eventlet.tpool code. (I don't yet understand what this error means exactly.)
Because of this, we suspect this is related to the eventlet package version bump from 0.26.1 to 0.28.0 that happened in the upper-constraints change to openstack/requirements that merged earlier today:
https://review.opendev.org/#/c/750084/47/upper-constraints.txt@150
But we're not certain yet exactly what is happening and whether this ^ is indeed the cause.
I am currently investigating to find the root cause of failure and how to fix it/confirm which package we should pin for now. I am trying out a DNM nova patch to pin eventlet to 0.26.1 to see what happens, while I look into more depth at what the error means and whether it's something we should fix in the nova code.
If anyone can help, it would be appreciated.
Best,
-melanie
[1] https://zuul.opendev.org/t/openstack/build/20f1d309663347c28112b711b82b9c03/...
I've updated subject line to more specifically describe the issue. Details inline.
On 10/23/20 12:08, melanie witt wrote:
Hey all,
This is just a FYI that we've got another gate failure where nova-compute is failing with the following error raised from the libvirt driver [1]:
"TypeError: Parameterized generics cannot be used with class"
which was raised from the eventlet.tpool code. (I don't yet understand what this error means exactly.)
Because of this, we suspect this is related to the eventlet package version bump from 0.26.1 to 0.28.0 that happened in the upper-constraints change to openstack/requirements that merged earlier today:
https://review.opendev.org/#/c/750084/47/upper-constraints.txt@150
But we're not certain yet exactly what is happening and whether this ^ is indeed the cause.
I am currently investigating to find the root cause of failure and how to fix it/confirm which package we should pin for now. I am trying out a DNM nova patch to pin eventlet to 0.26.1 to see what happens, while I look into more depth at what the error means and whether it's something we should fix in the nova code.
Update: I tried a DNM patch to revert the eventlet bump and nova-compute still failed with the same error. So it's not the eventlet package version.
I tried another DNM patch to revert the libvirt-python bump from 6.6.0 to 6.8.0:
https://review.opendev.org/759552
and rechecked my nova patch that Depends-On it:
https://review.opendev.org/759506
And nova-compute runs successfully. So it's definitely the libvirt-python patch bump that's causing the failure.
Yet, on the upper-constraints bump patch, nova-compute ran fine with the new libvirt-python version 6.8.0. WHY?!
I found that the upper-constraints patch ran tempest with python 3.8 and it passed with libvirt-python 6.8.0 [2] BUT our jobs (and some other projects' jobs) run with python 3.6 [3] and they fail with libvirt-python 6.8.0.
So there appears to be some kind of incompatibility between python 3.6 and libvirt-python 6.8.0. I'm not sure what we can do here besides revert the libvirt-python bump and go back to 6.6.0. AFAIK we still need to have integration test coverage for python 3.6 (correct me if I'm wrong). I'm also not sure if there's a way we could configure libvirt-python 6.8.0 for python 3.8 and libvirt-python 6.6.0 for python 3.6.
Our jobs running python 3.8 are passing: nova-next, nova-ceph-multistore, nova-multi-cell, tempest-ipv6-only.
Our jobs running python 3.6 are failing: tempest-integrated-compute, grenade, nova-live-migration.
I had a look through the libvirt-python repo to see what changed between 6.6.0 and 6.8.0 and can't tell what the root cause is. We get the TypeError raised in nova-compute when we call the libvirt openAuth method but nothing in this diff [4] looks problematic?
I've opened a bug to capture the info I found so far:
https://bugs.launchpad.net/nova/+bug/1901383
-melanie
[1] https://zuul.opendev.org/t/openstack/build/20f1d309663347c28112b711b82b9c03/...
[2] https://zuul.opendev.org/t/openstack/build/81f1f400b43b4e22bdae3ce1c2de92a8/...
[3] https://zuul.opendev.org/t/openstack/build/96b64d8e869f46d1a88ee0f728f013f6/...
[4] https://github.com/libvirt/libvirt-python/compare/v6.6.0...v6.8.0#diff-a3d04...
On Sun, Oct 25, 2020 at 12:45:57AM -0700, melanie witt wrote:
I've updated subject line to more specifically describe the issue. Details inline.
[...] Hi, Mel
I tried another DNM patch to revert the libvirt-python bump from 6.6.0 to 6.8.0:
https://review.opendev.org/759552
and rechecked my nova patch that Depends-On it:
https://review.opendev.org/759506
And it nova-compute runs successfully. So it's definitely the libvirt-python patch bump that's causing the failure.
Yet, on the upper-constraints bump patch, nova-compute ran fine with the new libvirt-python version 6.8.0. WHY?!
Yeah, that really is puzzling / strange.
I found that the upper-constraints patch ran tempest with python 3.8 and it passed with libvirt-python 6.8.0 [2] BUT our jobs (and some other projects jobs) run with python 3.6 [3] and it fails with libvirt-python 6.8.0.
So there appears to be some kind of incompatibility between python 3.6 and libvirt-python 6.8.0. I'm not sure what we can do here besides revert the libvirt-python bump and go back to 6.6.0. AFAIK we still need to have integration test coverage for python 3.6 still (correct me if I'm wrong). I'm also not sure if there's way we could configure libvirt-python 6.8.0 for python 3.8 and libvirt python 6.6.0 for python 3.6.
Our jobs running python 3.8 are passing: nova-next, nova-ceph-multistore, nova-multi-cell, tempest-ipv6-only.
Our jobs running python 3.6 are failing: tempest-integrated-compute, grenade, nova-live-migration.
I had a look through the libvirt-python repo to see what changed between 6.6.0 and 6.8.0 and can't tell what's the root cause. We get the TypeError raised in nova-compute when we call the libvirt openAuth method but nothing in this diff [4] looks problematic?
Great sleuthing so far. (Not to mention the terribly cryptic original error message.)
As a small addendum, I was talking to Dan Berrangé from libvirt and pointed to the openAuth() method ... and he wrote a quick 'libvirt-python' unit test for the openAuth() method:
https://gitlab.com/libvirt/libvirt-python/-/merge_requests/28/diffs
... and it seems to be working as expected (based on the CI result):
https://gitlab.com/libvirt/libvirt-python/-/merge_requests/28/pipelines
... it seems to validate several operating system platforms (and Python versions), including Ubuntu 18.04:
https://gitlab.com/berrange/libvirt-python/-/pipelines/207610554
I've opened a bug to capture the info I found so far:
https://bugs.launchpad.net/nova/+bug/1901383
-melanie
[1] https://zuul.opendev.org/t/openstack/build/20f1d309663347c28112b711b82b9c03/...
[2] https://zuul.opendev.org/t/openstack/build/81f1f400b43b4e22bdae3ce1c2de92a8/...
[3] https://zuul.opendev.org/t/openstack/build/96b64d8e869f46d1a88ee0f728f013f6/...
[4] https://github.com/libvirt/libvirt-python/compare/v6.6.0...v6.8.0#diff-a3d04...
-- /kashyap
On Mon, Oct 26, 2020 at 01:17:17PM +0100, Kashyap Chamarthy wrote: [...]
Great sleuthing so far. (Not to mention the terribly cryptic original error message.)
As a small addendum, was talking to Dan Berrangé from libvirt and pointed to the openAuth() method ... and he wrote a quick 'libvirt-python' unit test for the openAuth() method:
https://gitlab.com/libvirt/libvirt-python/-/merge_requests/28/diffs
... and it seems to be working as expected (based on the CI result):
https://gitlab.com/libvirt/libvirt-python/-/merge_requests/28/pipelines
... it seems to validate several operating system platforms (and Python versions), including Ubuntu 18.04:
https://gitlab.com/berrange/libvirt-python/-/pipelines/207610554
So ... Dan was able to reproduce the problem; the culprit was the combination of thread pool, Python 3.6.x and type hinting; the above unit test from Dan didn't make use of 'tpool'. He says:
tpool + Python 3.6.x + type hinting is fubar
If I add 'tpool' into my new unit tests, then it crashes & burns with the error msg Mel got
[...]
-- /kashyap
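For readers puzzled by the error text: Python's typing module raises a TypeError of this kind when a parameterized generic such as typing.List[int] is used as the class argument of an isinstance() or issubclass() check. The snippet below is only a minimal sketch of that class of failure. It does not use eventlet or libvirt-python, it is not Dan's unit test, and it assumes the failing check is an isinstance() call of this sort made by a proxy layer; the names Connection and should_wrap are hypothetical.

```python
import typing


class Connection:
    """Hypothetical stand-in for a binding class a proxy might wrap."""


def should_wrap(value, autowrap):
    # Hypothetical helper: the only step that matters here is an
    # isinstance() check against a tuple of candidate classes.
    return isinstance(value, autowrap)


# Only real classes in the tuple: the check works.
wrapped = should_wrap(Connection(), (Connection,))
print(wrapped)

# A parameterized generic in the tuple: the check raises TypeError once a
# value fails to match the entries that come before it.
message = None
try:
    should_wrap("some string", (Connection, typing.List[int]))
except TypeError as exc:
    message = str(exc)
    print(message)
```

The exact wording of the message differs between Python releases, so the sketch prints whatever the running interpreter produces rather than asserting a specific string.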
On 26-10-20 13:32:20, Kashyap Chamarthy wrote:
On Mon, Oct 26, 2020 at 01:17:17PM +0100, Kashyap Chamarthy wrote:
[...]
Great sleuthing so far. (Not to mention the terribly cryptic original error message.)
As a small addendum, was talking to Dan Berrangé from libvirt and pointed to the openAuth() method ... and he wrote a quick 'libvirt-python' unit test for the openAuth() method:
https://gitlab.com/libvirt/libvirt-python/-/merge_requests/28/diffs
... and it seems to be working as expected (based on the CI result):
https://gitlab.com/libvirt/libvirt-python/-/merge_requests/28/pipelines
... it seems to validate several operating system platforms (and Python versions), including Ubuntu 18.04:
https://gitlab.com/berrange/libvirt-python/-/pipelines/207610554
So ... Dan was able to reproduce the problem; the culprit was the combination of thread pool, Python 3.6.x and type hinting; the above unit test from Dan didn't make use of 'tpool'. He says:
tpool + Python 3.6.x + type hinting is fubar
If I add 'tpool' into my new unit tests, then it crashes & burns with the error msg Mel got
[...]
I've pushed https://review.opendev.org/#/c/759831/ based on the feedback from danpb, let's see if it resolves the issue.
-- Lee Yarwood A5D1 9385 88CB 7E5F BE64 6618 BCA6 6E33 F672 2D76
On 10/27/20 02:08, Lee Yarwood wrote:
I've pushed https://review.opendev.org/#/c/759831/ based on the feedback from danpb, let's see if it resolves the issue.
Thanks to all for picking this up and running with it! I'm happy to see we've got a real fix for this and we don't have to revert the libvirt-python version bump.
Cheers,
-melanie
On 27-10-20 09:08:17, Lee Yarwood wrote:
I've pushed https://review.opendev.org/#/c/759831/ based on the feedback from danpb, let's see if it resolves the issue.
Thus far we've hit the following issues in the gate:
nova-grenade-multinode
----------------------
libvirt.libvirtError: internal error: missing block job data for disk 'vda'
https://bugs.launchpad.net/nova/+bug/1901739
nova-next
---------
volume delete fails because cinder-rootwrap lvs fails with exit code 139
https://bugs.launchpad.net/cinder/+bug/1901783
grenade
-------
60_nova/resources.sh:106:ping_check_public fails intermittently
https://bugs.launchpad.net/neutron/+bug/1463631
The nova-ceph-multistore job has also failed, timing out after g-api issues that I've not been able to nail down.
frickler has suggested force merging this so we can open the gate back up for other projects. I'm not against this but the ultimate decision rests with gibi as PTL.
I've also just noticed that the change is now somehow in the check and gate queues at the same time?! It has already passed the check queue a few times so I'm not worried about this but I wonder if this is a zuul bug of some kind?
-- Lee Yarwood A5D1 9385 88CB 7E5F BE64 6618 BCA6 6E33 F672 2D76
On 2020-10-28 09:52:40 +0000 (+0000), Lee Yarwood wrote: [...]
I've also just noticed that the change is now somehow in the check and gate queues at the same time?! It has already passed the check queue a few times so I'm not worried about this but I wonder if this is a zuul bug of some kind?
Not a Zuul bug, I was directly enqueuing it into the gate pipeline (as a Zuul administrator) in order to attempt to speed up merging, after discussing the change with melwitt in #openstack-infra. Since that's an unusual circumstance, we haven't configured the gate pipeline in the openstack tenant to supersede the check pipeline, thus builds there are not cancelled when it happens.
-- Jeremy Stanley
participants (4): Jeremy Stanley, Kashyap Chamarthy, Lee Yarwood, melanie witt