[nova][dev] Fixing eventlet monkey patching in Nova
We're hitting an issue with eventlet monkey patching in OSP15, which uses python 3.6. The issue manifests as infinite recursion when a nova service connects to another openstack service over SSL. The immediate cause is that we import urllib3 before calling eventlet.monkey_patch(). I can't find the precise details now, but the reason for this relates to initialisation done by either the ssl or urllib3 libraries at import time which is therefore not patched by eventlet.

The detail isn't that important, in any case. The issue may just be specific to a particular combination of library versions we're using in OSP, or possibly that our default configuration exercises something we're not doing in the gate. Regardless, unless we change where we monkey patch we're going to hit this again in the future with other library quirks. Our own eventlet best practices say: "When using eventlet.monkey_patch, do it first or not at all."[1] We're not doing that for either type of service:

* non-wsgi

For non-wsgi services we monkey patch in nova/cmd/__init__.py with the call utils.monkey_patch(). At first glance that appears to be very early, but note that this required importing nova.utils first, which in turn imports the world, and all before monkey patching. I believe this was regressed in Stein by change Ie7bf5d012e2ccbcd63c262ddaf739782afcdaf56[2], which was attempting to fix the wsgi case.

* wsgi

For wsgi services our entry point is nova.api.openstack.wsgi_app, which calls utils.monkey_patch(). This suffers the same problem as the non-wsgi case, but additionally suffers from having loaded nova/api/openstack/__init__.py first, which also imports the world.

Incidentally, as noted in change Ie7bf5d012e2ccbcd63c262ddaf739782afcdaf56, we *do* currently require eventlet in at least the nova api service, even when running as a wsgi service. This has not regressed the wsgi case, though, as that would already have been broken due to the libraries imported by nova/api/openstack/__init__.py.

Lee Yarwood originally posted (and I took over from him) https://review.openstack.org/#/c/626952/ with the intention of fixing the wsgi case. We semi-abandoned it because, while the fix seemed to make sense, it didn't fix the wsgi case when tested. I now realise that this is due to the libraries loaded by nova/api/openstack/__init__.py. I have resubmitted this patch as it fixes the non-wsgi case, and also updated the commit message to be accurate according to my current understanding.

However, fixing the wsgi case is more complicated. We can't do the monkey patching in nova.api.openstack.wsgi_app because nova/api/openstack/__init__.py imports the world.

I believe we have 3 options:

1. Don't import the world in nova/api/openstack/__init__.py
2. Change the wsgi entry point to, e.g., nova.wsgi_app
3. Do monkey patching in nova/__init__.py

Of the above, I haven't investigated either 1 or 2 in any great detail. I suspect (1) would be quite invasive, and (2) would presumably have a deployment impact. I believe that the reason we haven't previously done (3) has been the desire not to monkey patch in certain circumstances. However, as it stands today we always need eventlet, so (3) seems like an expedient route. The code change required is very small.

In testing the above I threw up 2 follow-on patches:

* https://review.openstack.org/643579

This makes a couple of cleanups, but significantly adds an assertion that problem libraries haven't been imported before monkey patching (a sketch of the idea below).
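(Editor's note: such an import-order assertion might look roughly like the following. This is a minimal sketch of the idea, not the actual patch; the module list and error message are made up for the example.)

import sys

# Illustrative only: fail loudly if eventlet-sensitive modules were
# imported before we got a chance to monkey patch.
_PROBLEM_MODULES = ('urllib3',)

already_loaded = [name for name in _PROBLEM_MODULES if name in sys.modules]
if already_loaded:
    raise RuntimeError('monkey_patch() called after importing: %s'
                       % ', '.join(already_loaded))

import eventlet
eventlet.monkey_patch()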
As expected, this kills nova-api (wsgi):

http://logs.openstack.org/79/643579/1/check/tempest-full-py3/94b0156/control...

but not nova-compute (non-wsgi):

http://logs.openstack.org/79/643579/1/check/tempest-full-py3/94b0156/control...

Interestingly, I also bumped the minimum eventlet version in that patch to remove some no-longer-required cruft from oslo_service. If I'm reading this failure correctly, we're pinning eventlet to version 0.18.2 in the gate. Why such an old eventlet? I can't see where this pin is coming from, but my (unsubstantiated) guess is that this is the principal difference from OSP, where we're currently installing eventlet 0.24.1:

http://logs.openstack.org/79/643579/1/check/requirements-check/fc9f093/job-o...

* https://review.openstack.org/#/c/643581/

Here I just switched to top-level monkey patching. This is still running, but at first glance the tempest runs don't seem to have failed early, which they would have, given that the patch is stacked on top of the assertions from the prior patch.

TL;DR: I want to move monkey patching to nova/__init__.py because, for now at least, we always require it anyway. When we eventually remove the requirement for wsgi services to monkey patch we can move monkey patching back to nova/cmd/__init__.py.

Matt

[1] https://specs.openstack.org/openstack/openstack-specs/specs/eventlet-best-pr...
[2] https://review.openstack.org/#/c/592285/

--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)
*shakes fist at eventlet*

*writes more useful things inline*

On 3/15/19 12:47 PM, Matthew Booth wrote:
We're hitting an issue with eventlet monkey patching in OSP15, which uses python 3.6. The issue manifests as infinite recursion when a nova service connects to another openstack service over SSL. The immediate cause is that we import urllib3 before calling eventlet.monkey_patch(). I can't find the precise details now, but the reason for this relates to initialisation done by either the ssl or urllib3 libraries at import time which is therefore not patched by eventlet.
The detail isn't that important, in any case. The issue may just be specific to a particular combination of library versions we're using in OSP, or possibly that our default configuration exercises something we're not doing in the gate. Regardless, unless we change where we monkey patch we're going to hit this again in the future with other library quirks. Our own eventlet best practices say: "When using eventlet.monkey_patch, do it first or not at all."[1] We're not doing that for either type of service:
* non-wsgi
For non-wsgi services we monkey patch in nova/cmd/__init__.py with the call utils.monkey_patch(). At first glance that appears to be very early, but note that this required importing nova.utils first, which in turn imports the world, and all before monkey patching.
I believe this was regressed in Stein by change Ie7bf5d012e2ccbcd63c262ddaf739782afcdaf56[2], which was attempting to fix the wsgi case.
* wsgi
For wsgi services our entry point is nova.api.openstack.wsgi_app, which calls utils.monkey_patch(). This suffers the same problem as the non-wsgi case, but additionally suffers from having loaded nova/api/openstack/__init__.py first, which also imports the world.
Incidentally, as noted in change Ie7bf5d012e2ccbcd63c262ddaf739782afcdaf56 we *do* currently require eventlet in at least the nova api service, even when running as a wsgi service. This has not regressed the wsgi case, though, as that would already have been broken due to the libraries imported by nova/api/openstack/__init__.py.
Lee Yarwood originally posted (and I took over from him) https://review.openstack.org/#/c/626952/ with the intention of fixing the wsgi case. We semi-abandoned it because, while the fix seemed to make sense, it didn't fix the wsgi case when tested. I now realise that this is due to the libraries loaded by nova/api/openstack/__init__.py. I have resubmitted this patch as it fixes the non-wsgi case, and also updated the commit message to be accurate according to my current understanding.
However, fixing the wsgi case is more complicated. We can't do the monkey patching in nova.api.openstack.wsgi_app because nova/api/openstack/__init__.py imports the world.
I believe we have 3 options:
1. Don't import the world in nova/api/openstack/__init__.py
2. Change the wsgi entry point to, e.g., nova.wsgi_app
3. Do monkey patching in nova/__init__.py
Of the above, I haven't investigated either 1 or 2 in any great detail. I suspect (1) would be quite invasive, and (2) would presumably have a deployment impact. I believe that the reason we haven't previously done (3) has been the desire not to monkey patch in certain circumstances. However, as it stands today we always need eventlet, so (3) seems like an expedient route. The code change required is very small.
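(Editor's note: concretely, option (3) would amount to something like the following at the very top of nova/__init__.py. A minimal sketch of the idea, not the actual patch.)

# First lines of nova/__init__.py: patch before any other nova module
# gets a chance to import the world.
import eventlet

eventlet.monkey_patch()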
According to [0], the monkey-patching wasn't done in nova/__init__.py because it was supposed to be able to run without eventlet under WSGI. If that's not the case, then it's probably fine to do it there.

[0] http://specs.openstack.org/openstack/openstack-specs/specs/eventlet-best-pra...

"Monkey patching should also be done in a way that allows services to run without it, such as when an API service runs under Apache. This is the reason for Nova not simply monkey patching in nova/__init__.py."
In testing the above I threw up 2 follow-on patches:
* https://review.openstack.org/643579
This makes a couple of cleanups, but significantly adds an assertion that problem libraries haven't been imported before monkey patching. As expected, this kills nova-api (wsgi):
http://logs.openstack.org/79/643579/1/check/tempest-full-py3/94b0156/control...
but not nova-compute (non-wsgi):
http://logs.openstack.org/79/643579/1/check/tempest-full-py3/94b0156/control...
Interestingly, I also bumped the minimum eventlet version in that patch to remove some no-longer-required cruft from oslo_service. If I'm reading this failure correctly, we're pinning eventlet to version 0.18.2 in the gate. Why such an old eventlet? I can't see where this pin is coming from, but my (unsubstantiated) guess is that this is the principal difference from OSP, where we're currently installing eventlet 0.24.1:
http://logs.openstack.org/79/643579/1/check/requirements-check/fc9f093/job-o...
I left a couple of comments on the review about the requirements issues.
* https://review.openstack.org/#/c/643581/
Here I just switched to top-level monkey patching. This is still running, but at first glance the tempest runs don't seem to have failed early, which they would have, given that the patch is stacked on top of the assertions from the prior patch.
TL;DR I want to move monkey patching to nova/__init__.py because, for now at least, we always require it anyway. When we eventually remove the requirement for wsgi services to monkey patch we can move monkey patching back to nova/cmd/__init__.py.
This sounds fine to me, but it's been a loooong time since I looked at this and I'm not familiar with some of the more recent changes you talked about, so take my opinion with a grain of salt.
Matt
[1] https://specs.openstack.org/openstack/openstack-specs/specs/eventlet-best-pr...
[2] https://review.openstack.org/#/c/592285/
On Fri, 15 Mar 2019, Ben Nemec wrote:
According to [0], the monkey-patching wasn't done in nova/__init__.py because it was supposed to be able to run without eventlet under WSGI. If that's not the case, then it's probably fine to do it there.
0: http://specs.openstack.org/openstack/openstack-specs/specs/eventlet-best-pra...
"Monkey patching should also be done in a way that allows services to run without it, such as when an API service runs under Apache. This is the reason for Nova not simply monkey patching in nova/__init__.py."
Yeah, I think several people have run through the eventlet initialization at various times, including me, with various results, none of them quite good enough.

The mod_wsgi/uwsgi side of things strived to be eventlet free as it makes for weirdness, and at some point I did some work to be sure it never showed up in placement [1] while it was still in nova. A side effect of that work was that it also didn't need to show up in the nova-api unless you used the cmd line script version of it. At the same time I also explored not importing the world, and was able to get some improvements (mostly by moving things out of __init__.py in packages that had deeply nested members) but not as much as I would have liked.

However, we later (as mentioned elsewhere in the thread) made getting cell mappings parallelized, bringing back the need for eventlet, it seems.
TL;DR I want to move monkey patching to nova/__init__.py because, for now at least, we always require it anyway. When we eventually remove the requirement for wsgi services to monkey patch we can move monkey patching back to nova/cmd/__init__.py.
This sounds fine to me, but it's been a loooong time since I looked at this and I'm not familiar with some of the more recent changes you talked about, so take my opinion with a grain of salt.
This does seem like the expedient solution (assuming it works), but if we continue to use eventlet anywhere in the api service it will continue to introduce difficulties and inflexibility with deployment choices, as it will interact poorly with at least some of the web servers that could be chosen to run the nova API.

[1] Recently we managed to get placement to a state where not only does it not use eventlet, it doesn't even use anything that has it as an unused dependency. This made me very happy.

--
Chris Dent
٩◔̯◔۶
https://anticdent.org/
freenode: cdent
tw: @anticdent
On Fri, 15 Mar 2019 18:53:35 +0000 (GMT), Chris Dent <cdent+os@anticdent.org> wrote:
The mod_wsgi/uwsgi side of things strived to be eventlet free as it makes for weirdness, and at some point I did some work to be sure it never showed up in placement [1] while it was still in nova. A side effect of that work was that it also didn't need to show up in the nova-api unless you used the cmd line script version of it. At the same time I also explored not importing the world, and was able to get some improvements (mostly by moving things out of __init__.py in packages that had deeply nested members) but not as much as I would have liked.

However, we later (as mentioned elsewhere in the thread) made getting cell mappings parallelized, bringing back the need for eventlet, it seems.
Is there an alternative we could use for threading in the API that is compatible with python 2.7? I'd be happy to convert the cells scatter-gather code to use it, if so. -melanie
On Fri, 2019-03-15 at 12:15 -0700, melanie witt wrote:
On Fri, 15 Mar 2019 18:53:35 +0000 (GMT), Chris Dent <cdent+os@anticdent.org> wrote:
The mod_wsgi/uwsgi side of things strived to be eventlet free as it makes for weirdness, and at some point I did some work to be sure it never showed up in placement [1] while it was still in nova. A side effect of that work was that it also didn't need to show up in the nova-api unless you used the cmd line script version of it. At the same time I also explored not importing the world, and was able to get some improvements (mostly by moving things out of __init__.py in packages that had deeply nested members) but not as much as I would have liked.

However, we later (as mentioned elsewhere in the thread) made getting cell mappings parallelized, bringing back the need for eventlet, it seems.
Is there an alternative we could use for threading in the API that is compatible with python 2.7? I'd be happy to convert the cells scatter-gather code to use it, if so.

It's a little heavyweight, but Python multiprocessing or explicit threads would work.
Taking the multiprocessing example from https://docs.python.org/2/library/multiprocessing.html:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))

You basically could create a pool of X processes and map the function across the pool. X could either be a fixed parallelism factor or the total number of cells. If f is a function that takes the cell and lists the instances, then p.map(f, ["cell1", "cell2", ...]) returns the set of results from each of the concurrent executions, but it also blocks until they are completed.

Eventlet gives you concurrency, which means we interleave the requests but only one request is executing at any one time. Using multiprocessing will give real parallelism but no concurrency, as we will block until all the parallel requests are completed.

You can kind of get the best of both worlds by submitting the requests asynchronously, as is shown in the later example https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-worke...

# launching multiple evaluations asynchronously *may* use more processes
multiple_results = [pool.apply_async(os.getpid, ()) for i in range(4)]
print [res.get(timeout=1) for res in multiple_results]

This allows you to submit multiple requests to the pool in parallel, then retrieve them with a timeout after all requests are submitted, allowing us to limit the time we wait if there is a slow or down cell.

wsgi also provides an additional layer of parallelism, as each instance of the api should be serving only one request, and that parallelism is managed by uwsgi, or by apache if using mod_wsgi.

I'm not sure if multiprocessing is warranted in this case, but if we did use it we should probably create the pool once and reuse it rather than inline in the scatter-gather function.

Anyway, that's how I would personally approach this if it was identified as a performance issue, with a config argument to control the number of processes in the pool, but it would definitely be a Train thing as it's a non-trivial change in how this currently works.
-melanie
On 3/15/19 8:03 PM, Sean Mooney wrote:
On Fri, 2019-03-15 at 12:15 -0700, melanie witt wrote:
On Fri, 15 Mar 2019 18:53:35 +0000 (GMT), Chris Dent <cdent+os@anticdent.org> wrote:
The mod_wsgi/uwsgi side of things strived to be eventlet free as it makes for weirdness, and at some point I did some work to be sure it never showed up in placement [1] while it was still in nova. A side effect of that work was that it also didn't need to show up in the nova-api unless you used the cmd line script version of it. At the same time I also explored not importing the world, and was able to get some improvements (mostly by moving things out of __init__.py in packages that had deeply nested members) but not as much as I would have liked.

However, we later (as mentioned elsewhere in the thread) made getting cell mappings parallelized, bringing back the need for eventlet, it seems.
Is there an alternative we could use for threading in the API that is compatible with python 2.7? I'd be happy to convert the cells scatter-gather code to use it, if so.

It's a little heavyweight, but Python multiprocessing or explicit threads would work.
Taking the multiprocessing example from https://docs.python.org/2/library/multiprocessing.html:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))
You basically could create a pool of X processes and map the function across the pool. X could either be a fixed parallelism factor or the total number of cells.

If f is a function that takes the cell and lists the instances, then p.map(f, ["cell1", "cell2", ...]) returns the set of results from each of the concurrent executions, but it also blocks until they are completed.

Eventlet gives you concurrency, which means we interleave the requests but only one request is executing at any one time. Using multiprocessing will give real parallelism but no concurrency, as we will block until all the parallel requests are completed.
You can kind of get the best of both worlds by submitting the requests asynchronously, as is shown in the later example https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-worke...
# launching multiple evaluations asynchronously *may* use more processes
multiple_results = [pool.apply_async(os.getpid, ()) for i in range(4)]
print [res.get(timeout=1) for res in multiple_results]
This allows you to submit multiple requests to the pool in parallel, then retrieve them with a timeout after all requests are submitted, allowing us to limit the time we wait if there is a slow or down cell.
FWIW, we use explicit threads in Zuul and Nodepool and have been very happy with them, since they're explicit and don't require weird monkeypatching.

In sdk, there are a few things that need to be done in the background (when uploading a very large swift object we're going to split it into chunks and upload concurrently). In that case we used concurrent.futures.ThreadPoolExecutor, which creates a pool of threads and allows you to submit jobs into it ... very similar to the using-pool-of-workers example above.

executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)
job_future = executor.submit(some_callable, some_arg, other=arg)
for completed in concurrent.futures.as_completed([job_future]):
    result = completed.result()

https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/objec...
https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/cloud...
wsgi also provides an additional layer of parallelism, as each instance of the api should be serving only one request, and that parallelism is managed by uwsgi, or by apache if using mod_wsgi.

I'm not sure if multiprocessing is warranted in this case, but if we did use it we should probably create the pool once and reuse it rather than inline in the scatter-gather function.

Anyway, that's how I would personally approach this if it was identified as a performance issue, with a config argument to control the number of processes in the pool, but it would definitely be a Train thing as it's a non-trivial change in how this currently works.
-melanie
On 3/15/19 4:41 PM, Monty Taylor wrote:
FWIW, we use explicit threads in Zuul and Nodepool and have been very happy with them, since they're explicit and don't require weird monkeypatching.

In sdk, there are a few things that need to be done in the background (when uploading a very large swift object we're going to split it into chunks and upload concurrently). In that case we used concurrent.futures.ThreadPoolExecutor, which creates a pool of threads and allows you to submit jobs into it ... very similar to the using-pool-of-workers example above.
This is what we did in the oslo.privsep daemon too when we added parallel execution.
On Fri, 2019-03-15 at 17:06 -0500, Ben Nemec wrote:
On 3/15/19 4:41 PM, Monty Taylor wrote:
FWIW, we use explicit threads in Zuul and Nodepool and have been very happy with them- since they're explicit and don't require weird monkeypatching.
In sdk, there are a few things that need to be done in the background (when uploading a very large swift object we're going to split it into chunks and upload concurrently). In that case we used concurrent.futures.ThreadPoolExecutor, which creates a pool of threads and allows you to submit jobs into it ... very similar to the using-pool-of-workers example above.
This is what we did in the oslo.privsep daemon too when we added parallel execution.
Yes, so on that point I wanted to talk to the oslo team about possibly breaking out the core of privsep into oslo.concurrency.

For a lot of the concurrency we need in nova we could create a privsep daemon context with no permissions and simply dispatch the execution of api queries to it, but that feels a bit strange. I was wondering if there was merit in either packaging up the core of the daemon without the actual privilege separation logic, or providing a very thin shim over concurrent.futures. Though by the looks of it there is a python 2.7 backport already: https://pypi.org/project/futures/ I had discounted using concurrent.futures with the ThreadPoolExecutor as I had thought it was python 3 only, but yes, it provides a nice api and workflow for this type of concurrent or parallel dispatch.

It's a topic I would like to discuss with people at the ptg, but I would personally be interested in seeing if we can first remove, where reasonable, our explicit use of eventlet, and second explore transitioning to a more explicit concurrency model if others were open to it, either by delegating execution to an unprivileged privsep daemon or something like concurrent.futures with a thread pool, or an asyncio event loop using https://docs.python.org/3/library/asyncio-future.html. My distaste for eventlet has mellowed over the last year or two as debuggers have actually begun to understand gevent and greenlets, but if I had an architectural wand to change one thing in the U cycle it would be breaking our eventlet dependency and moving to functionality from the standard library instead when we are python 3 only.
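(Editor's note: for reference, the asyncio route mentioned above would look roughly like this. Python 3 only, and list_instances and the cell names here are made up for the example.)

import asyncio

async def list_instances(cell):
    # Stand-in for a real DB/API call.
    await asyncio.sleep(0.1)
    return '%s: ok' % cell

async def scatter_gather(cells):
    # Run all cell queries concurrently on one event loop.
    return await asyncio.gather(*(list_instances(c) for c in cells))

loop = asyncio.get_event_loop()
print(loop.run_until_complete(scatter_gather(['cell1', 'cell2'])))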
On Tue, Mar 19, 2019 at 6:57 PM Sean Mooney <smooney@redhat.com> wrote:
On Fri, 2019-03-15 at 17:06 -0500, Ben Nemec wrote:
On 3/15/19 4:41 PM, Monty Taylor wrote:
[snip]
In sdk, there are a few things that need to be done in the background (when uploading a very large swift object and we're going to split it in to chunks and upload concurrently) In that case we used concurrent.futures.ThreadPoolExecutor which creates a pool of threads and allows you to submit jobs into it ... very simiar to the using-pool-of-workers example above.
This is what we did in the oslo.privsep daemon too when we added parallel execution.
Yes, so on that point I wanted to talk to the oslo team about possibly breaking out the core of privsep into oslo.concurrency.

For a lot of the concurrency we need in nova we could create a privsep daemon context with no permissions and simply dispatch the execution of api queries to it, but that feels a bit strange. I was wondering if there was merit in either packaging up the core of the daemon without the actual privilege separation logic, or providing a very thin shim over concurrent.futures. Though by the looks of it there is a python 2.7 backport already: https://pypi.org/project/futures/ I had discounted using concurrent.futures with the ThreadPoolExecutor as I had thought it was python 3 only, but yes, it provides a nice api and workflow for this type of concurrent or parallel dispatch.
Is there a lot of gain to splitting the concurrency out of privsep? The lack of escalated permissions for a privsep'd function is a very small (one line IIRC) change. I haven't looked at the API for parallel privsep to be honest. How does it handle returning results to callers?
It's a topic I would like to discuss with people at the ptg, but I would personally be interested in seeing if we can first remove, where reasonable, our explicit use of eventlet, and second explore transitioning to a more explicit concurrency model if others were open to it, either by delegating execution to an unprivileged privsep daemon or something like concurrent.futures with a thread pool, or an asyncio event loop using https://docs.python.org/3/library/asyncio-future.html. My distaste for eventlet has mellowed over the last year or two as debuggers have actually begun to understand gevent and greenlets, but if I had an architectural wand to change one thing in the U cycle it would be breaking our eventlet dependency and moving to functionality from the standard library instead when we are python 3 only.
One thing I wanted recently was a way to start a thread which ran as a "sidecar" to nova-compute but didn't share anything with nova-compute apart from wanting to always be running when nova-compute was. Think the metadata server or a prometheus metrics exporter. I nearly wrote an implementation with privsep, but didn't get around to it. I feel like an unprivileged privsep parallel executor which never returns might be very close to what I wanted.

I won't be at the PTG, but would like to be kept informed of where you get to on this.

Michael
On Tue, Mar 19, 2019 at 6:57 PM Sean Mooney <smooney@redhat.com> wrote:
On Fri, 2019-03-15 at 17:06 -0500, Ben Nemec wrote:
On 3/15/19 4:41 PM, Monty Taylor wrote:
[snip]
In sdk, there are a few things that need to be done in the background (when uploading a very large swift object we're going to split it into chunks and upload concurrently). In that case we used concurrent.futures.ThreadPoolExecutor, which creates a pool of threads and allows you to submit jobs into it ... very similar to the using-pool-of-workers example above.
This is what we did in the oslo.privsep daemon too when we added parallel execution.
Yes, so on that point I wanted to talk to the oslo team about possibly breaking out the core of privsep into oslo.concurrency.

For a lot of the concurrency we need in nova we could create a privsep daemon context with no permissions and simply dispatch the execution of api queries to it, but that feels a bit strange. I was wondering if there was merit in either packaging up the core of the daemon without the actual privilege separation logic, or providing a very thin shim over concurrent.futures. Though by the looks of it there is a python 2.7 backport already: https://pypi.org/project/futures/ I had discounted using concurrent.futures with the ThreadPoolExecutor as I had thought it was python 3 only, but yes, it provides a nice api and workflow for this type of concurrent or parallel dispatch.
Is there a lot of gain to splitting the concurrency out of privsep? The lack of escalated permissions for a privsep'd function is a very small (one line IIRC) change.
I haven't looked at the API for parallel privsep to be honest. How does it handle returning results to callers?
On Wed, 2019-03-20 at 08:53 +1100, Michael Still wrote:

The privsep api is entirely blocking from a caller's perspective. Internally, privsep has a small asynchronous event loop that does parallel dispatch to a threadpoolexecutor. So while the privsep daemon itself will execute requests in parallel, callers of privsep rely on the eventlet context switch that happens when the privsep decorator writes to the unix socket, which causes the caller's green thread to yield until the result callback is executed by the privsep daemon, which writes the result back to the unix socket in either a reply or error message.

While privsep has many of the features you would want in a parallel sidecar that you could dispatch async tasks to, it does not have the api you would want, and looking at the implementation a little more closely it probably is not ideal for performance-sensitive code. The way we decorate module-level functions to annotate them as privsep entrypoints means we cannot have class/instance methods that can be executed via privsep, and while we do have the concept of channels and futures, we could do better by creating a dedicated reusable event loop that can be seeded with an executor at creation time. The python community has already done this, in fact, in the python 3 standard lib.

We could probably add the concurrent.futures executor interface to privsep if we wanted; it's only 3 functions after all: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures... But since there is a python 2.7 backport we may be better served just using it.

privsep is effectively just a ThreadPoolExecutor that layers on the privilege escalation/de-escalation logic, via either fork-and-drop-privileges or sudo + rootwrap/privsep-helper to gain privileges.
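(Editor's note: to illustrate how small that executor interface is, a toy executor that runs everything synchronously can be built by overriding just submit(); map() and shutdown() come from the base class. A sketch for illustration only, not anything from privsep itself.)

import concurrent.futures

class SynchronousExecutor(concurrent.futures.Executor):
    def submit(self, fn, *args, **kwargs):
        # Run inline and wrap the outcome in a Future, so callers see
        # the same interface as ThreadPoolExecutor.
        future = concurrent.futures.Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future

with SynchronousExecutor() as ex:
    print(list(ex.map(lambda x: x * x, [1, 2, 3])))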
It's a topic I would like to discuss with people at the ptg, but I would personally be interested in seeing if we can first remove, where reasonable, our explicit use of eventlet, and second explore transitioning to a more explicit concurrency model if others were open to it, either by delegating execution to an unprivileged privsep daemon or something like concurrent.futures with a thread pool, or an asyncio event loop using https://docs.python.org/3/library/asyncio-future.html. My distaste for eventlet has mellowed over the last year or two as debuggers have actually begun to understand gevent and greenlets, but if I had an architectural wand to change one thing in the U cycle it would be breaking our eventlet dependency and moving to functionality from the standard library instead when we are python 3 only.
One thing I wanted recently was a way to start a thread which ran as a "sidecar" to nova-compute but didn't share anything with nova-compute apart from wanting to always be running when nova-compute was. Think the metadata server or a prometheus metrics exporter. I nearly wrote an implementation with privsep, but didn't get around to it. I feel like an unprivileged privsep parallel executor which never returns might be very close to what I wanted.
I won't be at the PTG, but would like to be kept informed of where you get to on this.
Michael
On 3/19/19 11:39 PM, Sean Mooney wrote:
On Tue, Mar 19, 2019 at 6:57 PM Sean Mooney <smooney@redhat.com> wrote:
On Fri, 2019-03-15 at 17:06 -0500, Ben Nemec wrote:
On 3/15/19 4:41 PM, Monty Taylor wrote:
[snip]
In sdk, there are a few things that need to be done in the background (when uploading a very large swift object we're going to split it into chunks and upload concurrently). In that case we used concurrent.futures.ThreadPoolExecutor, which creates a pool of threads and allows you to submit jobs into it ... very similar to the using-pool-of-workers example above.
This is what we did in the oslo.privsep daemon too when we added parallel execution.
Yes, so on that point I wanted to talk to the oslo team about possibly breaking out the core of privsep into oslo.concurrency.

For a lot of the concurrency we need in nova we could create a privsep daemon context with no permissions and simply dispatch the execution of api queries to it, but that feels a bit strange. I was wondering if there was merit in either packaging up the core of the daemon without the actual privilege separation logic, or providing a very thin shim over concurrent.futures. Though by the looks of it there is a python 2.7 backport already: https://pypi.org/project/futures/ I had discounted using concurrent.futures with the ThreadPoolExecutor as I had thought it was python 3 only, but yes, it provides a nice api and workflow for this type of concurrent or parallel dispatch.
Is there a lot of gain to splitting the concurrency out of privsep? The lack of escalated permissions for a privsep'd function is a very small (one line IIRC) change.
I haven't looked at the API for parallel privsep to be honest. How does it handle returning results to callers?
On Wed, 2019-03-20 at 08:53 +1100, Michael Still wrote:

The privsep api is entirely blocking from a caller's perspective. Internally, privsep has a small asynchronous event loop that does parallel dispatch to a threadpoolexecutor. So while the privsep daemon itself will execute requests in parallel, callers of privsep rely on the eventlet context switch that happens when the privsep decorator writes to the unix socket, which causes the caller's green thread to yield until the result callback is executed by the privsep daemon, which writes the result back to the unix socket in either a reply or error message.

While privsep has many of the features you would want in a parallel sidecar that you could dispatch async tasks to, it does not have the api you would want, and looking at the implementation a little more closely it probably is not ideal for performance-sensitive code. The way we decorate module-level functions to annotate them as privsep entrypoints means we cannot have class/instance methods that can be executed via privsep, and while we do have the concept of channels and futures, we could do better by creating a dedicated reusable event loop that can be seeded with an executor at creation time. The python community has already done this, in fact, in the python 3 standard lib.

We could probably add the concurrent.futures executor interface to privsep if we wanted; it's only 3 functions after all: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures... But since there is a python 2.7 backport we may be better served just using it.
It's also worth pointing out the futurist library, which is part of oslo:

https://opendev.org/openstack/futurist

It's similar to (and uses) concurrent.futures - except it also has a bunch of different backend impls - so you can have tasks run in threads, or it can use eventlet, or processes - or work synchronously.
privsep is effectively just a ThreadPoolExecutor that layers on the privilege escalation/de-escalation logic, via either fork-and-drop-privileges or sudo + rootwrap/privsep-helper to gain privileges.
It's a topic I would like to discuss with people at the ptg, but I would personally be interested in seeing if we can first remove, where reasonable, our explicit use of eventlet, and second explore transitioning to a more explicit concurrency model if others were open to it, either by delegating execution to an unprivileged privsep daemon or something like concurrent.futures with a thread pool, or an asyncio event loop using https://docs.python.org/3/library/asyncio-future.html. My distaste for eventlet has mellowed over the last year or two as debuggers have actually begun to understand gevent and greenlets, but if I had an architectural wand to change one thing in the U cycle it would be breaking our eventlet dependency and moving to functionality from the standard library instead when we are python 3 only.
One thing I wanted recently was a way to start a thread which ran as a "sidecar" to nova-compute but didn't share anything with nova-compute apart from wanting to always be running when nova-compute was. Think the metadata server or a prometheus metrics exporter. I nearly wrote an implementation with privsep, but didn't get around to it. I feel like an unprivileged privsep parallel executor which never returns might be very close to what I wanted.
For this case, I don't think concurrent.futures or privsep or futurist is what you want. You just actually want to start a thread directly.

http://codesearch.openstack.org/?q=threading.Thread&i=nope&files=&repos=zuul,nodepool

has a bunch of examples that may be worth looking at. We've just been using the direct threading.Thread interface and have been very pleased with it at very high scale.

https://opendev.org/openstack-infra/nodepool/src/branch/master/nodepool/laun...

is where the main nodepool launcher thread starts the cleanup, deleter and stats worker threads.
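(Editor's note: a minimal sketch of that sidecar pattern; the worker body and names are made up for the example.)

import threading
import time

def _stats_worker():
    # Illustrative sidecar loop: would gather and export metrics forever.
    while True:
        time.sleep(10)

worker = threading.Thread(target=_stats_worker, name='stats-worker')
worker.daemon = True  # exit when the main process exits
worker.start()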
I won't be at the PTG, but would like to be kept informed of where you get to on this.
I'm sad that you won't be at the PTG. Your face will be missed.
On Wed, 2019-03-20 at 02:22 +0000, Monty Taylor wrote:
We could probably add the concurrent.futures executor interface to privsep if we wanted; it's only 3 functions after all: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures... But since there is a python 2.7 backport we may be better served just using it.
It's also worth pointing out the futurist library, which is part of oslo:
https://opendev.org/openstack/futurist
It's similar to (and uses) concurrent.futures - except it also has a bunch of different backend impls - so you can have tasks run in threads, or it can use eventlet, or processes - or work synchronously.
Thanks, I was looking for this. I was aware that oslo.messaging, to name but one lib, supported a configurable executor. I had expected to find those executors in oslo.concurrency, but looking at futurist it makes sense. I see the executors defined in https://opendev.org/openstack/futurist/src/branch/master/futurist/_futures.p... do indeed derive from the concurrent.futures executor abstract base class. The map function in the executor base class is implemented internally in terms of the submit function, so the futurist executors are compatible with the standard lib variants from python 3.

I wonder if it would make sense to create an optional dependency on privsep from futurist and add a privsep executor?

Pivoting back to melanie's original question regarding what alternatives we have for nova: we could change our explicit use of eventlet to the GreenThreadPoolExecutor https://opendev.org/openstack/futurist/src/branch/master/futurist/_futures.p... This would still require and use eventlet under the hood, so the performance would be the same as our current explicit use of the eventlet event loop, but it would allow us to change the event loop via config, the same way oslo.messaging allows changing from the eventlet executor to the threading executor.

If we removed all explicit use of eventlet this way, we could then start looking at our implicit use of eventlet, which is largely for db access, rest api calls to other services, and some virt-driver-specific calls.

Anyway, I had assumed that revisiting dropping eventlet from services would be a U conversation at the earliest, but if we swap to using the futurist executors for the multi-cell scatter-gather list operation we could stop monkeypatching under wsgi again and instead use the ThreadPoolExecutor or ProcessPoolExecutor in that case. When running the api without wsgi we can revert to the GreenThreadPoolExecutor.
privsep is effectively just a ThreadPoolExecutor that layers on the privilege escalation/de-escalation logic, via either fork-and-drop-privileges or sudo + rootwrap/privsep-helper to gain privileges.
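(Editor's note: for concreteness, the executor swap described above might look roughly like this. A hedged sketch: the function and parameter names are illustrative, and the config plumbing is omitted.)

import futurist

def scatter_gather(cells, fn, use_eventlet=True):
    # Pick the pool implementation; both expose the same Executor API,
    # so only this one line needs to change per deployment mode.
    executor_cls = (futurist.GreenThreadPoolExecutor if use_eventlet
                    else futurist.ThreadPoolExecutor)
    with executor_cls(max_workers=max(1, len(cells))) as executor:
        futures = {cell: executor.submit(fn, cell) for cell in cells}
        return {cell: future.result() for cell, future in futures.items()}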
On Wed, 20 Mar 2019 07:39:20 +0000, Sean Mooney <smooney@redhat.com> wrote:
On Wed, 2019-03-20 at 02:22 +0000, Monty Taylor wrote:
We could probably add the concurrent.futures executor interface to privsep if we wanted; it's only 3 functions after all: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures... But since there is a python 2.7 backport we may be better served just using it.
It's also worth pointing out the futurist library, which is part of oslo:
https://opendev.org/openstack/futurist
It's similar to (and uses) concurrent.futures - except it also has a bunch of different backend impls - so you can have tasks run in threads, or it can use eventlet, or processes - or work synchronously.
Thanks, I was looking for this. I was aware that oslo.messaging, to name but one lib, supported a configurable executor.

I had expected to find those executors in oslo.concurrency, but looking at futurist it makes sense.

I see the executors defined in https://opendev.org/openstack/futurist/src/branch/master/futurist/_futures.p... do indeed derive from the concurrent.futures executor abstract base class.

The map function in the executor base class is implemented internally in terms of the submit function, so the futurist executors are compatible with the standard lib variants from python 3.

I wonder if it would make sense to create an optional dependency on privsep from futurist and add a privsep executor?

Pivoting back to melanie's original question regarding what alternatives we have for nova: we could change our explicit use of eventlet to the GreenThreadPoolExecutor https://opendev.org/openstack/futurist/src/branch/master/futurist/_futures.p... This would still require and use eventlet under the hood, so the performance would be the same as our current explicit use of the eventlet event loop, but it would allow us to change the event loop via config, the same way oslo.messaging allows changing from the eventlet executor to the threading executor.

If we removed all explicit use of eventlet this way, we could then start looking at our implicit use of eventlet, which is largely for db access, rest api calls to other services, and some virt-driver-specific calls.

Anyway, I had assumed that revisiting dropping eventlet from services would be a U conversation at the earliest, but if we swap to using the futurist executors for the multi-cell scatter-gather list operation we could stop monkeypatching under wsgi again and instead use the ThreadPoolExecutor or ProcessPoolExecutor in that case. When running the api without wsgi we can revert to the GreenThreadPoolExecutor.
Thanks for all this info. I've taken a hack at replacing direct eventlet use with futurist.GreenThreadPoolExecutor in scatter_gather_cells as a first step: https://review.openstack.org/650172 -melanie
privsep is effectively just a ThreadPoolExecutor that layers on the privilege escalation/de-escalation logic, via either fork-and-drop-privileges or sudo + rootwrap/privsep-helper to gain privileges.
On Fri, 2019-03-15 at 18:53 +0000, Chris Dent wrote:
On Fri, 15 Mar 2019, Ben Nemec wrote:
According to [0], the monkey-patching wasn't done in nova/__init__.py because it was supposed to be able to run without eventlet under WSGI. If that's not the case, then it's probably fine to do it there.
0: http://specs.openstack.org/openstack/openstack-specs/specs/eventlet-best-pra...
"Monkey patching should also be done in a way that allows services to run without it, such as when an API service runs under Apache. This is the reason for Nova not simply monkey patching in nova/__init__.py."
Yeah, I think several people have run through the eventlet initialization at various times, including me, with various results, none of them quite good enough.
The mod_wsgi/uwsgi side of things strived to be eventlet free as it makes for weirdness, and at some point I did some work to be sure it never showed up in placement [1] while it was still in nova. A side effect of that work was that it also didn't need to show up in the nova-api unless you used the cmd line script version of it. At the same time I also explored not importing the world, and was able to get some improvements (mostly by moving things out of __init__.py in packages that had deeply nested members) but not as much as I would have liked.

However, we later (as mentioned elsewhere in the thread) made getting cell mappings parallelized, bringing back the need for eventlet, it seems.

For what it's worth, I think reintroducing eventlet was unintentional. We did intentionally make listing instances in cells use eventlet, but I don't recall us explicitly talking about the fact that that would mean we could no longer run nova-api without it.

It's rather late in the cycle to change the cell parallelization code, but I think we should revisit this in Train.
TL;DR I want to move monkey patching to nova/__init__.py because, for now at least, we always require it anyway. When we eventually remove the requirement for wsgi services to monkey patch we can move monkey patching back to nova/cmd/__init__.py.
This sounds fine to me, but it's been a loooong time since I looked at this and I'm not familiar with some of the more recent changes you talked about, so take my opinion with a grain of salt.
This does seem like the expedient solution (assuming it works), but if we continue to use eventlet anywhere in the api service it will continue to introduce difficulties and inflexibility with deployment choices as it will interact poorly with at least some of the web-servers that could be chosen to run the nova API.
[1] Recently we managed to get placement to a state where not only does it not use eventlet, it doesn't even use anything that has it as an unused dependency. This made me very happy.
Yep, I would like to vote for restoring the nova api's ability to run without eventlet too. I know it's unrealistic to transition away from eventlet entirely while we still support python 2, but I would like to avoid using eventlet where it is not needed. We currently have a test-only dependency on eventlet that I hope to remove from os-vif in Train, but one of my personal goals in U, or if we proceed with a unified agent as a community goal, would be to remove the use of eventlet entirely.

This has come up in the past so I'm not going to rathole on it, but when we move to python 3 only I hope we can reassess as a community whether we can move to python 3 native asyncio and remove the need for monkey patching entirely. The current design of our agent code makes that challenging, but the api is far cleaner and should not require eventlet, at least in principle.
On 3/15/19 8:17 PM, Sean Mooney wrote:
I know it's unrealistic to transition away from eventlet entirely while we still support python 2
Oh! One more good reason to get rid of Python 2. Thanks. :)
but I would like to avoid using eventlet where it is not needed. We currently have a test-only dependency on eventlet that I hope to remove from os-vif in Train, but one of my personal goals in U, or if we proceed with a unified agent as a community goal, would be to remove the use of eventlet entirely.
Please!
This has come up in the past so I'm not going to rathole on it, but when we move to python 3 only I hope we can reassess as a community whether we can move to python 3 native asyncio and remove the need for monkey patching entirely. The current design of our agent code makes that challenging, but the api is far cleaner and should not require eventlet, at least in principle.
When are we planning to get rid of Python 2? My understanding is that pretty much all downstream distros have flipped the switch, no?

Cheers,

Thomas Goirand (zigo)
On Sun, Mar 17, 2019 at 11:05:05PM +0100, Thomas Goirand wrote:
When are we planning to get rid of Python 2? My understanding is that pretty much all downstream distros have flipped the switch, no?
The 'U' release is the first release where existing projects can switch to py3 only (i.e. remove py2 support). New projects have been able to be py3 only since we opened for Stein development.

Yours Tony.
On 17/03/19 6:50 PM, Tony Breeds wrote:
On Sun, Mar 17, 2019 at 11:05:05PM +0100, Thomas Goirand wrote:
When are we planning to get rid of Python 2? My understanding is that pretty much all downstream distros have flipped the switch, no?
The 'U' release is the first release where existing projects can switch to py3 only (i.e. remove py2 support). New projects have been able to be py3 only since we opened for Stein development.
And for the record, this is documented here: https://governance.openstack.org/tc/resolutions/20180529-python2-deprecation...
On Fri, 15 Mar 2019 at 19:00, Chris Dent <cdent+os@anticdent.org> wrote:
On Fri, 15 Mar 2019, Ben Nemec wrote:
According to [0], the monkey-patching wasn't done in nova/__init__.py because it was supposed to be able to run without eventlet under WSGI. If that's not the case, then it's probably fine to do it there.
0: http://specs.openstack.org/openstack/openstack-specs/specs/eventlet-best-pra...
"Monkey patching should also be done in a way that allows services to run without it, such as when an API service runs under Apache. This is the reason for Nova not simply monkey patching in nova/__init__.py."
Yeah, I think several people have run through the eventlet initialization at various times, including me, with various results, none of them quite good enough.
The mod_wsgi/uwsgi side of things strived to be eventlet free as it makes for weirdness, and at some point I did some work to be sure it never showed up in placement [1] while it was still in nova. A side effect of that work was that it also didn't need to show up in the nova-api unless you used the cmd line script version of it. At the same time I also explored not importing the world, and was able to get some improvements (mostly by moving things out of __init__.py in packages that had deeply nested members) but not as much as I would have liked.

However, we later (as mentioned elsewhere in the thread) made getting cell mappings parallelized, bringing back the need for eventlet, it seems.
TL;DR I want to move monkey patching to nova/__init__.py because, for now at least, we always require it anyway. When we eventually remove the requirement for wsgi services to monkey patch we can move monkey patching back to nova/cmd/__init__.py.
This sounds fine to me, but it's been a loooong time since I looked at this and I'm not familiar with some of the more recent changes you talked about, so take my opinion with a grain of salt.
This does seem like the expedient solution (assuming it works), but if we continue to use eventlet anywhere in the api service it will continue to introduce difficulties and inflexibility with deployment choices as it will interact poorly with at least some of the web-servers that could be chosen to run the nova API.
Thanks for the feedback, folks.

I appreciate the desire to drop eventlet entirely (elsewhere in this thread), but as has also been noted that's not a short-term option. Probably feasible for U, though, and let's keep it in mind. On that note, my preference for multi-threading in Nova has always been to use Python's own apis except when that's not possible. Eventlet monkey patches these to do its own thing when it's in use, and when not using eventlet this obviously works fine.

I think I'm hearing that moving monkey patching to the top level is the right solution for now. I'm going to go ahead and knock that up.

Matt

--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)
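(Editor's note: to make that point concrete, code written against Python's own threading api behaves the same whether or not eventlet has monkey patched it. A tiny illustration, with a made-up worker function.)

import eventlet
eventlet.monkey_patch()

# After monkey_patch(), threading.Thread transparently creates green
# threads; without it, the same code uses real OS threads.
import threading

def worker():
    print('worker ran in %s' % threading.current_thread().name)

t = threading.Thread(target=worker)
t.start()
t.join()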
The latest patch is here, and finally passes CI: https://review.openstack.org/#/c/626952/ Note that it's been through a bunch of iterations as I've played whack-a-bug, but I finally feel pretty good about it.

A point of clarification: I believe this bug *is* a regression from Rocky. I wasn't previously clear on that myself, but looking at the history I have convinced myself, at least, that this is the case. In Rocky, wsgi apps were not monkey patched, and they did not require it.

https://review.openstack.org/#/c/592698/ (Batch results per cell when doing cross-cell listing) landed on 25th August 2018, which regressed wsgi apps by introducing an eventlet dependency in nova-api without adding monkey patching to wsgi apps.

https://review.openstack.org/#/c/592285/ (Make monkey patch work in uWSGI mode) added monkey patching to wsgi apps, but added it late in the import order such that it still fails for some library version combinations.

My patch moves the monkey patching of wsgi apps much earlier in the import order, which finally fixes the original regression introduced by cross-cell listing.

The whole eventlet monkey patching situation is still a hack, and this change is really just rearranging the sticky tape. However, a fix is necessary in Stein which was not required for Rocky. For Train/U it would be great to start ripping it out.

Matt

On Mon, 18 Mar 2019 at 12:04, Matthew Booth <mbooth@redhat.com> wrote:
On Fri, 15 Mar 2019 at 19:00, Chris Dent <cdent+os@anticdent.org> wrote:
On Fri, 15 Mar 2019, Ben Nemec wrote:
According to [0], the monkey-patching wasn't done in nova/__init__.py because it was supposed to be able to run without eventlet under WSGI. If that's not the case, then it's probably fine to do it there.
0: http://specs.openstack.org/openstack/openstack-specs/specs/eventlet-best-pra...
"Monkey patching should also be done in a way that allows services to run without it, such as when an API service runs under Apache. This is the reason for Nova not simply monkey patching in nova/__init__.py."
Yeah, I think several people have run through the eventlet initialization at various times, including me, with various results, none of them quite good enough.
The mod_wsgi/uwsgi side of things strived to be eventlet free as it makes for weirdness, and at some point I did some work to be sure it never showed up in placement [1] while it was still in nova. A side effect of that work was that it also didn't need to show up in the nova-api unless you used the cmd line script version of it. At the same time I also explored not importing the world, and was able to get some improvements (mostly by moving things out of __init__.py in packages that had deeply nested members) but not as much as I would have liked.

However, we later (as mentioned elsewhere in the thread) made getting cell mappings parallelized, bringing back the need for eventlet, it seems.
TL;DR I want to move monkey patching to nova/__init__.py because, for now at least, we always require it anyway. When we eventually remove the requirement for wsgi services to monkey patch we can move monkey patching back to nova/cmd/__init__.py.
This sounds fine to me, but it's been a loooong time since I looked at this and I'm not familiar with some of the more recent changes you talked about, so take my opinion with a grain of salt.
This does seem like the expedient solution (assuming it works), but if we continue to use eventlet anywhere in the api service it will continue to introduce difficulties and inflexibility with deployment choices as it will interact poorly with at least some of the web-servers that could be chosen to run the nova API.
Thanks for the feedback, folks.
I appreciate the desire to drop eventlet entirely (elsewhere in this thread), but as has also been noted that's not a short-term option. Probably feasible for U, though, and let's keep it in mind. On that note, my preference for multi-threading in Nova has always been to use Python's own apis except when that's not possible. Eventlet monkey patches these to do its own thing when it's in use, and when not using eventlet this obviously works fine.
I think I'm hearing that moving monkey patching to the top level is the right solution for now. I'm going to go ahead and knock that up.
Matt

--
Matthew Booth
Red Hat OpenStack Engineer, Compute DFG
Phone: +442070094448 (UK)
On 3/15/19 6:47 PM, Matthew Booth wrote:
I believe we have 3 options:
1. Don't import the world in nova/api/openstack/__init__.py
2. Change the wsgi entry point to, e.g., nova.wsgi_app
3. Do monkey patching in nova/__init__.py
How about:

4. Get rid of eventlet?

Eventlet has been the source of troubles in so many cases. The monkey patching thing is horrible. And I'm not even scratching the surface of the fact that downstream distro maintainers like myself are hitting bugs earlier than the gate does (example: python 3.7) and are left desperately alone, praying that someone will fix the mess, somehow. It has indeed been pretty depressing to see that it took 4 years to have the Python 3 SSL handshake issue fixed. I'm not even sure that, at this time, it really is fixed.

Also, for multiple years now so many people within the community have agreed that Eventlet was a very bad choice, and that we should get rid of it. But nobody is taking any action (or did I miss some attempts?). :/

Cheers,

Thomas Goirand (zigo)
participants (10)
- Ben Nemec
- Chris Dent
- Matthew Booth
- melanie witt
- Michael Still
- Monty Taylor
- Sean Mooney
- Thomas Goirand
- Tony Breeds
- Zane Bitter