On 5/4/19 4:14 PM, Damien Ciabrini wrote:
On Fri, May 3, 2019 at 7:59 PM Michele Baldessari <michele@acksyn.org> wrote:
On Mon, Apr 22, 2019 at 01:21:03PM -0500, Ben Nemec wrote:
>
>
> On 4/22/19 12:53 PM, Alex Schultz wrote:
> > On Mon, Apr 22, 2019 at 11:28 AM Ben Nemec <openstack@nemebean.com> wrote:
> > >
> > >
> > >
> > > On 4/20/19 1:38 AM, Michele Baldessari wrote:
> > > > On Fri, Apr 19, 2019 at 03:20:44PM -0700, iain.macdonnell@oracle.com wrote:
> > > > >
> > > > > Today I discovered that this problem appears to be caused by eventlet
> > > > > monkey-patching. I've created a bug for it:
> > > > >
> > > > > https://bugs.launchpad.net/nova/+bug/1825584
> > > >
> > > > Hi,
> > > >
> > > > just for completeness we see this very same issue also with
> > > > mistral (actually it was the first service where we noticed the missed
> > > > heartbeats). iirc Alex Schultz mentioned seeing it in ironic as well,
> > > > although I have not personally observed it there yet.
> > >
> > > Is Mistral also mixing eventlet monkeypatching and WSGI?
> > >
> >
> > Looks like there is monkey patching, however we noticed it with the
> > engine/executor. So it's likely not just wsgi. I think I also saw it
> > in the ironic-conductor, though I'd have to try it out again. I'll
> > spin up an undercloud today and see if I can get a more complete list
> > of affected services. It was pretty easy to reproduce.
>
> Okay, I asked because if there's no WSGI/Eventlet combination then this may
> be different from the Nova issue that prompted this thread. It sounds like
> that was being caused by a bad interaction between WSGI and some Eventlet
> timers. If there's no WSGI involved then I wouldn't expect that to happen.
>
> I guess we'll see what further investigation turns up, but based on the
> preliminary information there may be two bugs here.
So, just to get some closure on the error we saw around the mistral executor and TripleO with Python 3: it was caused by an Ansible action that called subprocess, which has a different implementation in Python 3, so the monkey-patching needed to be adapted.
Review which fixes it for us is here: https://review.opendev.org/#/c/656901/
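To give a rough idea of what's involved (this is just a sketch, not the actual patch in the review above, and it assumes an eventlet version whose monkey_patch() accepts a subprocess flag):

    # Hypothetical sketch: make sure eventlet also green-patches the
    # Python 3 subprocess implementation before anything shells out.
    import sys

    import eventlet

    if sys.version_info[0] >= 3:
        # Python 3's subprocess has its own waiting logic, so it needs
        # to be patched explicitly alongside os/thread/time.
        eventlet.monkey_patch(os=True, select=True, socket=True,
                              thread=True, time=True, subprocess=True)
    else:
        eventlet.monkey_patch()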
Damien and I think the nova_api/eventlet/mod_wsgi issue has a separate root cause (although we haven't spent much time on that one yet).
Right, after further investigation, it appears that the problem we saw under mod_wsgi was due to monkey patching, as Iain originally reported. It has nothing to do with our work on healthchecks.
It turns out that running the AMQP heartbeat thread under mod_wsgi doesn't work when the threading library is monkey-patched, because the thread waits on a data structure [1] that has been monkey-patched [2], which makes it yield its execution instead of sleeping for 15s.
Because mod_wsgi stops executing its embedded interpreter when it's idle, the AMQP heartbeat thread can't be resumed until there's a message to be processed in the mod_wsgi queue, which wakes the Python interpreter and lets eventlet resume the thread.
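You can see the mechanism in isolation with a small self-contained demo (our own illustration, not oslo.messaging code): once threading is monkey-patched, Event.wait() parks the thread on the eventlet hub instead of blocking in the OS, so its timeout can only fire while something is driving the hub.

    import eventlet
    eventlet.monkey_patch()  # threading.Event becomes a green Event

    import threading
    import time

    stop = threading.Event()

    def heartbeat():
        # With monkey-patching, wait() yields to the eventlet hub; the
        # 2s timeout is an eventlet timer, not an OS-level sleep.
        while not stop.wait(timeout=2.0):
            print("AMQP heartbeat sent")

    t = threading.Thread(target=heartbeat)  # actually a greenthread
    t.start()

    # The patched sleep() keeps yielding to the hub, so the timer fires
    # and heartbeats print. Under an idle mod_wsgi interpreter nothing
    # drives the hub, so the timer would never fire.
    time.sleep(7)
    stop.set()
    t.join()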
Disabling monkey-patching in nova_api makes the scheduling issue go away.
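Concretely, that could be as simple as gating the patching (a purely hypothetical knob for illustration; this is not how nova actually decides whether to patch):

    # Hypothetical guard, not nova's real mechanism: skip eventlet
    # monkey-patching when the service runs embedded in mod_wsgi.
    import os

    if os.environ.get('RUNNING_UNDER_MOD_WSGI') != '1':
        import eventlet
        eventlet.monkey_patch()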
This sounds like the right long-term solution, but it seems unlikely to be backportable to the existing releases. As I understand it some nova-api functionality has an actual dependency on monkey-patching. Is there a workaround? Maybe periodically poking the API to wake up the wsgi interpreter?
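As a sketch of that idea (hypothetical endpoint and interval; anything that wakes the interpreter more often than the 15s heartbeat interval should do):

    # Hypothetical workaround: poke the API more often than the 15s
    # heartbeat interval so mod_wsgi keeps resuming the interpreter.
    import time
    import urllib.request

    NOVA_API = "http://localhost:8774/"  # assumed local nova-api endpoint

    while True:
        try:
            # Any request will do; we only need the interpreter to run.
            urllib.request.urlopen(NOVA_API, timeout=5).read()
        except Exception:
            pass  # failures don't matter for this purpose
        time.sleep(10)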
Note: other services like heat-api don't use monkey patching and aren't affected, so this seems to confirm that monkey-patching shouldn't happen in nova_api running under mod_wsgi in the first place.
[1] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_driv...
[2] https://github.com/openstack/oslo.utils/blob/master/oslo_utils/eventletutils...