[oslo][oslo-messaging][nova] Stein nova-api AMQP issue running under uWSGI

iain.macdonnell at oracle.com iain.macdonnell at oracle.com
Thu Apr 25 15:10:14 UTC 2019



On 4/25/19 1:05 AM, Damien Ciabrini wrote:
> 
> 
> On Tue, Apr 23, 2019 at 1:18 AM Alex Schultz <aschultz at redhat.com 
> <mailto:aschultz at redhat.com>> wrote:
> 
>     On Mon, Apr 22, 2019 at 12:25 PM Ben Nemec <openstack at nemebean.com
>     <mailto:openstack at nemebean.com>> wrote:
>      >
>      >
>      >
>      > On 4/22/19 12:53 PM, Alex Schultz wrote:
>      > > On Mon, Apr 22, 2019 at 11:28 AM Ben Nemec
>     <openstack at nemebean.com <mailto:openstack at nemebean.com>> wrote:
>      > >>
>      > >>
>      > >>
>      > >> On 4/20/19 1:38 AM, Michele Baldessari wrote:
>      > >>> On Fri, Apr 19, 2019 at 03:20:44PM -0700,
>     iain.macdonnell at oracle.com <mailto:iain.macdonnell at oracle.com> wrote:
>      > >>>>
>      > >>>> Today I discovered that this problem appears to be caused by
>     eventlet
>      > >>>> monkey-patching. I've created a bug for it:
>      > >>>>
>      > >>>> https://bugs.launchpad.net/nova/+bug/1825584
>     <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.launchpad.net_nova_-2Bbug_1825584&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=RxYkIjeLZPK2frXV_wEUCq8d3wvUIvDPimUcunMwbMs&m=fpephoqJ2hzc-jFPr6Rtupc2U02HBjBRd-_Lq66zQBk&s=Vb1Yw7ZkrglH0AMBZSHNfuilS5gBwh9yF2o2trkXYyM&e=>
>      > >>>
>      > >>> Hi,
>      > >>>
>      > >>> just for completeness we see this very same issue also with
>      > >>> mistral (actually it was the first service where we noticed
>     the missed
>      > >>> heartbeats). iirc Alex Schultz mentioned seeing it in ironic
>     as well,
>      > >>> although I have not personally observed it there yet.
>      > >>
>      > >> Is Mistral also mixing eventlet monkeypatching and WSGI?
>      > >>
>      > >
>      > > Looks like there is monkey patching, however we noticed it with the
>      > > engine/executor. So it's likely not just wsgi.  I think I also
>     saw it
>      > > in the ironic-conductor, though I'd have to try it out again.  I'll
>      > > spin up an undercloud today and see if I can get a more
>     complete list
>      > > of affected services. It was pretty easy to reproduce.
>      >
>      > Okay, I asked because if there's no WSGI/Eventlet combination
>     then this
>      > may be different from the Nova issue that prompted this thread. It
>      > sounds like that was being caused by a bad interaction between
>     WSGI and
>      > some Eventlet timers. If there's no WSGI involved then I wouldn't
>     expect
>      > that to happen.
>      >
>      > I guess we'll see what further investigation turns up, but based
>     on the
>      > preliminary information there may be two bugs here.
>      >
> 
>     So I wasn't able to reproduce the ironic issues yet. But it's the
>     mistral executor and nova-api which exhibit the issue on the
>     undercloud.
> 
>     mistral/executor.log:2019-04-22 22:40:58.321 7 ERROR
>     oslo.messaging._drivers.impl_rabbit [-]
>     [b7b4bc40-767c-4de1-b77b-6a5822f6beed] AMQP server on
>     undercloud-0.ctlplane.localdomain:5672 is unreachable: [Errno 104]
>     Connection reset by peer. Trying again in 1 seconds.:
>     ConnectionResetError: [Errno 104] Connection reset by peer
> 
> 
>     nova/nova-api.log:2019-04-22 22:38:11.530 19 ERROR
>     oslo.messaging._drivers.impl_rabbit
>     [req-d7767aed-e32d-43db-96a8-c0509bfb1cfe
>     9ac89090d2d24949b9a1e01b1afb14cc 7becac88cbae4b3b962ecccaf536effe -
>     default default] [c0f3fe7f-db89-42c6-95bd-f367a4fbf680] AMQP server on
>     undercloud-0.ctlplane.localdomain:5672 is unreachable: Server
>     unexpectedly closed connection. Trying again in 1 seconds.: OSError:
>     Server unexpectedly closed connection
> 
>     The errors being thrown are different perhaps it is two different
>     problems.
> 
> 
> Correct, I think our original issue with erratic AMQP hearbeats and mod_wsgi
> was due to a change in how we run healthcheck in Stein in TripleO-deployed
> environments, so different to what Iain originally experienced it seems...
> 
> For the record, up to Rocky, we used to run healthcheck scripts
> every 30 seconds, which guarantees that eventlet will wake up and
> send an AMQP heartbeat packet if a service had no AMQP traffic in the last
> 15s. It also guarantees that any incoming AMQP heartbeat packet from
> rabbitmq will be processed in at most 30s.
> 
> In Stein, our healthchecks are now triggered via systemd timers, and the
> current time setting is too high to guarantee that mod_wsgi will always
> wake up on time to send/receive AMQP heartbeats to/from rabbitmq
> when there's no traffic.

If I'm reading this right, it sounds like the periodic healthcheck is 
working around the underlying issue of the heartbeats not happening by 
themselves (due to eventlet monkey-patching somehow interfering with 
threading). The whole point of the heartbeats is to continually maintain 
the connection while there's no real traffic.

     ~iain




More information about the openstack-discuss mailing list