On Thu, Apr 25, 2019 at 11:19 AM <iain.macdonnell@oracle.com> wrote:


On 4/25/19 1:05 AM, Damien Ciabrini wrote:
>
>
> On Tue, Apr 23, 2019 at 1:18 AM Alex Schultz <aschultz@redhat.com> wrote:
>
>     On Mon, Apr 22, 2019 at 12:25 PM Ben Nemec <openstack@nemebean.com> wrote:
>      >
>      >
>      >
>      > On 4/22/19 12:53 PM, Alex Schultz wrote:
>      > > On Mon, Apr 22, 2019 at 11:28 AM Ben Nemec <openstack@nemebean.com> wrote:
>      > >>
>      > >>
>      > >>
>      > >> On 4/20/19 1:38 AM, Michele Baldessari wrote:
>      > >>> On Fri, Apr 19, 2019 at 03:20:44PM -0700, iain.macdonnell@oracle.com wrote:
>      > >>>>
>      > >>>> Today I discovered that this problem appears to be caused by
>      > >>>> eventlet monkey-patching. I've created a bug for it:
>      > >>>>
>      > >>>> https://bugs.launchpad.net/nova/+bug/1825584
>      > >>>
>      > >>> Hi,
>      > >>>
>      > >>> just for completeness, we see this very same issue also with
>      > >>> mistral (actually it was the first service where we noticed the
>      > >>> missed heartbeats). IIRC Alex Schultz mentioned seeing it in
>      > >>> ironic as well, although I have not personally observed it there yet.
>      > >>
>      > >> Is Mistral also mixing eventlet monkeypatching and WSGI?
>      > >>
>      > >
>      > > Looks like there is monkey patching, but we noticed it with the
>      > > engine/executor, so it's likely not just WSGI. I think I also saw it
>      > > in the ironic-conductor, though I'd have to try it out again. I'll
>      > > spin up an undercloud today and see if I can get a more complete list
>      > > of affected services. It was pretty easy to reproduce.
>      >
>      > Okay, I asked because if there's no WSGI/Eventlet combination then
>      > this may be different from the Nova issue that prompted this thread.
>      > It sounds like that was being caused by a bad interaction between
>      > WSGI and some Eventlet timers. If there's no WSGI involved then I
>      > wouldn't expect that to happen.
>      >
>      > I guess we'll see what further investigation turns up, but based on
>      > the preliminary information there may be two bugs here.
>      >
>
>     So I wasn't able to reproduce the ironic issues yet. But it's the
>     mistral executor and nova-api which exhibit the issue on the
>     undercloud.
>
>     mistral/executor.log:2019-04-22 22:40:58.321 7 ERROR
>     oslo.messaging._drivers.impl_rabbit [-]
>     [b7b4bc40-767c-4de1-b77b-6a5822f6beed] AMQP server on
>     undercloud-0.ctlplane.localdomain:5672 is unreachable: [Errno 104]
>     Connection reset by peer. Trying again in 1 seconds.:
>     ConnectionResetError: [Errno 104] Connection reset by peer
>
>
>     nova/nova-api.log:2019-04-22 22:38:11.530 19 ERROR
>     oslo.messaging._drivers.impl_rabbit
>     [req-d7767aed-e32d-43db-96a8-c0509bfb1cfe
>     9ac89090d2d24949b9a1e01b1afb14cc 7becac88cbae4b3b962ecccaf536effe -
>     default default] [c0f3fe7f-db89-42c6-95bd-f367a4fbf680] AMQP server on
>     undercloud-0.ctlplane.localdomain:5672 is unreachable: Server
>     unexpectedly closed connection. Trying again in 1 seconds.: OSError:
>     Server unexpectedly closed connection
>
>     The errors being thrown are different, so perhaps these are two
>     different problems.
>
>
> Correct, I think our original issue with erratic AMQP heartbeats and mod_wsgi
> was due to a change in how we run healthchecks in Stein in TripleO-deployed
> environments, so it seems different from what Iain originally experienced...
>
> For the record, up to Rocky we used to run healthcheck scripts every
> 30 seconds, which guaranteed that eventlet would wake up and send an AMQP
> heartbeat packet whenever a service had seen no AMQP traffic in the previous
> 15s. It also guaranteed that any incoming AMQP heartbeat packet from
> rabbitmq would be processed within at most 30s.
>
> In Stein, our healthchecks are triggered via systemd timers instead, and the
> current timer interval is too long to guarantee that mod_wsgi will always
> wake up in time to send/receive AMQP heartbeats to/from rabbitmq when
> there's no other traffic.
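
To make the timing above concrete, here is a rough sketch of where the 15s/30s
figures come from, assuming the oslo.messaging defaults
(heartbeat_timeout_threshold=60, heartbeat_rate=2); actual deployments may
override these:

# Rough arithmetic behind the 15s/30s figures, assuming oslo.messaging
# defaults. This mirrors (and simplifies) the option handling in impl_rabbit.
heartbeat_timeout_threshold = 60.0  # rabbitmq declares the peer dead after ~60s of silence
heartbeat_rate = 2.0

# How often the heartbeat thread wants to wake up and run its loop:
heartbeat_wait_timeout = heartbeat_timeout_threshold / heartbeat_rate / 2.0   # 15s

# Heartbeat frames should be sent roughly every threshold/2 seconds:
send_deadline = heartbeat_timeout_threshold / 2.0                             # 30s

# So an external wake-up at least every 30s keeps both deadlines satisfied,
# while a timer firing less often than that leaves gaps where the greenthread
# never gets scheduled and heartbeats are neither sent nor read.
print(heartbeat_wait_timeout, send_deadline)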

If I'm reading this right, it sounds like the periodic healthcheck is
working around the underlying issue of the heartbeats not happening by
themselves (due to eventlet monkey-patching somehow interfering with
threading). The whole point of the heartbeats is to continually maintain
the connection while there's no real traffic.

     ~iain



I would agree with Iain here: it sounds as if something in mod_wsgi (or elsewhere) is blocking in the OS, outside of eventlet's purview.
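
To illustrate that hypothesis, here's a minimal standalone sketch (a toy
example of mine, not mod_wsgi or oslo.messaging code): once the native thread
blocks in an un-patched OS call, the green heartbeat thread stops running
until control returns to the eventlet hub.

import eventlet
eventlet.monkey_patch()

import time                                         # patched: time.sleep is green
blocking_time = eventlet.patcher.original('time')   # un-patched: really blocks the thread


def heartbeat():
    while True:
        print("heartbeat at %.1f" % time.monotonic())
        time.sleep(1)              # green sleep: yields back to the hub


eventlet.spawn(heartbeat)
eventlet.sleep(0)                  # let the heartbeat greenthread start

# Simulate code blocking at the OS level, outside eventlet's control
# (e.g. a C extension or the web server waiting on a real socket):
blocking_time.sleep(10)            # no heartbeats are printed during these 10s

eventlet.sleep(3)                  # heartbeats resume once the hub runs again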

FWIW, here's the oslo.messaging thread responsible for sending heartbeats from the RPC client:

https://opendev.org/openstack/oslo.messaging/src/branch/stable/stein/oslo_messaging/_drivers/impl_rabbit.py#L897

Can you verify that the event.wait() call at the bottom of the loop is not waking up as per the passed-in _heartbeat_wait_timeout?
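
If it helps, one low-tech way to check that might be to temporarily wrap the
wait with some timing and log when it oversleeps, along these lines
(illustrative only; the helper and names below are mine, not the actual
oslo.messaging internals):

import logging
import time

LOG = logging.getLogger(__name__)


def timed_wait(event, timeout):
    """Call event.wait(timeout) and warn if it blocked far longer than asked."""
    start = time.monotonic()
    event.wait(timeout)
    elapsed = time.monotonic() - start
    if elapsed > timeout * 1.5:
        LOG.warning("heartbeat wait overslept: %.1fs instead of ~%.1fs",
                    elapsed, timeout)
    return elapsed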
Thanks,

--
Ken Giusti  (kgiusti@gmail.com)