On Tue, Apr 23, 2019 at 1:18 AM Alex Schultz <aschultz@redhat.com> wrote:
On Mon, Apr 22, 2019 at 12:25 PM Ben Nemec <openstack@nemebean.com> wrote:
>
>
>
> On 4/22/19 12:53 PM, Alex Schultz wrote:
> > On Mon, Apr 22, 2019 at 11:28 AM Ben Nemec <openstack@nemebean.com> wrote:
> >>
> >>
> >>
> >> On 4/20/19 1:38 AM, Michele Baldessari wrote:
> >>> On Fri, Apr 19, 2019 at 03:20:44PM -0700, iain.macdonnell@oracle.com wrote:
> >>>>
> >>>> Today I discovered that this problem appears to be caused by eventlet
> >>>> monkey-patching. I've created a bug for it:
> >>>>
> >>>> https://bugs.launchpad.net/nova/+bug/1825584
> >>>
> >>> Hi,
> >>>
> >>> just for completeness we see this very same issue also with
> >>> mistral (actually it was the first service where we noticed the missed
> >>> heartbeats). iirc Alex Schultz mentioned seeing it in ironic as well,
> >>> although I have not personally observed it there yet.
> >>
> >> Is Mistral also mixing eventlet monkeypatching and WSGI?
> >>
> >
> > Looks like there is monkey patching, however we noticed it with the
> > engine/executor. So it's likely not just wsgi.  I think I also saw it
> > in the ironic-conductor, though I'd have to try it out again.  I'll
> > spin up an undercloud today and see if I can get a more complete list
> > of affected services. It was pretty easy to reproduce.
>
> Okay, I asked because if there's no WSGI/Eventlet combination then this
> may be different from the Nova issue that prompted this thread. It
> sounds like that was being caused by a bad interaction between WSGI and
> some Eventlet timers. If there's no WSGI involved then I wouldn't expect
> that to happen.
>
> I guess we'll see what further investigation turns up, but based on the
> preliminary information there may be two bugs here.
>

So I wasn't able to reproduce the ironic issue yet, but the mistral
executor and nova-api both exhibit the issue on the undercloud.

mistral/executor.log:2019-04-22 22:40:58.321 7 ERROR
oslo.messaging._drivers.impl_rabbit [-]
[b7b4bc40-767c-4de1-b77b-6a5822f6beed] AMQP server on
undercloud-0.ctlplane.localdomain:5672 is unreachable: [Errno 104]
Connection reset by peer. Trying again in 1 seconds.:
ConnectionResetError: [Errno 104] Connection reset by peer


nova/nova-api.log:2019-04-22 22:38:11.530 19 ERROR
oslo.messaging._drivers.impl_rabbit
[req-d7767aed-e32d-43db-96a8-c0509bfb1cfe
9ac89090d2d24949b9a1e01b1afb14cc 7becac88cbae4b3b962ecccaf536effe -
default default] [c0f3fe7f-db89-42c6-95bd-f367a4fbf680] AMQP server on
undercloud-0.ctlplane.localdomain:5672 is unreachable: Server
unexpectedly closed connection. Trying again in 1 seconds.: OSError:
Server unexpectedly closed connection

The errors being thrown are different, so perhaps these are two different problems.


Correct, I think our original issue with erratic AMQP heartbeats and mod_wsgi
was due to a change in how we run healthchecks in Stein in TripleO-deployed
environments, so it seems different from what Iain originally experienced...

For the record, up to Rocky we ran healthcheck scripts every 30 seconds,
which guaranteed that eventlet would wake up and send an AMQP heartbeat
packet whenever a service had seen no AMQP traffic in the last 15s. It
also guaranteed that any incoming AMQP heartbeat packet from rabbitmq
would be processed within at most 30s.
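
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python. The option names and the 60s/2 values are assumptions based on
oslo.messaging's rabbit driver defaults, not values taken from this
deployment:

# Assumed oslo.messaging rabbit driver defaults (not verified against
# this deployment):
heartbeat_timeout_threshold = 60  # rabbitmq closes an idle connection after this
heartbeat_rate = 2                # heartbeats to send per timeout window

# With those values, the heartbeat greenthread aims to run roughly
# every 15 seconds:
heartbeat_send_interval = heartbeat_timeout_threshold / heartbeat_rate / 2.0

# Up to Rocky, a healthcheck script poked each service this often:
healthcheck_interval = 30

# Worst case, the heartbeat greenthread only gets to run when a
# healthcheck wakes the process, so consecutive heartbeats end up
# roughly healthcheck_interval apart: later than the 15s target, but
# still well inside the 60s timeout, so rabbitmq keeps the connection.
assert healthcheck_interval < heartbeat_timeout_threshold

Once the wake-up interval approaches or exceeds that 60s window, rabbitmq
is free to drop the idle connection, which is what the "Connection reset
by peer" / "Server unexpectedly closed connection" errors above look like
from the client side.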

In Stein, our healthchecks are now triggered via systemd timers, and the
timer interval is currently too high to guarantee that mod_wsgi will
always wake up in time to send/receive AMQP heartbeats to/from rabbitmq
when there's no other traffic.
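
To illustrate the underlying mechanism outside of mod_wsgi (a deliberately
simplified, standalone sketch; the 15s/120s values are only examples and
fake_heartbeat merely stands in for oslo.messaging's heartbeat thread): a
greenthread with a short timer never runs while the main thread is blocked
outside the eventlet hub, however short its sleep is.

import eventlet
eventlet.monkey_patch()

import time

def fake_heartbeat():
    # Stand-in for the AMQP heartbeat greenthread.
    while True:
        print("heartbeat at", time.monotonic())
        eventlet.sleep(15)

eventlet.spawn(fake_heartbeat)

# Simulate an idle WSGI worker: block the OS thread with the
# *unpatched* sleep, so the eventlet hub never gets control.
# Nothing is printed for the whole 120 seconds, even though the
# greenthread wanted to fire every 15 seconds.
eventlet.patcher.original('time').sleep(120)

# A single cooperative yield (any patched I/O or sleep would do)
# is enough to let the starved greenthread finally run:
eventlet.sleep(0)

In the real services, the periodic healthcheck traffic roughly played the
role of that final yield; remove it, or space it out too far, and the
heartbeat greenthread is starved long enough for rabbitmq to give up on
the connection.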

The fix is being tracked in https://bugs.launchpad.net/tripleo/+bug/1826281
 
Thanks,
-Alex



--
Damien