On Tue, Apr 23, 2019 at 1:18 AM Alex Schultz <aschultz@redhat.com> wrote:
On Mon, Apr 22, 2019 at 12:25 PM Ben Nemec <openstack@nemebean.com> wrote:
On 4/22/19 12:53 PM, Alex Schultz wrote:
On Mon, Apr 22, 2019 at 11:28 AM Ben Nemec <openstack@nemebean.com>
wrote:
On 4/20/19 1:38 AM, Michele Baldessari wrote:
On Fri, Apr 19, 2019 at 03:20:44PM -0700, iain.macdonnell@oracle.com
wrote:
Today I discovered that this problem appears to be caused by
eventlet
monkey-patching. I've created a bug for it:
Hi,
just for completeness we see this very same issue also with mistral (actually it was the first service where we noticed the missed heartbeats). iirc Alex Schultz mentioned seeing it in ironic as well, although I have not personally observed it there yet.
Is Mistral also mixing eventlet monkeypatching and WSGI?
Looks like there is monkey patching, however we noticed it with the engine/executor. So it's likely not just wsgi. I think I also saw it in the ironic-conductor, though I'd have to try it out again. I'll spin up an undercloud today and see if I can get a more complete list of affected services. It was pretty easy to reproduce.
Okay, I asked because if there's no WSGI/Eventlet combination then this may be different from the Nova issue that prompted this thread. It sounds like that was being caused by a bad interaction between WSGI and some Eventlet timers. If there's no WSGI involved then I wouldn't expect that to happen.
I guess we'll see what further investigation turns up, but based on the preliminary information there may be two bugs here.
So I wasn't able to reproduce the ironic issues yet. But it's the mistral executor and nova-api which exhibit the issue on the undercloud.
mistral/executor.log:2019-04-22 22:40:58.321 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [b7b4bc40-767c-4de1-b77b-6a5822f6beed] AMQP server on undercloud-0.ctlplane.localdomain:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
nova/nova-api.log:2019-04-22 22:38:11.530 19 ERROR oslo.messaging._drivers.impl_rabbit [req-d7767aed-e32d-43db-96a8-c0509bfb1cfe 9ac89090d2d24949b9a1e01b1afb14cc 7becac88cbae4b3b962ecccaf536effe - default default] [c0f3fe7f-db89-42c6-95bd-f367a4fbf680] AMQP server on undercloud-0.ctlplane.localdomain:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
The errors being thrown are different perhaps it is two different problems.
Correct, I think our original issue with erratic AMQP hearbeats and mod_wsgi was due to a change in how we run healthcheck in Stein in TripleO-deployed environments, so different to what Iain originally experienced it seems... For the record, up to Rocky, we used to run healthcheck scripts every 30 seconds, which guarantees that eventlet will wake up and send an AMQP heartbeat packet if a service had no AMQP traffic in the last 15s. It also guarantees that any incoming AMQP heartbeat packet from rabbitmq will be processed in at most 30s. In Stein, our healthchecks are now triggered via systemd timers, and the current time setting is too high to guarantee that mod_wsgi will always wake up on time to send/receive AMQP heartbeats to/from rabbitmq when there's no traffic. The fix is being tracked in https://bugs.launchpad.net/tripleo/+bug/1826281
Thanks, -Alex
-- Damien