[oslo][oslo-messaging][nova] Stein nova-api AMQP issue running under uWSGI
Ken Giusti
kgiusti at gmail.com
Thu Apr 25 18:19:50 UTC 2019
On Thu, Apr 25, 2019 at 11:19 AM <iain.macdonnell at oracle.com> wrote:
>
>
> On 4/25/19 1:05 AM, Damien Ciabrini wrote:
> >
> >
> > On Tue, Apr 23, 2019 at 1:18 AM Alex Schultz <aschultz at redhat.com
> > <mailto:aschultz at redhat.com>> wrote:
> >
> > On Mon, Apr 22, 2019 at 12:25 PM Ben Nemec <openstack at nemebean.com
> > <mailto:openstack at nemebean.com>> wrote:
> > >
> > >
> > >
> > > On 4/22/19 12:53 PM, Alex Schultz wrote:
> > > > On Mon, Apr 22, 2019 at 11:28 AM Ben Nemec
> > <openstack at nemebean.com <mailto:openstack at nemebean.com>> wrote:
> > > >>
> > > >>
> > > >>
> > > >> On 4/20/19 1:38 AM, Michele Baldessari wrote:
> > > >>> On Fri, Apr 19, 2019 at 03:20:44PM -0700,
> > iain.macdonnell at oracle.com <mailto:iain.macdonnell at oracle.com>
> wrote:
> > > >>>>
> > > >>>> Today I discovered that this problem appears to be caused by
> > eventlet
> > > >>>> monkey-patching. I've created a bug for it:
> > > >>>>
> > > >>>> https://bugs.launchpad.net/nova/+bug/1825584
> > <
> https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.launchpad.net_nova_-2Bbug_1825584&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=RxYkIjeLZPK2frXV_wEUCq8d3wvUIvDPimUcunMwbMs&m=fpephoqJ2hzc-jFPr6Rtupc2U02HBjBRd-_Lq66zQBk&s=Vb1Yw7ZkrglH0AMBZSHNfuilS5gBwh9yF2o2trkXYyM&e=
> >
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> just for completeness we see this very same issue also with
> > > >>> mistral (actually it was the first service where we noticed
> > the missed
> > > >>> heartbeats). iirc Alex Schultz mentioned seeing it in ironic
> > as well,
> > > >>> although I have not personally observed it there yet.
> > > >>
> > > >> Is Mistral also mixing eventlet monkeypatching and WSGI?
> > > >>
> > > >
> > > > Looks like there is monkey patching, however we noticed it with
> the
> > > > engine/executor. So it's likely not just wsgi. I think I also
> > saw it
> > > > in the ironic-conductor, though I'd have to try it out again.
> I'll
> > > > spin up an undercloud today and see if I can get a more
> > complete list
> > > > of affected services. It was pretty easy to reproduce.
> > >
> > > Okay, I asked because if there's no WSGI/Eventlet combination
> > then this
> > > may be different from the Nova issue that prompted this thread. It
> > > sounds like that was being caused by a bad interaction between
> > WSGI and
> > > some Eventlet timers. If there's no WSGI involved then I wouldn't
> > expect
> > > that to happen.
> > >
> > > I guess we'll see what further investigation turns up, but based
> > on the
> > > preliminary information there may be two bugs here.
> > >
> >
> > So I wasn't able to reproduce the ironic issues yet. But it's the
> > mistral executor and nova-api which exhibit the issue on the
> > undercloud.
> >
> > mistral/executor.log:2019-04-22 22:40:58.321 7 ERROR
> > oslo.messaging._drivers.impl_rabbit [-]
> > [b7b4bc40-767c-4de1-b77b-6a5822f6beed] AMQP server on
> > undercloud-0.ctlplane.localdomain:5672 is unreachable: [Errno 104]
> > Connection reset by peer. Trying again in 1 seconds.:
> > ConnectionResetError: [Errno 104] Connection reset by peer
> >
> >
> > nova/nova-api.log:2019-04-22 22:38:11.530 19 ERROR
> > oslo.messaging._drivers.impl_rabbit
> > [req-d7767aed-e32d-43db-96a8-c0509bfb1cfe
> > 9ac89090d2d24949b9a1e01b1afb14cc 7becac88cbae4b3b962ecccaf536effe -
> > default default] [c0f3fe7f-db89-42c6-95bd-f367a4fbf680] AMQP server
> on
> > undercloud-0.ctlplane.localdomain:5672 is unreachable: Server
> > unexpectedly closed connection. Trying again in 1 seconds.: OSError:
> > Server unexpectedly closed connection
> >
> > The errors being thrown are different perhaps it is two different
> > problems.
> >
> >
> > Correct, I think our original issue with erratic AMQP hearbeats and
> mod_wsgi
> > was due to a change in how we run healthcheck in Stein in
> TripleO-deployed
> > environments, so different to what Iain originally experienced it
> seems...
> >
> > For the record, up to Rocky, we used to run healthcheck scripts
> > every 30 seconds, which guarantees that eventlet will wake up and
> > send an AMQP heartbeat packet if a service had no AMQP traffic in the
> last
> > 15s. It also guarantees that any incoming AMQP heartbeat packet from
> > rabbitmq will be processed in at most 30s.
> >
> > In Stein, our healthchecks are now triggered via systemd timers, and the
> > current time setting is too high to guarantee that mod_wsgi will always
> > wake up on time to send/receive AMQP heartbeats to/from rabbitmq
> > when there's no traffic.
>
> If I'm reading this right, it sounds like the periodic healthcheck is
> working around the underlying issue of the heartbeats not happening by
> themselves (due to eventlet monkey-patching somehow interfering with
> threading). The whole point of the heartbeats is to continually maintain
> the connection while there's no real traffic.
>
> ~iain
>
>
>
I would agree with Iain here - it sounds as if something in mod_wsgi (or
elsewhere) is blocking in the os outside of the purview of eventlet.
FWIW here's the oslo.messaging thread responsible for sending heartbeats
from the RPC client:
https://opendev.org/openstack/oslo.messaging/src/branch/stable/stein/oslo_messaging/_drivers/impl_rabbit.py#L897
Can you verify that the event.wait() call at the bottom of the loop is not
waking up as per the passed in _heartbeat_wait_timeout?
thanks
--
Ken Giusti (kgiusti at gmail.com)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstack.org/pipermail/openstack-discuss/attachments/20190425/332a2f91/attachment.html>
More information about the openstack-discuss
mailing list