On Thu, Apr 18, 2019 at 7:51 AM <iain.macdonnell@oracle.com> wrote:

Is anyone running Stein nova-api under uWSGI? I'm trying to do this, in
exactly the same way as I did with Rocky, but encountering a new problem...


This sounds similar to a problem that we're currently debugging with John
Eckersberg and Michele Baldessari (in CC) in Stein, but when nova-api runs
under Apache mod_wsgi.

The symptoms for us seem similar: sometimes the scheduling of the AMQP
heartbeat greenlet seems heratic (no run for 75 seconds), RabbitMQ
drops the connection to the client during this idle period, and next time
nova-api needs to communicate over AMQP it logs a "connection reset
by peer" and reestablish a new connection.

This is fresh news from yesterday so this is still to be confirmed/disproved;
but currently we're investigating how eventlet schedules recurring tasks
such as the AMQP heartbeat greenlet under Apache mod_wsgi's execution
model.

Once the AMQP heartbeat greenlet has sent its heartbeat, it sleeps for a
fixed amount of time; i.e. it suspends its execution, eventlet registers a timer
for this greenlet, and continues scheduling other greenlets. Once the timer
expires, eventlet marks the heartbeat greenlet as "eligible for scheduling"
and eventually this greenlet resumes its execution.
For that mechanism to work, eventlet regularly watches for all the timers
that are registered in its event loop (with select/epoll/whatnot).

Now, Apache mod_wsgi embeds its own Python interpreter, and calls it
whenever there's an API call to process. It seems that - the precise course
of event is unclear to me yet - once the API call is processed, the Python
Interpreter returns a result to mod_wsgi and stops its execution even if
some timers were still registered in eventlet.

The state of the Python Interpreter is conserved though, and next time
mod_wsgi receives an API call, it resumes the execution of the Python
Interpreter: eventlet resumes its execution and process every timer
that were registered previously. So ultimately, the AMQP heartbeat
greenlet is rescheduled, but that might be a long time already since the
last AMQP heartbeat.

So I have the feeling that when there's no traffic, mod_wsgi doesn't execute
the Python interpreter, which prevents eventlet from running, watching for
timers, and triggering the AMQP heartbeat greenlet. So far, the evidences
in favour of this theory are:
  . When there's no traffic for nova-api running under mod_wsgi, and I use
    gdb to dump the call stack of the mod_wsgi process, I see no CPython
    interpreter in the stack. Whereas when there's traffic, I see many CPython
    stack frames.
  . When there's no traffic to nova-api and the AMQP heartbeat greenlet seem
    to be unscheduled for a long time, running a simple 'nova list' command
    seems to unblock everything and trigger a AMQP heartbeat.

This would also explain why running the nova-api service in standalone works
just fine as you said. because the Python interpreter is never stopped in this
execution model.

I'm still trying to figure out when the Python interpreter returns to mod_wsgi
to make sure what I'm saying makes sense. And I also need to double check
that running in standalone works fine for me as well.

The symptom is intermittent failure of API operations that involve RPC
calls - notably getting the noVNC console URL for an instance. The
nova-api log shows RPC timeouts and "connection reset by peer" for AMQP.
I've ruled out environmental issues, as a Rocky installation on the same
hosts works fine. With oslo debug enabled, I see that the first API call
causes a AMQP connections to be established, and the RPC call gets its
reply, and all is well... but there are no heartbeats logged after that.
3 minutes later, rabbitmq-server reports that it has received no
heartbeats, and drops the connection. On a subsequent API call (but not
before), nova-api logging "connection reset by peer" and the RPC call
times out. With the Rocky code, once the first API call causes the AMQP
connections to be established, regular heartbeats start happening, the
connections are maintained, and subsequent API calls work as expected.

It seems like a lockup in a thread that's supposed to maintain the AMQP
connection, or something. I tried to use SIGUSR2 to trigger guru
meditation, but it seems that uWSGI has its own interpretation of
SIGUSR2, so nova-api never sees it (anyone know a way around this?).

Running nova-api as a standalone service, instead of under uWSGI, the
problem does not occur (I see the heartbeats, and everything works fine).

I've tried updating oslo-messaging (and python-amqp) on a Rocky
installation to the Stein version, but the problem does not reproduce.

Ideas?

     ~iain





--
Damien