[oslo][oslo-messaging][nova] Stein nova-api AMQP issue running under uWSGI
iain.macdonnell at oracle.com
Fri Apr 19 06:27:37 UTC 2019
Hi Damien,
Thanks for the response! This definitely sounds similar enough that it
has to be the same root problem.
I didn't make too much progress today. Whilst using strace to try to
compare Rocky and Stein, I noticed that TCP connections seem to be
handled somewhat differently. This applies to mysql as well as AMQP. In
the Rocky case, I see repeated fcntl(F_SETFL, O_RDWR) calls (i.e. the
socket being switched back to blocking mode) between sendto() and
recvfrom():
[pid 32427] socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 19
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 32427] connect(19, {sa_family=AF_INET, sin_port=htons(3306),
sin_addr=inet_addr("x.x.x.x")}, 16) = -1 EINPROGRESS (Operation now in
progress)
[pid 32427] poll([{fd=19, events=POLLOUT}], 1, 10000) = 1 ([{fd=19,
revents=POLLOUT}])
[pid 32427] getsockopt(19, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid 32427] setsockopt(19, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 32427] fcntl(19, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] setsockopt(19, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] recvfrom(19,
"J\0\0\0\n5.7.25\0\f\r\0\0#\4;;\n)\0309\0\377\377\10\2\0\377\301"...,
8192, 0, NULL, NULL) = 78
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] sendto(19,
"\222\0\0\1\17\242:\0\377\377\377\0!\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
150, 0, NULL, 0) = 150
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] recvfrom(19, ",\0\0\2\376mysql_native_password\0#\4;;\n"...,
8192, 0, NULL, NULL) = 48
whereas in the Stein case the socket stays non-blocking, and I see
recvfrom() calls that get EAGAIN, followed by epoll_wait() before a
successful recvfrom():
[pid 17895] socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 28
[pid 17895] fcntl(28, F_GETFL) = 0x2 (flags O_RDWR)
[pid 17895] fcntl(28, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 17895] connect(28, {sa_family=AF_INET, sin_port=htons(3306),
sin_addr=inet_addr("x.x.x.x")}, 16) = -1 EINPROGRESS (Operation now in
progress)
[pid 17895] epoll_ctl(26, EPOLL_CTL_ADD, 28,
{EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=28, u64=6398156018563416092}}) = 0
[pid 17895] epoll_wait(26, [{EPOLLOUT, {u32=28,
u64=6398156018563416092}}], 1023, 1992) = 1
[pid 17895] epoll_ctl(26, EPOLL_CTL_DEL, 28, 0x7fb5ed1f7440) = 0
[pid 17895] getsockopt(28, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid 17895] connect(28, {sa_family=AF_INET, sin_port=htons(3306),
sin_addr=inet_addr("10.241.154.33")}, 16) = 0
[pid 17895] setsockopt(28, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 17895] setsockopt(28, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 17895] recvfrom(28,
"J\0\0\0\n5.7.25\0(!\0\0\10w@\v-~)D\0\377\377\10\2\0\377\301"..., 8192,
0, NULL, NULL) = 78
[pid 17895] sendto(28,
"\222\0\0\1\17\242:\0\377\377\377\0!\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
150, 0, NULL, 0) = 150
[pid 17895] recvfrom(28, 0x7fb5d401e340, 8192, 0, NULL, NULL) = -1
EAGAIN (Resource temporarily unavailable)
[pid 17895] epoll_ctl(26, EPOLL_CTL_ADD, 28,
{EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=28, u64=6398156018563416092}}) = 0
[pid 17895] epoll_wait(26, [{EPOLLIN, {u32=28,
u64=6398156018563416092}}], 1023, 1991) = 1
[pid 17895] epoll_ctl(26, EPOLL_CTL_DEL, 28, 0x7fb5ed1f6530) = 0
[pid 17895] recvfrom(28,
",\0\0\2\376mysql_native_password\0\10w@\v-"..., 8192, 0, NULL, NULL) = 48
Not sure if this is related to the problem or not, but it's what jumped
out first, and I wasn't able to figure out what's causing the change in
behaviour before I ran out of enthusiasm today.
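FWIW, the Stein pattern looks a lot like what eventlet's green sockets do
once monkey-patching is in effect. I haven't confirmed that that's what's
actually going on under uWSGI, but a trivial standalone snippet like this
(host/port are placeholders) shows the same epoll_ctl/epoll_wait/EAGAIN
sequence under strace:

    # standalone experiment, not nova code: with eventlet monkey-patching,
    # ordinary socket calls go through eventlet's epoll hub rather than
    # plain blocking recvfrom(), which matches the Stein trace above
    import eventlet
    eventlet.monkey_patch()

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(("x.x.x.x", 3306))    # placeholder host/port
    print(s.recv(8192))             # EAGAIN internally, then epoll_wait until readable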
Please let me know if you discover anything further and/or if there are
any discussions (IRC?) that I should be participating in... although I'm
guessing that that's happening during the EU work day.
Thanks!
~iain
On 4/18/19 12:48 AM, Damien Ciabrini wrote:
>
>
> On Thu, Apr 18, 2019 at 7:51 AM <iain.macdonnell at oracle.com
> <mailto:iain.macdonnell at oracle.com>> wrote:
>
>
> > Is anyone running Stein nova-api under uWSGI? I'm trying to do this, in
> > exactly the same way as I did with Rocky, but encountering a new
> > problem...
>
>
> This sounds similar to a problem that we're currently debugging with John
> Eckersberg and Michele Baldessari (in CC) in Stein, but when nova-api runs
> under Apache mod_wsgi.
>
> The symptoms for us seem similar: sometimes the scheduling of the AMQP
> heartbeat greenlet seems erratic (no run for 75 seconds), RabbitMQ
> drops the connection to the client during this idle period, and the next
> time nova-api needs to communicate over AMQP it logs a "connection reset
> by peer" and re-establishes a new connection.
>
> This is fresh news from yesterday so this is still to be
> confirmed/disproved;
> but currently we're investigating how eventlet schedules recurring tasks
> such as the AMQP heartbeat greenlet under Apache mod_wsgi's execution
> model.
>
> Once the AMQP heartbeat greenlet has sent its heartbeat, it sleeps for a
> fixed amount of time; i.e. it suspends its execution, eventlet registers
> a timer for this greenlet, and continues scheduling other greenlets. Once
> the timer expires, eventlet marks the heartbeat greenlet as "eligible for
> scheduling" and eventually this greenlet resumes its execution.
> For that mechanism to work, eventlet regularly checks all the timers that
> are registered in its event loop (with select/epoll/whatnot).
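> Roughly, the pattern is something like this (a simplified sketch of the
> idea only, not the actual oslo.messaging code):
>
>     import eventlet
>
>     def _send_heartbeat():
>         # stand-in for the real AMQP heartbeat send
>         print("heartbeat sent")
>
>     def _heartbeat_loop(interval):
>         while True:
>             _send_heartbeat()
>             # suspends this greenlet and registers a timer with the hub;
>             # the greenlet only resumes once the hub runs and notices
>             # that the timer has expired
>             eventlet.sleep(interval)
>
>     heartbeat_greenlet = eventlet.spawn(_heartbeat_loop, 15)
>     eventlet.sleep(60)   # in this toy example, give the hub a chance to run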
>
> Now, Apache mod_wsgi embeds its own Python interpreter, and calls it
> whenever there's an API call to process. It seems that - the precise course
> of events is unclear to me yet - once the API call is processed, the Python
> interpreter returns a result to mod_wsgi and stops executing, even if
> some timers were still registered in eventlet.
>
> The state of the Python interpreter is preserved though, and the next time
> mod_wsgi receives an API call, it resumes the execution of the Python
> interpreter: eventlet resumes its execution and processes every timer
> that was registered previously. So ultimately the AMQP heartbeat
> greenlet is rescheduled, but that might already be a long time since the
> last AMQP heartbeat.
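> A standalone way to see that effect (nothing to do with the real nova/oslo
> code; the blocking time.sleep() just simulates the interpreter not being
> entered between API calls) would be something like:
>
>     import time
>
>     import eventlet
>
>     def heartbeat():
>         while True:
>             print("heartbeat at", time.strftime("%X"))
>             eventlet.sleep(15)    # intended 15-second interval
>
>     eventlet.spawn(heartbeat)
>     eventlet.sleep(0)             # let the greenlet run once
>
>     time.sleep(60)                # the hub never runs here: no heartbeats at all
>     eventlet.sleep(0)             # first yield after 60s: the expired timer finally fires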
>
> So I have the feeling that when there's no traffic, mod_wsgi doesn't execute
> the Python interpreter, which prevents eventlet from running, watching for
> timers, and triggering the AMQP heartbeat greenlet. So far, the evidence
> in favour of this theory is:
> . When there's no traffic for nova-api running under mod_wsgi, and I use
>   gdb to dump the call stack of the mod_wsgi process, I see no CPython
>   interpreter in the stack, whereas when there's traffic, I see many
>   CPython stack frames.
> . When there's no traffic to nova-api and the AMQP heartbeat greenlet seems
>   to be unscheduled for a long time, running a simple 'nova list' command
>   seems to unblock everything and trigger an AMQP heartbeat (see the
>   greenlet-dumping sketch just below).
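> (A small in-process complement to the gdb approach - dump the stack of every
> live greenlet to see whether the heartbeat greenlet is parked in its
> eventlet.sleep() - could look like this:)
>
>     import gc
>     import traceback
>
>     import greenlet
>
>     def dump_greenlets():
>         # print the current stack of every live greenlet in this process
>         for obj in gc.get_objects():
>             if isinstance(obj, greenlet.greenlet) and obj.gr_frame is not None:
>                 print("--- %r ---" % obj)
>                 traceback.print_stack(obj.gr_frame)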
>
> This would also explain why running the nova-api service standalone works
> just fine, as you said, because the Python interpreter is never stopped in
> that execution model.
>
> I'm still trying to figure out when the Python interpreter returns to
> mod_wsgi, to make sure what I'm saying makes sense. And I also need to
> double-check that running standalone works fine for me as well.
>
> > The symptom is intermittent failure of API operations that involve RPC
> > calls - notably getting the noVNC console URL for an instance. The
> > nova-api log shows RPC timeouts and "connection reset by peer" for AMQP.
> > I've ruled out environmental issues, as a Rocky installation on the same
> > hosts works fine. With oslo debug enabled, I see that the first API call
> > causes AMQP connections to be established, and the RPC call gets its
> > reply, and all is well... but there are no heartbeats logged after that.
> > 3 minutes later, rabbitmq-server reports that it has received no
> > heartbeats, and drops the connection. On a subsequent API call (but not
> > before), nova-api logs "connection reset by peer" and the RPC call
> > times out. With the Rocky code, once the first API call causes the AMQP
> > connections to be established, regular heartbeats start happening, the
> > connections are maintained, and subsequent API calls work as expected.
> >
> > It seems like a lockup in a thread that's supposed to maintain the AMQP
> > connection, or something. I tried to use SIGUSR2 to trigger guru
> > meditation, but it seems that uWSGI has its own interpretation of
> > SIGUSR2, so nova-api never sees it (anyone know a way around this?).
> >
> > Running nova-api as a standalone service, instead of under uWSGI, the
> > problem does not occur (I see the heartbeats, and everything works
> > fine).
> >
> > I've tried updating oslo-messaging (and python-amqp) on a Rocky
> > installation to the Stein version, but the problem does not reproduce.
> >
> > Ideas?
> >
> > ~iain
>
> --
> Damien