[oslo][oslo-messaging][nova] Stein nova-api AMQP issue running under uWSGI
iain.macdonnell at oracle.com
Fri Apr 19 06:27:37 UTC 2019
Hi Damien,
Thanks for the response! This definitely sounds similar enough that it
has to be the same root problem.
I didn't make too much progress today. Whilst using strace to try to
compare Rocky and Stein, I noticed that TCP connections seem to be
handled somewhat differently. This applies to mysql as well as AMQP. In
the Rocky case, I see repeated fcntl(F_SETFL, O_RDWR) calls (i.e. the
socket being switched back to blocking mode) between sendto() and
recvfrom():
[pid 32427] socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 19
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 32427] connect(19, {sa_family=AF_INET, sin_port=htons(3306),
sin_addr=inet_addr("x.x.x.x")}, 16) = -1 EINPROGRESS (Operation now in
progress)
[pid 32427] poll([{fd=19, events=POLLOUT}], 1, 10000) = 1 ([{fd=19,
revents=POLLOUT}])
[pid 32427] getsockopt(19, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid 32427] setsockopt(19, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 32427] fcntl(19, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] setsockopt(19, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] recvfrom(19,
"J\0\0\0\n5.7.25\0\f\r\0\0#\4;;\n)\0309\0\377\377\10\2\0\377\301"...,
8192, 0, NULL, NULL) = 78
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] sendto(19,
"\222\0\0\1\17\242:\0\377\377\377\0!\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
150, 0, NULL, 0) = 150
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] recvfrom(19, ",\0\0\2\376mysql_native_password\0#\4;;\n"...,
8192, 0, NULL, NULL) = 48
whereas in the Stein case the socket stays non-blocking, and I see
recvfrom() calls that get EAGAIN, followed by epoll_wait() before a
successful recvfrom():
[pid 17895] socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 28
[pid 17895] fcntl(28, F_GETFL) = 0x2 (flags O_RDWR)
[pid 17895] fcntl(28, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 17895] connect(28, {sa_family=AF_INET, sin_port=htons(3306),
sin_addr=inet_addr("x.x.x.x")}, 16) = -1 EINPROGRESS (Operation now in
progress)
[pid 17895] epoll_ctl(26, EPOLL_CTL_ADD, 28,
{EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=28, u64=6398156018563416092}}) = 0
[pid 17895] epoll_wait(26, [{EPOLLOUT, {u32=28,
u64=6398156018563416092}}], 1023, 1992) = 1
[pid 17895] epoll_ctl(26, EPOLL_CTL_DEL, 28, 0x7fb5ed1f7440) = 0
[pid 17895] getsockopt(28, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid 17895] connect(28, {sa_family=AF_INET, sin_port=htons(3306),
sin_addr=inet_addr("10.241.154.33")}, 16) = 0
[pid 17895] setsockopt(28, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 17895] setsockopt(28, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 17895] recvfrom(28,
"J\0\0\0\n5.7.25\0(!\0\0\10w@\v-~)D\0\377\377\10\2\0\377\301"..., 8192,
0, NULL, NULL) = 78
[pid 17895] sendto(28,
"\222\0\0\1\17\242:\0\377\377\377\0!\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
150, 0, NULL, 0) = 150
[pid 17895] recvfrom(28, 0x7fb5d401e340, 8192, 0, NULL, NULL) = -1
EAGAIN (Resource temporarily unavailable)
[pid 17895] epoll_ctl(26, EPOLL_CTL_ADD, 28,
{EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=28, u64=6398156018563416092}}) = 0
[pid 17895] epoll_wait(26, [{EPOLLIN, {u32=28,
u64=6398156018563416092}}], 1023, 1991) = 1
[pid 17895] epoll_ctl(26, EPOLL_CTL_DEL, 28, 0x7fb5ed1f6530) = 0
[pid 17895] recvfrom(28,
",\0\0\2\376mysql_native_password\0\10w@\v-"..., 8192, 0, NULL, NULL) = 48
Not sure if this is related to the problem or not, but it's what jumped
out first, and I wasn't able to figure out what's causing the change in
behaviour before I ran out of enthusiasm today.
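FWIW, the Stein pattern looks a lot like what eventlet's green sockets do
once monkey-patching is in effect. I haven't confirmed that that's what's
actually going on under uWSGI, but a trivial standalone snippet like this
(host/port are placeholders) shows the same epoll_ctl/epoll_wait/EAGAIN
sequence under strace:

    # standalone experiment, not nova code: with eventlet monkey-patching,
    # ordinary socket calls go through eventlet's epoll hub rather than
    # plain blocking recvfrom(), which matches the Stein trace above
    import eventlet
    eventlet.monkey_patch()

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(("x.x.x.x", 3306))    # placeholder host/port
    print(s.recv(8192))             # EAGAIN internally, then epoll_wait until readable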
Please let me know if you discover anything further and/or if there are
any discussions (IRC?) that I should be participating in... although I'm
guessing that that's happening during the EU work day.
Thanks!
~iain
On 4/18/19 12:48 AM, Damien Ciabrini wrote:
>
>
> On Thu, Apr 18, 2019 at 7:51 AM <iain.macdonnell at oracle.com
> <mailto:iain.macdonnell at oracle.com>> wrote:
>
>
> > Is anyone running Stein nova-api under uWSGI? I'm trying to do this, in
> > exactly the same way as I did with Rocky, but encountering a new
> > problem...
>
>
> This sounds similar to a problem that we're currently debugging with John
> Eckersberg and Michele Baldessari (in CC) in Stein, but when nova-api runs
> under Apache mod_wsgi.
>
> The symptoms for us seem similar: sometimes the scheduling of the AMQP
> heartbeat greenlet seems erratic (no run for 75 seconds), RabbitMQ
> drops the connection to the client during this idle period, and the next
> time nova-api needs to communicate over AMQP it logs a "connection reset
> by peer" and re-establishes a new connection.
>
> This is fresh news from yesterday so this is still to be
> confirmed/disproved;
> but currently we're investigating how eventlet schedules recurring tasks
> such as the AMQP heartbeat greenlet under Apache mod_wsgi's execution
> model.
>
> Once the AMQP heartbeat greenlet has sent its heartbeat, it sleeps for a
> fixed amount of time; i.e. it suspends its execution, eventlet registers
> a timer for this greenlet, and continues scheduling other greenlets. Once
> the timer expires, eventlet marks the heartbeat greenlet as "eligible for
> scheduling" and eventually this greenlet resumes its execution.
> For that mechanism to work, eventlet regularly checks all the timers that
> are registered in its event loop (with select/epoll/whatnot).
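> Roughly, the pattern is something like this (a simplified sketch of the
> idea only, not the actual oslo.messaging code):
>
>     import eventlet
>
>     def _send_heartbeat():
>         # stand-in for the real AMQP heartbeat send
>         print("heartbeat sent")
>
>     def _heartbeat_loop(interval):
>         while True:
>             _send_heartbeat()
>             # suspends this greenlet and registers a timer with the hub;
>             # the greenlet only resumes once the hub runs and notices
>             # that the timer has expired
>             eventlet.sleep(interval)
>
>     heartbeat_greenlet = eventlet.spawn(_heartbeat_loop, 15)
>     eventlet.sleep(60)   # in this toy example, give the hub a chance to run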
>
> Now, Apache mod_wsgi embeds its own Python interpreter, and calls it
> whenever there's an API call to process. It seems that - the precise course
> of events is unclear to me yet - once the API call is processed, the Python
> interpreter returns a result to mod_wsgi and stops executing, even if
> some timers were still registered in eventlet.
>
> The state of the Python interpreter is preserved though, and the next time
> mod_wsgi receives an API call, it resumes the execution of the Python
> interpreter: eventlet resumes its execution and processes every timer
> that was registered previously. So ultimately the AMQP heartbeat
> greenlet is rescheduled, but that might already be a long time since the
> last AMQP heartbeat.
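> A standalone way to see that effect (nothing to do with the real nova/oslo
> code; the blocking time.sleep() just simulates the interpreter not being
> entered between API calls) would be something like:
>
>     import time
>
>     import eventlet
>
>     def heartbeat():
>         while True:
>             print("heartbeat at", time.strftime("%X"))
>             eventlet.sleep(15)    # intended 15-second interval
>
>     eventlet.spawn(heartbeat)
>     eventlet.sleep(0)             # let the greenlet run once
>
>     time.sleep(60)                # the hub never runs here: no heartbeats at all
>     eventlet.sleep(0)             # first yield after 60s: the expired timer finally fires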
>
> So I have the feeling that when there's no traffic, mod_wsgi doesn't execute
> the Python interpreter, which prevents eventlet from running, watching for
> timers, and triggering the AMQP heartbeat greenlet. So far, the evidence
> in favour of this theory is:
> . When there's no traffic for nova-api running under mod_wsgi, and I use
>   gdb to dump the call stack of the mod_wsgi process, I see no CPython
>   interpreter in the stack, whereas when there's traffic, I see many
>   CPython stack frames.
> . When there's no traffic to nova-api and the AMQP heartbeat greenlet seems
>   to be unscheduled for a long time, running a simple 'nova list' command
>   seems to unblock everything and trigger an AMQP heartbeat (see the
>   greenlet-dumping sketch just below).
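> (A small in-process complement to the gdb approach - dump the stack of every
> live greenlet to see whether the heartbeat greenlet is parked in its
> eventlet.sleep() - could look like this:)
>
>     import gc
>     import traceback
>
>     import greenlet
>
>     def dump_greenlets():
>         # print the current stack of every live greenlet in this process
>         for obj in gc.get_objects():
>             if isinstance(obj, greenlet.greenlet) and obj.gr_frame is not None:
>                 print("--- %r ---" % obj)
>                 traceback.print_stack(obj.gr_frame)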
>
> This would also explain why running the nova-api service standalone works
> just fine, as you said, because the Python interpreter is never stopped in
> that execution model.
>
> I'm still trying to figure out when the Python interpreter returns to
> mod_wsgi, to make sure what I'm saying makes sense. And I also need to
> double-check that running standalone works fine for me as well.
>
> > The symptom is intermittent failure of API operations that involve RPC
> > calls - notably getting the noVNC console URL for an instance. The
> > nova-api log shows RPC timeouts and "connection reset by peer" for AMQP.
> > I've ruled out environmental issues, as a Rocky installation on the same
> > hosts works fine. With oslo debug enabled, I see that the first API call
> > causes AMQP connections to be established, and the RPC call gets its
> > reply, and all is well... but there are no heartbeats logged after that.
> > 3 minutes later, rabbitmq-server reports that it has received no
> > heartbeats, and drops the connection. On a subsequent API call (but not
> > before), nova-api logs "connection reset by peer" and the RPC call
> > times out. With the Rocky code, once the first API call causes the AMQP
> > connections to be established, regular heartbeats start happening, the
> > connections are maintained, and subsequent API calls work as expected.
> >
> > It seems like a lockup in a thread that's supposed to maintain the AMQP
> > connection, or something. I tried to use SIGUSR2 to trigger guru
> > meditation, but it seems that uWSGI has its own interpretation of
> > SIGUSR2, so nova-api never sees it (anyone know a way around this?).
> >
> > Running nova-api as a standalone service, instead of under uWSGI, the
> > problem does not occur (I see the heartbeats, and everything works
> > fine).
> >
> > I've tried updating oslo-messaging (and python-amqp) on a Rocky
> > installation to the Stein version, but the problem does not reproduce.
> >
> > Ideas?
> >
> > ~iain
>
> --
> Damien