Today I discovered that this problem appears to be caused by eventlet monkey-patching. I've created a bug for it:

https://bugs.launchpad.net/nova/+bug/1825584

~iain

On 4/18/19 11:27 PM, iain.macdonnell@oracle.com wrote:
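For anyone who wants to check their own environment, something like this (just a rough sketch; it assumes eventlet is importable in the process under investigation) shows which modules eventlet has green-patched:

    # rough sketch - run inside the wsgi process under investigation
    import eventlet.patcher

    for mod in ('os', 'select', 'socket', 'thread', 'time'):
        print(mod, eventlet.patcher.is_monkey_patched(mod))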
Hi Damien,
Thanks for the response! This definitely sounds similar enough that it has to be the same root problem.
I didn't make too much progress today. Whilst using strace to compare Rocky and Stein, I noticed that TCP connections seem to be handled somewhat differently. This applies to mysql as well as AMQP. In the Rocky case, I see repeated fcntl() calls setting the socket back to blocking mode (F_SETFL O_RDWR) between sendto() and recvfrom():
[pid 32427] socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 19
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 32427] connect(19, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("x.x.x.x")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 32427] poll([{fd=19, events=POLLOUT}], 1, 10000) = 1 ([{fd=19, revents=POLLOUT}])
[pid 32427] getsockopt(19, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid 32427] setsockopt(19, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 32427] fcntl(19, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] setsockopt(19, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] recvfrom(19, "J\0\0\0\n5.7.25\0\f\r\0\0#\4;;\n)\0309\0\377\377\10\2\0\377\301"..., 8192, 0, NULL, NULL) = 78
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] sendto(19, "\222\0\0\1\17\242:\0\377\377\377\0!\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 150, 0, NULL, 0) = 150
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] recvfrom(19, ",\0\0\2\376mysql_native_password\0#\4;;\n"..., 8192, 0, NULL, NULL) = 48
whereas in the Stein case, I see recvfrom() calls that get EAGAIN, followed by an epoll_wait() before a successful recvfrom():
[pid 17895] socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 28
[pid 17895] fcntl(28, F_GETFL) = 0x2 (flags O_RDWR)
[pid 17895] fcntl(28, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 17895] connect(28, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("x.x.x.x")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 17895] epoll_ctl(26, EPOLL_CTL_ADD, 28, {EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=28, u64=6398156018563416092}}) = 0
[pid 17895] epoll_wait(26, [{EPOLLOUT, {u32=28, u64=6398156018563416092}}], 1023, 1992) = 1
[pid 17895] epoll_ctl(26, EPOLL_CTL_DEL, 28, 0x7fb5ed1f7440) = 0
[pid 17895] getsockopt(28, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid 17895] connect(28, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("10.241.154.33")}, 16) = 0
[pid 17895] setsockopt(28, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 17895] setsockopt(28, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 17895] recvfrom(28, "J\0\0\0\n5.7.25\0(!\0\0\10w@\v-~)D\0\377\377\10\2\0\377\301"..., 8192, 0, NULL, NULL) = 78
[pid 17895] sendto(28, "\222\0\0\1\17\242:\0\377\377\377\0!\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 150, 0, NULL, 0) = 150
[pid 17895] recvfrom(28, 0x7fb5d401e340, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 17895] epoll_ctl(26, EPOLL_CTL_ADD, 28, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=28, u64=6398156018563416092}}) = 0
[pid 17895] epoll_wait(26, [{EPOLLIN, {u32=28, u64=6398156018563416092}}], 1023, 1991) = 1
[pid 17895] epoll_ctl(26, EPOLL_CTL_DEL, 28, 0x7fb5ed1f6530) = 0
[pid 17895] recvfrom(28, ",\0\0\2\376mysql_native_password\0\10w@\v-"..., 8192, 0, NULL, NULL) = 48
Not sure if this is related to the problem or not, but it's what jumped out first, and I wasn't able to figure out what's causing the change in behaviour before I ran out of enthusiasm today.
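If anyone wants to see the same difference outside of nova, here's a rough sketch (eventlet required; the "x.x.x.x"/3306 host and port are placeholders for a reachable MySQL server) - with monkey_patch() the recv goes through eventlet's green socket and the hub's epoll, and without it you get the plain blocking pattern:

    # rough sketch; "x.x.x.x"/3306 are placeholders for a reachable MySQL host
    import eventlet
    eventlet.monkey_patch()   # comment out to compare with the plain blocking pattern

    import socket

    s = socket.create_connection(("x.x.x.x", 3306))
    # With monkey-patching, the fd stays O_NONBLOCK: an EAGAIN from recv()
    # suspends this greenlet and the hub's epoll_wait() decides when to retry,
    # which is the pattern in the second strace above.
    banner = s.recv(8192)
    print(len(banner), "bytes of server greeting")
    s.close()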
Please let me know if you discover anything further, and/or if there are any discussions (IRC?) that I should be participating in ... although I'm guessing that's happening during the EU work day.
Thanks!
~iain
On 4/18/19 12:48 AM, Damien Ciabrini wrote:
On Thu, Apr 18, 2019 at 7:51 AM <iain.macdonnell@oracle.com> wrote:
Is anyone running Stein nova-api under uWSGI? I'm trying to do this, in exactly the same way as I did with Rocky, but encountering a new problem...
This sounds similar to a problem that we're currently debugging with John Eckersberg and Michele Baldessari (in CC) in Stein, but when nova-api runs under Apache mod_wsgi.
The symptoms for us seem similar: sometimes the scheduling of the AMQP heartbeat greenlet is erratic (no run for 75 seconds), RabbitMQ drops the connection to the client during this idle period, and the next time nova-api needs to communicate over AMQP it logs a "connection reset by peer" and reestablishes a new connection.
This is fresh news from yesterday, so it's still to be confirmed or disproved, but currently we're investigating how eventlet schedules recurring tasks such as the AMQP heartbeat greenlet under Apache mod_wsgi's execution model.
Once the AMQP heartbeat greenlet has sent its heartbeat, it sleeps for a fixed amount of time; i.e. it suspends its execution, eventlet registers a timer for this greenlet, and continues scheduling other greenlets. Once the timer expires, eventlet marks the heartbeat greenlet as "eligible for scheduling" and eventually this greenlet resumes its execution. For that mechanism to work, eventlet regularly watches for all the timers that are registered in its event loop (with select/epoll/whatnot).
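To make that concrete, here's a minimal sketch of the pattern being described (the names are made up, this is not the actual oslo.messaging heartbeat code):

    # minimal sketch of the recurring-greenlet pattern; not oslo.messaging's code
    import eventlet
    eventlet.monkey_patch()


    def _heartbeat_loop(interval=15):
        while True:
            print("send AMQP heartbeat")   # stands in for the real heartbeat I/O
            # Suspends this greenlet and registers a timer with the hub;
            # the hub's event loop must run for the timer to ever fire again.
            eventlet.sleep(interval)


    hb = eventlet.spawn(_heartbeat_loop)
    # The main greenlet must keep yielding to the hub (e.g. by serving requests
    # or sleeping) or the heartbeat greenlet never gets resumed.
    eventlet.sleep(60)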
Now, Apache mod_wsgi embeds its own Python interpreter, and calls it whenever there's an API call to process. It seems that - the precise course of events is still unclear to me - once the API call is processed, the Python interpreter returns a result to mod_wsgi and stops executing, even if some timers are still registered in eventlet.
The state of the Python interpreter is preserved though, and the next time mod_wsgi receives an API call, it resumes the execution of the Python interpreter: eventlet resumes running and processes every timer that was registered previously. So ultimately the AMQP heartbeat greenlet is rescheduled, but by then it may already be a long time since the last AMQP heartbeat.
So I have the feeling that when there's no traffic, mod_wsgi doesn't execute the Python interpreter, which prevents eventlet from running, watching for timers, and triggering the AMQP heartbeat greenlet. So far, the evidence in favour of this theory is:

. When there's no traffic for nova-api running under mod_wsgi, and I use gdb to dump the call stack of the mod_wsgi process, I see no CPython interpreter in the stack. Whereas when there's traffic, I see many CPython stack frames.

. When there's no traffic to nova-api and the AMQP heartbeat greenlet seems to have been unscheduled for a long time, running a simple 'nova list' command seems to unblock everything and trigger an AMQP heartbeat.
This would also explain why running the nova-api service standalone works just fine, as you said, because the Python interpreter is never stopped in that execution model.
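If you want to observe this from the outside, a toy WSGI app along these lines (made-up names, not nova's actual wsgi entry point) can be deployed under mod_wsgi/uWSGI and under a standalone eventlet server, and the reported gap shows whether the background greenlet keeps running while the service is idle:

    # toy WSGI app to observe the behaviour described above; names are made up
    import time

    import eventlet
    eventlet.monkey_patch()

    _last_beat = [time.time()]


    def _heartbeat_loop():
        while True:
            _last_beat[0] = time.time()   # stands in for real heartbeat I/O
            eventlet.sleep(15)


    eventlet.spawn(_heartbeat_loop)


    def application(environ, start_response):
        # If the embedded interpreter (and hence eventlet's hub) only runs while
        # a request is being handled, this gap keeps growing while the service
        # is idle; under a continuously running eventlet server it stays ~15s.
        gap = time.time() - _last_beat[0]
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [('seconds since last heartbeat: %.1f\n' % gap).encode()]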
I'm still trying to figure out when the Python interpreter returns to mod_wsgi to make sure what I'm saying makes sense. And I also need to double check that running in standalone works fine for me as well.
The symptom is intermittent failure of API operations that involve RPC calls - notably getting the noVNC console URL for an instance. The nova-api log shows RPC timeouts and "connection reset by peer" for AMQP. I've ruled out environmental issues, as a Rocky installation on the same hosts works fine.

With oslo debug enabled, I see that the first API call causes AMQP connections to be established, the RPC call gets its reply, and all is well... but no heartbeats are logged after that. 3 minutes later, rabbitmq-server reports that it has received no heartbeats, and drops the connection. On a subsequent API call (but not before), nova-api logs "connection reset by peer" and the RPC call times out.

With the Rocky code, once the first API call causes the AMQP connections to be established, regular heartbeats start happening, the connections are maintained, and subsequent API calls work as expected.
It seems like a lockup in a thread that's supposed to maintain the AMQP connection, or something. I tried to use SIGUSR2 to trigger guru meditation, but it seems that uWSGI has its own interpretation of SIGUSR2, so nova-api never sees it (anyone know a way around this?).
When I run nova-api as a standalone service, instead of under uWSGI, the problem does not occur (I see the heartbeats, and everything works fine).
I've tried updating oslo-messaging (and python-amqp) on a Rocky installation to the Stein version, but the problem does not reproduce.
Ideas?
~iain
-- Damien