[oslo][oslo-messaging][nova] Stein nova-api AMQP issue running under uWSGI
Is anyone running Stein nova-api under uWSGI? I'm trying to do this, in exactly the same way as I did with Rocky, but encountering a new problem...

The symptom is intermittent failure of API operations that involve RPC calls - notably getting the noVNC console URL for an instance. The nova-api log shows RPC timeouts and "connection reset by peer" for AMQP. I've ruled out environmental issues, as a Rocky installation on the same hosts works fine.

With oslo debug enabled, I see that the first API call causes AMQP connections to be established, the RPC call gets its reply, and all is well... but there are no heartbeats logged after that. 3 minutes later, rabbitmq-server reports that it has received no heartbeats, and drops the connection. On a subsequent API call (but not before), nova-api logs "connection reset by peer" and the RPC call times out. With the Rocky code, once the first API call causes the AMQP connections to be established, regular heartbeats start happening, the connections are maintained, and subsequent API calls work as expected.

It seems like a lockup in a thread that's supposed to maintain the AMQP connection, or something. I tried to use SIGUSR2 to trigger guru meditation, but it seems that uWSGI has its own interpretation of SIGUSR2, so nova-api never sees it (anyone know a way around this?).

Running nova-api as a standalone service, instead of under uWSGI, the problem does not occur (I see the heartbeats, and everything works fine).

I've tried updating oslo-messaging (and python-amqp) on a Rocky installation to the Stein version, but the problem does not reproduce.

Ideas?

~iain
On Thu, Apr 18, 2019 at 7:51 AM <iain.macdonnell@oracle.com> wrote:
> Is anyone running Stein nova-api under uWSGI? I'm trying to do this, in exactly the same way as I did with Rocky, but encountering a new problem...
This sounds similar to a problem that we're currently debugging with John Eckersberg and Michele Baldessari (in CC) in Stein, but when nova-api runs under Apache mod_wsgi.

The symptoms for us seem similar: sometimes the scheduling of the AMQP heartbeat greenlet seems erratic (no run for 75 seconds), RabbitMQ drops the connection to the client during this idle period, and the next time nova-api needs to communicate over AMQP it logs a "connection reset by peer" and reestablishes a new connection.

This is fresh news from yesterday, so it is still to be confirmed/disproved; but currently we're investigating how eventlet schedules recurring tasks such as the AMQP heartbeat greenlet under Apache mod_wsgi's execution model.

Once the AMQP heartbeat greenlet has sent its heartbeat, it sleeps for a fixed amount of time; i.e. it suspends its execution, eventlet registers a timer for this greenlet, and continues scheduling other greenlets. Once the timer expires, eventlet marks the heartbeat greenlet as "eligible for scheduling" and eventually this greenlet resumes its execution. For that mechanism to work, eventlet regularly checks all the timers that are registered in its event loop (with select/epoll/whatnot).

Now, Apache mod_wsgi embeds its own Python interpreter, and calls it whenever there's an API call to process. It seems that - the precise course of events is still unclear to me - once the API call is processed, the Python interpreter returns a result to mod_wsgi and stops its execution, even if some timers were still registered in eventlet.

The state of the Python interpreter is conserved though, and the next time mod_wsgi receives an API call, it resumes the execution of the Python interpreter: eventlet resumes its execution and processes every timer that was registered previously. So ultimately the AMQP heartbeat greenlet is rescheduled, but that might be a long time after the last AMQP heartbeat.

So I have the feeling that when there's no traffic, mod_wsgi doesn't execute the Python interpreter, which prevents eventlet from running, watching for timers, and triggering the AMQP heartbeat greenlet. So far, the evidence in favour of this theory is:

- When there's no traffic for nova-api running under mod_wsgi, and I use gdb to dump the call stack of the mod_wsgi process, I see no CPython interpreter in the stack; whereas when there's traffic, I see many CPython stack frames.

- When there's no traffic to nova-api and the AMQP heartbeat greenlet seems to have been unscheduled for a long time, running a simple 'nova list' command seems to unblock everything and trigger an AMQP heartbeat.

This would also explain why running the nova-api service standalone works just fine, as you said, because the Python interpreter is never stopped in this execution model.

I'm still trying to figure out when the Python interpreter returns to mod_wsgi, to make sure what I'm saying makes sense. And I also need to double-check that running standalone works fine for me as well.
-- Damien
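To make the timer mechanism described above concrete, here is a minimal sketch (not the actual oslo.messaging code) of a periodic heartbeat implemented as an eventlet greenlet. The key point is that eventlet.sleep() only registers a timer with the hub; the greenlet resumes only when something drives the hub's event loop - which appears to be exactly what does not happen while the WSGI server sits idle between requests:

    import time
    import eventlet

    def heartbeat_loop(interval=15):
        while True:
            # Placeholder for the real AMQP heartbeat I/O.
            print("heartbeat sent at %.0f" % time.time())
            # Registers a timer with the eventlet hub and yields; the greenlet
            # only resumes when the hub runs and fires that timer.
            eventlet.sleep(interval)

    hb = eventlet.spawn(heartbeat_loop, 5)
    # The rest of the process must keep yielding to the hub (green I/O,
    # sleeps, etc.) for the heartbeat timer ever to fire.
    eventlet.sleep(30)
    hb.kill()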
Hi Damien,

Thanks for the response! This definitely sounds similar enough that it has to be the same root problem.

I didn't make too much progress today. Whilst using strace to try to compare Rocky and Stein, I noticed that TCP connections seem to be handled somewhat differently. This applies to mysql as well as AMQP. In the Rocky case, I see repeated fcntl() calls to F_SETFL O_RDWR between sendto() and recvfrom():

[pid 32427] socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 19
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 32427] connect(19, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("x.x.x.x")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 32427] poll([{fd=19, events=POLLOUT}], 1, 10000) = 1 ([{fd=19, revents=POLLOUT}])
[pid 32427] getsockopt(19, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid 32427] setsockopt(19, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 32427] fcntl(19, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] setsockopt(19, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] recvfrom(19, "J\0\0\0\n5.7.25\0\f\r\0\0#\4;;\n)\0309\0\377\377\10\2\0\377\301"..., 8192, 0, NULL, NULL) = 78
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] sendto(19, "\222\0\0\1\17\242:\0\377\377\377\0!\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 150, 0, NULL, 0) = 150
[pid 32427] fcntl(19, F_GETFL) = 0x2 (flags O_RDWR)
[pid 32427] fcntl(19, F_SETFL, O_RDWR) = 0
[pid 32427] recvfrom(19, ",\0\0\2\376mysql_native_password\0#\4;;\n"..., 8192, 0, NULL, NULL) = 48

whereas in the Stein case, I see recvfrom() calls that get EAGAIN, followed by epoll before a successful recvfrom():

[pid 17895] socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 28
[pid 17895] fcntl(28, F_GETFL) = 0x2 (flags O_RDWR)
[pid 17895] fcntl(28, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 17895] connect(28, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("x.x.x.x")}, 16) = -1 EINPROGRESS (Operation now in progress)
[pid 17895] epoll_ctl(26, EPOLL_CTL_ADD, 28, {EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=28, u64=6398156018563416092}}) = 0
[pid 17895] epoll_wait(26, [{EPOLLOUT, {u32=28, u64=6398156018563416092}}], 1023, 1992) = 1
[pid 17895] epoll_ctl(26, EPOLL_CTL_DEL, 28, 0x7fb5ed1f7440) = 0
[pid 17895] getsockopt(28, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
[pid 17895] connect(28, {sa_family=AF_INET, sin_port=htons(3306), sin_addr=inet_addr("10.241.154.33")}, 16) = 0
[pid 17895] setsockopt(28, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid 17895] setsockopt(28, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 17895] recvfrom(28, "J\0\0\0\n5.7.25\0(!\0\0\10w@\v-~)D\0\377\377\10\2\0\377\301"..., 8192, 0, NULL, NULL) = 78
[pid 17895] sendto(28, "\222\0\0\1\17\242:\0\377\377\377\0!\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 150, 0, NULL, 0) = 150
[pid 17895] recvfrom(28, 0x7fb5d401e340, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
[pid 17895] epoll_ctl(26, EPOLL_CTL_ADD, 28, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=28, u64=6398156018563416092}}) = 0
[pid 17895] epoll_wait(26, [{EPOLLIN, {u32=28, u64=6398156018563416092}}], 1023, 1991) = 1
[pid 17895] epoll_ctl(26, EPOLL_CTL_DEL, 28, 0x7fb5ed1f6530) = 0
[pid 17895] recvfrom(28, ",\0\0\2\376mysql_native_password\0\10w@\v-"..., 8192, 0, NULL, NULL) = 48

Not sure if this is related to the problem or not, but it's what jumped out first, and I wasn't able to figure out what's causing the change in behaviour before I ran out of enthusiasm today.

Please let me know if you discover anything further and/or if there are any discussions (IRC?) that I should be participating in... although I'm guessing that that's happening during the EU work day.

Thanks!

~iain
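For what it's worth, the epoll-driven pattern in the Stein trace is consistent with what eventlet's green sockets produce once the socket module has been monkey-patched, which lines up with the finding further down this thread. A trivial way to see the substitution (a sketch; the exact class shown depends on the eventlet version):

    import socket
    print(socket.socket)      # the plain stdlib socket class

    import eventlet
    eventlet.monkey_patch()   # patches socket (among others) by default

    print(socket.socket)      # now eventlet's green socket wrapper, which runs
                              # non-blocking and waits via the epoll-based hub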
Today I discovered that this problem appears to be caused by eventlet monkey-patching. I've created a bug for it:

https://bugs.launchpad.net/nova/+bug/1825584

~iain
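For anyone who wants to confirm the same thing in their own deployment, eventlet can report whether a given stdlib module has been monkey-patched in the running process. A minimal check (illustrative only, e.g. dropped temporarily into the WSGI application's startup code):

    import eventlet.patcher

    for name in ('os', 'select', 'socket', 'thread', 'time'):
        print("%s monkey-patched: %s"
              % (name, eventlet.patcher.is_monkey_patched(name)))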
On Fri, Apr 19, 2019 at 03:20:44PM -0700, iain.macdonnell@oracle.com wrote:
> Today I discovered that this problem appears to be caused by eventlet monkey-patching. I've created a bug for it:
Hi,

just for completeness, we see this very same issue also with mistral (actually it was the first service where we noticed the missed heartbeats). IIRC Alex Schultz mentioned seeing it in ironic as well, although I have not personally observed it there yet.

cheers,
Michele
-- Michele Baldessari <michele@acksyn.org> C2A5 9DA3 9961 4FFB E01B D0BC DDD4 DCCB 7515 5C6D
On 4/20/19 1:38 AM, Michele Baldessari wrote:
> just for completeness we see this very same issue also with mistral (actually it was the first service where we noticed the missed heartbeats). iirc Alex Schultz mentioned seeing it in ironic as well, although I have not personally observed it there yet.
Is Mistral also mixing eventlet monkeypatching and WSGI?
On Mon, Apr 22, 2019 at 11:28 AM Ben Nemec <openstack@nemebean.com> wrote:
> Is Mistral also mixing eventlet monkeypatching and WSGI?
Looks like there is monkey patching; however, we noticed it with the engine/executor, so it's likely not just WSGI. I think I also saw it in the ironic-conductor, though I'd have to try it out again. I'll spin up an undercloud today and see if I can get a more complete list of affected services. It was pretty easy to reproduce.

Thanks,
-Alex
On 4/22/19 12:53 PM, Alex Schultz wrote:
Okay, I asked because if there's no WSGI/Eventlet combination then this may be different from the Nova issue that prompted this thread. It sounds like that was being caused by a bad interaction between WSGI and some Eventlet timers. If there's no WSGI involved then I wouldn't expect that to happen. I guess we'll see what further investigation turns up, but based on the preliminary information there may be two bugs here.
On Mon, Apr 22, 2019 at 12:25 PM Ben Nemec <openstack@nemebean.com> wrote:
On 4/22/19 12:53 PM, Alex Schultz wrote:
On Mon, Apr 22, 2019 at 11:28 AM Ben Nemec <openstack@nemebean.com> wrote:
On 4/20/19 1:38 AM, Michele Baldessari wrote:
On Fri, Apr 19, 2019 at 03:20:44PM -0700, iain.macdonnell@oracle.com wrote:
Today I discovered that this problem appears to be caused by eventlet monkey-patching. I've created a bug for it:
Hi,
just for completeness we see this very same issue also with mistral (actually it was the first service where we noticed the missed heartbeats). iirc Alex Schultz mentioned seeing it in ironic as well, although I have not personally observed it there yet.
Is Mistral also mixing eventlet monkeypatching and WSGI?
Looks like there is monkey patching, however we noticed it with the engine/executor. So it's likely not just wsgi. I think I also saw it in the ironic-conductor, though I'd have to try it out again. I'll spin up an undercloud today and see if I can get a more complete list of affected services. It was pretty easy to reproduce.
Okay, I asked because if there's no WSGI/Eventlet combination then this may be different from the Nova issue that prompted this thread. It sounds like that was being caused by a bad interaction between WSGI and some Eventlet timers. If there's no WSGI involved then I wouldn't expect that to happen.
I guess we'll see what further investigation turns up, but based on the preliminary information there may be two bugs here.
So I wasn't able to reproduce the ironic issues yet, but it's the mistral executor and nova-api which exhibit the issue on the undercloud.

mistral/executor.log:2019-04-22 22:40:58.321 7 ERROR oslo.messaging._drivers.impl_rabbit [-] [b7b4bc40-767c-4de1-b77b-6a5822f6beed] AMQP server on undercloud-0.ctlplane.localdomain:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer

nova/nova-api.log:2019-04-22 22:38:11.530 19 ERROR oslo.messaging._drivers.impl_rabbit [req-d7767aed-e32d-43db-96a8-c0509bfb1cfe 9ac89090d2d24949b9a1e01b1afb14cc 7becac88cbae4b3b962ecccaf536effe - default default] [c0f3fe7f-db89-42c6-95bd-f367a4fbf680] AMQP server on undercloud-0.ctlplane.localdomain:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection

The errors being thrown are different, so perhaps it is two different problems.

Thanks,
-Alex
On Tue, Apr 23, 2019 at 1:18 AM Alex Schultz <aschultz@redhat.com> wrote:
Correct, I think our original issue with erratic AMQP heartbeats and mod_wsgi was due to a change in how we run healthchecks in Stein in TripleO-deployed environments, so it seems different to what Iain originally experienced...

For the record, up to Rocky, we used to run healthcheck scripts every 30 seconds, which guarantees that eventlet will wake up and send an AMQP heartbeat packet if a service had no AMQP traffic in the last 15s. It also guarantees that any incoming AMQP heartbeat packet from rabbitmq will be processed in at most 30s.

In Stein, our healthchecks are now triggered via systemd timers, and the current timer setting is too high to guarantee that mod_wsgi will always wake up on time to send/receive AMQP heartbeats to/from rabbitmq when there's no traffic.

The fix is being tracked in https://bugs.launchpad.net/tripleo/+bug/1826281
-- Damien
On 4/25/19 1:05 AM, Damien Ciabrini wrote:
If I'm reading this right, it sounds like the periodic healthcheck is working around the underlying issue of the heartbeats not happening by themselves (due to eventlet monkey-patching somehow interfering with threading). The whole point of the heartbeats is to continually maintain the connection while there's no real traffic.

~iain
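One possible direction for keeping the heartbeats firing without any API traffic driving the process - sketched here purely as an illustration, not as what oslo.messaging actually ships - is to run the heartbeat loop in a native OS thread that eventlet has not green-ified, so it keeps running even when the eventlet hub is idle between WSGI requests:

    import eventlet
    from eventlet import patcher

    eventlet.monkey_patch()

    # patcher.original() returns the unpatched stdlib modules, so the loop
    # below gets a real OS thread and a real blocking sleep.
    real_threading = patcher.original('threading')
    real_time = patcher.original('time')

    def native_heartbeat_loop(interval=15):
        while True:
            print("heartbeat from native thread")  # placeholder for AMQP I/O
            real_time.sleep(interval)

    t = real_threading.Thread(target=native_heartbeat_loop)
    t.daemon = True
    t.start()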
On Thu, Apr 25, 2019 at 11:19 AM <iain.macdonnell@oracle.com> wrote:
I would agree with Iain here - it sounds as if something in mod_wsgi (or elsewhere) is blocking in the OS outside of the purview of eventlet.

FWIW here's the oslo.messaging thread responsible for sending heartbeats from the RPC client:

https://opendev.org/openstack/oslo.messaging/src/branch/stable/stein/oslo_me...

Can you verify that the event.wait() call at the bottom of the loop is not waking up as per the passed-in _heartbeat_wait_timeout?

thanks

-- Ken Giusti (kgiusti@gmail.com)
On 4/25/19 11:19 AM, Ken Giusti wrote:
On Thu, Apr 25, 2019 at 11:19 AM <iain.macdonnell@oracle.com <mailto:iain.macdonnell@oracle.com>> wrote:
On 4/25/19 1:05 AM, Damien Ciabrini wrote: > > > On Tue, Apr 23, 2019 at 1:18 AM Alex Schultz <aschultz@redhat.com <mailto:aschultz@redhat.com> > <mailto:aschultz@redhat.com <mailto:aschultz@redhat.com>>> wrote: > > On Mon, Apr 22, 2019 at 12:25 PM Ben Nemec <openstack@nemebean.com <mailto:openstack@nemebean.com> > <mailto:openstack@nemebean.com <mailto:openstack@nemebean.com>>> wrote: > > > > > > > > On 4/22/19 12:53 PM, Alex Schultz wrote: > > > On Mon, Apr 22, 2019 at 11:28 AM Ben Nemec > <openstack@nemebean.com <mailto:openstack@nemebean.com> <mailto:openstack@nemebean.com <mailto:openstack@nemebean.com>>> wrote: > > >> > > >> > > >> > > >> On 4/20/19 1:38 AM, Michele Baldessari wrote: > > >>> On Fri, Apr 19, 2019 at 03:20:44PM -0700, > iain.macdonnell@oracle.com <mailto:iain.macdonnell@oracle.com> <mailto:iain.macdonnell@oracle.com <mailto:iain.macdonnell@oracle.com>> wrote: > > >>>> > > >>>> Today I discovered that this problem appears to be caused by > eventlet > > >>>> monkey-patching. I've created a bug for it: > > >>>> > > >>>> https://bugs.launchpad.net/nova/+bug/1825584 <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.launchpad.net_nova_-2Bbug_1825584&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=RxYkIjeLZPK2frXV_wEUCq8d3wvUIvDPimUcunMwbMs&m=yuWVlqvT-TMIF0OQxtByfrZjswKo56fkyrbKrvy9oHs&s=mJYdHr4mPQVjex9HrhekR_IvRsktOkvDNGQBnuB_ehI&e=> > <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.launchpad.net_nova_-2Bbug_1825584&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=RxYkIjeLZPK2frXV_wEUCq8d3wvUIvDPimUcunMwbMs&m=fpephoqJ2hzc-jFPr6Rtupc2U02HBjBRd-_Lq66zQBk&s=Vb1Yw7ZkrglH0AMBZSHNfuilS5gBwh9yF2o2trkXYyM&e=> > > >>> > > >>> Hi, > > >>> > > >>> just for completeness we see this very same issue also with > > >>> mistral (actually it was the first service where we noticed > the missed > > >>> heartbeats). iirc Alex Schultz mentioned seeing it in ironic > as well, > > >>> although I have not personally observed it there yet. > > >> > > >> Is Mistral also mixing eventlet monkeypatching and WSGI? > > >> > > > > > > Looks like there is monkey patching, however we noticed it with the > > > engine/executor. So it's likely not just wsgi. I think I also > saw it > > > in the ironic-conductor, though I'd have to try it out again. I'll > > > spin up an undercloud today and see if I can get a more > complete list > > > of affected services. It was pretty easy to reproduce. > > > > Okay, I asked because if there's no WSGI/Eventlet combination > then this > > may be different from the Nova issue that prompted this thread. It > > sounds like that was being caused by a bad interaction between > WSGI and > > some Eventlet timers. If there's no WSGI involved then I wouldn't > expect > > that to happen. > > > > I guess we'll see what further investigation turns up, but based > on the > > preliminary information there may be two bugs here. > > > > So I wasn't able to reproduce the ironic issues yet. But it's the > mistral executor and nova-api which exhibit the issue on the > undercloud. > > mistral/executor.log:2019-04-22 22:40:58.321 7 ERROR > oslo.messaging._drivers.impl_rabbit [-] > [b7b4bc40-767c-4de1-b77b-6a5822f6beed] AMQP server on > undercloud-0.ctlplane.localdomain:5672 is unreachable: [Errno 104] > Connection reset by peer. 
Trying again in 1 seconds.: > ConnectionResetError: [Errno 104] Connection reset by peer > > > nova/nova-api.log:2019-04-22 22:38:11.530 19 ERROR > oslo.messaging._drivers.impl_rabbit > [req-d7767aed-e32d-43db-96a8-c0509bfb1cfe > 9ac89090d2d24949b9a1e01b1afb14cc 7becac88cbae4b3b962ecccaf536effe - > default default] [c0f3fe7f-db89-42c6-95bd-f367a4fbf680] AMQP server on > undercloud-0.ctlplane.localdomain:5672 is unreachable: Server > unexpectedly closed connection. Trying again in 1 seconds.: OSError: > Server unexpectedly closed connection > > The errors being thrown are different perhaps it is two different > problems. > > > Correct, I think our original issue with erratic AMQP hearbeats and mod_wsgi > was due to a change in how we run healthcheck in Stein in TripleO-deployed > environments, so different to what Iain originally experienced it seems... > > For the record, up to Rocky, we used to run healthcheck scripts > every 30 seconds, which guarantees that eventlet will wake up and > send an AMQP heartbeat packet if a service had no AMQP traffic in the last > 15s. It also guarantees that any incoming AMQP heartbeat packet from > rabbitmq will be processed in at most 30s. > > In Stein, our healthchecks are now triggered via systemd timers, and the > current time setting is too high to guarantee that mod_wsgi will always > wake up on time to send/receive AMQP heartbeats to/from rabbitmq > when there's no traffic.
If I'm reading this right, it sounds like the periodic healthcheck is working around the underlying issue of the heartbeats not happening by themselves (due to eventlet monkey-patching somehow interfering with threading). The whole point of the heartbeats is to continually maintain the connection while there's no real traffic.
~iain
I would agree with Iain here - it sounds as if something in mod_wsgi (or elsewhere) is blocking in the os outside of the purview of eventlet.
FWIW here's the oslo.messaging thread responsible for sending heartbeats from the RPC client:
https://opendev.org/openstack/oslo.messaging/src/branch/stable/stein/oslo_me... <https://urldefense.proofpoint.com/v2/url?u=https-3A__opendev.org_openstack_oslo.messaging_src_branch_stable_stein_oslo-5Fmessaging_-5Fdrivers_impl-5Frabbit.py-23L897&d=DwMFaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=RxYkIjeLZPK2frXV_wEUCq8d3wvUIvDPimUcunMwbMs&m=yuWVlqvT-TMIF0OQxtByfrZjswKo56fkyrbKrvy9oHs&s=dIFtanzYyLQXGnr8R9RMZLiIq-HaEn0BokQxsytq7ok&e=>
Can you verify that the event.wait() call at the bottom of the loop is not waking up per the passed-in _heartbeat_wait_timeout? Thanks.
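(For context, _heartbeat_wait_timeout is derived from the rabbit heartbeat options. A rough sketch of the arithmetic, assuming the Stein-era defaults and the threshold/rate/2 formula - worth double-checking against the impl_rabbit.py linked above:)

    # Rough sketch of where _heartbeat_wait_timeout comes from. The option
    # names, defaults, and the threshold/rate/2 formula are assumptions based
    # on the Stein-era oslo.messaging defaults; check the linked source.
    heartbeat_timeout_threshold = 60.0   # [oslo_messaging_rabbit] heartbeat_timeout_threshold
    heartbeat_rate = 2.0                 # [oslo_messaging_rabbit] heartbeat_rate

    heartbeat_wait_timeout = heartbeat_timeout_threshold / heartbeat_rate / 2.0
    print(heartbeat_wait_timeout)        # 15.0 -> the loop should wake at least every 15s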
I added extra debug logging before and after the call to wait(). The "before" line gets logged, but the "after" one does not (when monkey-patching is in place). Here's what it looks like with monkey-patching disabled: http://paste.openstack.org/show/kUMhrSN8aiyleDgQ99EC/ and with it enabled: http://paste.openstack.org/show/DlwG9iBjLvR7lX6NHlie/ In both cases, I had started the WSGI app, made an API call that required use of RPC, then waited for several minutes. ~iain
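A standalone way to see the same "before logged, after never logged" behaviour, without nova-api or mod_wsgi at all, is to starve the eventlet hub with an unpatched OS-level sleep. A minimal sketch (assumed scenario, purely illustrative; not the nova-api or oslo.messaging code):

    # Reproduce "before" being logged but "after" never appearing while
    # nothing drives the eventlet hub.
    import eventlet
    eventlet.monkey_patch()

    import logging
    import threading
    import time
    from eventlet import patcher

    logging.basicConfig(level=logging.DEBUG, format='%(asctime)s %(message)s')
    LOG = logging.getLogger('heartbeat-demo')

    exit_event = threading.Event()          # a green Event after monkey-patching

    def heartbeat_loop():
        while not exit_event.is_set():
            LOG.debug('before wait(15.0)')
            exit_event.wait(timeout=15.0)   # yields to the hub; no OS-level sleep
            LOG.debug('after wait()')

    threading.Thread(target=heartbeat_loop, daemon=True).start()

    time.sleep(0.1)  # patched sleep: lets the hub run once, so "before" is logged

    # Now starve the hub the way an idle mod_wsgi/uWSGI worker effectively does:
    # block the OS thread outside eventlet's control. Even though well over 15s
    # pass, the hub never services the wait() timer, so "after wait()" is never
    # logged during this period.
    patcher.original('time').sleep(40)
    exit_event.set()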
On Mon, Apr 22, 2019 at 11:52 PM Ben Nemec <openstack@nemebean.com> wrote:
On 4/22/19 12:53 PM, Alex Schultz wrote:
On Mon, Apr 22, 2019 at 11:28 AM Ben Nemec <openstack@nemebean.com> wrote:
On 4/20/19 1:38 AM, Michele Baldessari wrote:
On Fri, Apr 19, 2019 at 03:20:44PM -0700, iain.macdonnell@oracle.com wrote:
Today I discovered that this problem appears to be caused by eventlet monkey-patching. I've created a bug for it: https://bugs.launchpad.net/nova/+bug/1825584
Hi,
just for completeness we see this very same issue also with mistral (actually it was the first service where we noticed the missed heartbeats). iirc Alex Schultz mentioned seeing it in ironic as well, although I have not personally observed it there yet.
Is Mistral also mixing eventlet monkeypatching and WSGI?
Looks like there is monkey patching, however we noticed it with the engine/executor. So it's likely not just wsgi. I think I also saw it in the ironic-conductor, though I'd have to try it out again. I'll spin up an undercloud today and see if I can get a more complete list of affected services. It was pretty easy to reproduce.
Okay, I asked because if there's no WSGI/Eventlet combination then this may be different from the Nova issue that prompted this thread. It sounds like that was being caused by a bad interaction between WSGI and some Eventlet timers. If there's no WSGI involved then I wouldn't expect that to happen.
Just to update (since the RPC error being discussed in this thread and the one in the mistral bug (SSL+python3+wsgi) are different): mistral does mix eventlet and WSGI; I filed a bug about it a few months back: https://bugs.launchpad.net/mistral/+bug/1808953
I guess we'll see what further investigation turns up, but based on the preliminary information there may be two bugs here.
On Mon, Apr 22, 2019 at 01:21:03PM -0500, Ben Nemec wrote:
I guess we'll see what further investigation turns up, but based on the preliminary information there may be two bugs here.
So just to get some closure on this error that we have seen around the mistral executor and tripleo with python3: it was due to the ansible action calling subprocess, which has a different implementation in python3, so the monkey-patching needs to be adapted. The review which fixes it for us is here: https://review.opendev.org/#/c/656901/

Damien and I think the nova_api/eventlet/mod_wsgi issue has a separate root cause (although we have not spent all too much time on that one yet).

cheers,
Michele
--
Michele Baldessari <michele@acksyn.org>
C2A5 9DA3 9961 4FFB E01B D0BC DDD4 DCCB 7515 5C6D
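As a general debugging aid when chasing monkey-patching differences between services (this is not the mistral fix, which is the review linked above), eventlet can report what it has patched in the current process; a small sketch:

    # Quick diagnostic: show which module families eventlet has monkey-patched
    # in the current process. Not the mistral fix -- just a debugging aid.
    import eventlet
    eventlet.monkey_patch()

    from eventlet import patcher

    for name in ('os', 'select', 'socket', 'thread', 'time'):
        print(name, patcher.is_monkey_patched(name))

    # The unpatched originals stay reachable, e.g. for a real OS-level sleep:
    real_time = patcher.original('time')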
On Fri, May 3, 2019 at 7:59 PM Michele Baldessari <michele@acksyn.org> wrote:
So just to get some closure on this error that we have seen around mistral executor and tripleo with python3: this was due to the ansible action that called subprocess which has a different implementation in python3 and so the monkeypatching needs to be adapted.
Review which fixes it for us is here: https://review.opendev.org/#/c/656901/
Damien and I think the nova_api/eventlet/mod_wsgi has a separate root-cause (although we have not spent all too much time on that one yet)
Right, after further investigation, it appears that the problem we saw under mod_wsgi was due to monkey patching, as Iain originally reported. It has nothing to do with our work on healthchecks.

It turns out that running the AMQP heartbeat thread under mod_wsgi doesn't work when the threading library is monkey-patched, because the thread waits on a data structure [1] that has been monkey-patched [2], which makes it yield its execution instead of sleeping for 15s.

Because mod_wsgi stops the execution of its embedded interpreter, the AMQP heartbeat thread can't be resumed until there's a message to be processed in the mod_wsgi queue, which would resume the Python interpreter and make eventlet resume the thread.

Disabling monkey-patching in nova_api makes the scheduling issue go away.

Note: other services like heat-api do not use monkey patching and aren't affected, which seems to confirm that monkey-patching shouldn't happen in nova_api running under mod_wsgi in the first place.

[1] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_driv...
[2] https://github.com/openstack/oslo.utils/blob/master/oslo_utils/eventletutils...
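A minimal, self-contained sketch of the mechanism being described (plain eventlet, no mod_wsgi involved; names are illustrative): after monkey_patch(), the heartbeat "thread" is a greenlet, and its periodic wait() only fires while something lets the eventlet hub run.

    # With monkey-patching, the heartbeat "thread" is a greenlet and
    # Event.wait() just registers a hub timer instead of sleeping at OS level.
    import eventlet
    eventlet.monkey_patch()   # patches threading, time, ...

    import threading
    import time

    wakeups = []

    def heartbeat_loop(stop):
        while not stop.is_set():
            wakeups.append(time.monotonic())
            stop.wait(1.0)    # green wait: yields to the hub

    stop = threading.Event()
    threading.Thread(target=heartbeat_loop, args=(stop,), daemon=True).start()

    # As long as the rest of the process cooperates (the patched time.sleep
    # yields to the hub), the loop wakes up on schedule:
    time.sleep(3.5)
    print(len(wakeups), 'wakeups')   # roughly 4

    # Under mod_wsgi the interpreter may simply not be entered between
    # requests, so no hub iteration happens and those timers are never
    # serviced -- which is exactly the stall described above.
    stop.set()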
On 5/4/19 4:14 PM, Damien Ciabrini wrote:
Right, after further investigation, it appears that the problem we saw under mod_wsgi was due to monkey patching, as Iain originally reported. It has nothing to do with our work on healthchecks.
It turns out that running the AMQP heartbeat thread under mod_wsgi doesn't work when the threading library is monkey_patched, because the thread waits on a data structure [1] that has been monkey patched [2], which makes it yield its execution instead of sleeping for 15s.
Because mod_wsgi stops the execution of its embedded interpreter, the AMQP heartbeat thread can't be resumed until there's a message to be processed in the mod_wsgi queue, which would resume the python interpreter and make eventlet resume the thread.
Disabling monkey-patching in nova_api makes the scheduling issue go away.
This sounds like the right long-term solution, but it seems unlikely to be backportable to the existing releases. As I understand it some nova-api functionality has an actual dependency on monkey-patching. Is there a workaround? Maybe periodically poking the API to wake up the wsgi interpreter?
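For illustration, the kind of poke being suggested could be as simple as the sketch below; the endpoint URL and interval are assumptions, and this is a band-aid rather than a fix. As Iain notes further down, with multiple WSGI workers there is also no guarantee that any given request lands on the worker whose heartbeat is stalled.

    #!/usr/bin/env python3
    # Band-aid sketch of "periodically poke the API": any request that reaches
    # a WSGI worker re-enters the interpreter and lets eventlet service its
    # pending timers (including the AMQP heartbeat). Endpoint and interval are
    # assumptions for illustration only.
    import time
    import urllib.request

    NOVA_API = 'http://127.0.0.1:8774/'   # hypothetical local nova-api endpoint
    INTERVAL = 10                          # seconds; keep well under the 15s heartbeat interval

    while True:
        try:
            urllib.request.urlopen(NOVA_API, timeout=5).read()
        except Exception as exc:           # a failed poke is non-fatal
            print('poke failed:', exc)
        time.sleep(INTERVAL)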
Note: other services like heat-api do not use monkey patching and aren't affected, so this seem to confirm that monkey-patching shouldn't happen in nova_api running under mod_wsgi in the first place.
[1] https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_driv... [2] https://github.com/openstack/oslo.utils/blob/master/oslo_utils/eventletutils...
On 5/7/19 2:45 PM, Ben Nemec wrote:
This sounds like the right long-term solution, but it seems unlikely to be backportable to the existing releases. As I understand it some nova-api functionality has an actual dependency on monkey-patching. Is there a workaround? Maybe periodically poking the API to wake up the wsgi interpreter?
I've been pondering things like that ... but if I have multiple WSGI processes, can I be sure that an API-poke will hit the one(s) that need it? This is a road-block for me upgrading to Stein. I really don't want to have to go back to running nova-api standalone, but that's increasingly looking like the only "safe" option :/ ~iain
On Tue, 7 May 2019 15:22:36 -0700, Iain Macdonnell <iain.macdonnell@oracle.com> wrote:
I've been pondering things like that ... but if I have multiple WSGI processes, can I be sure that an API-poke will hit the one(s) that need it?
This is a road-block for me upgrading to Stein. I really don't want to have to go back to running nova-api standalone, but that's increasingly looking like the only "safe" option :/
FWIW, I have a patch series that aims to re-eliminate the eventlet dependency in nova-api: https://review.opendev.org/657750 (top patch), in case you're able to give it a try. If it helps, then maybe we could backport it to Stein if folks are in support. -melanie
participants (8)
- Alex Schultz
- Ben Nemec
- Damien Ciabrini
- iain.macdonnell@oracle.com
- Ken Giusti
- melanie witt
- Michele Baldessari
- Yatin Karel