[ops][neutron] After an upgrade to OpenStack Queens, Neutron is unable to communicate properly with RabbitMQ
Hi,

We had a Pike OpenStack setup that we updated to Queens earlier this week. It's a 30-compute-node infrastructure with 2 controller nodes and 2 network nodes, using Open vSwitch for networking. Since we upgraded to Queens, neutron-server on the controller nodes has been unable to contact the openvswitch-agents through RabbitMQ. RabbitMQ is clustered on both controller nodes and has been giving us the following error when neutron-server connections fail:

=ERROR REPORT==== 5-Jun-2019::18:50:08 === closing AMQP connection <0.23859.0> (10.30.0.11:53198 -> 10.30.0.11:5672 - neutron-server:1170:ccf11f31-2b3b-414e-ab19-5ee2cf5dd15d): missed heartbeats from client, timeout: 60s

The neutron-server logs show this error:

2019-06-05 18:50:33.132 1169 ERROR oslo.messaging._drivers.impl_rabbit [req-17167988-c6f2-475e-8b6a-90b92777e03a - - - - -] [b7684919-c98b-402e-90c3-59a0b5eccd1f] AMQP server on controller1:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: error: [Errno 104] Connection reset by peer
2019-06-05 18:50:33.217 1169 ERROR oslo.messaging._drivers.impl_rabbit [-] [bd6900e0-ab7b-4139-920c-a456d7df023b] AMQP server on controller1:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: RecoverableConnectionError: <RecoverableConnectionError: unknown error>

The relevant service version numbers are as follows:
rabbitmq-server-3.6.5-1.el7.noarch
openstack-neutron-12.0.6-1.el7.noarch
python2-oslo-messaging-5.35.4-1.el7.noarch

RabbitMQ does not show any alert. It also has plenty of memory and a high enough file limit. The login user and credentials are fine, as they are used by other OpenStack services which can reach RabbitMQ without issue.

I've tried optimizing RabbitMQ, upgrading, downgrading, increasing timeouts in the Neutron services, etc., to no avail. I find myself at a loss and would appreciate it if anyone has any idea as to where to go from here.

Best regards,

Jean-Philippe Méthot
OpenStack system administrator
PlanetHoster inc.
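(For context, the 60-second heartbeat that RabbitMQ complains about and the retry behaviour in the neutron-server log are driven by oslo.messaging options in neutron.conf. The excerpt below shows the usual knobs with their default values, purely as an illustration; the thread does not say which values were actually changed.)

[oslo_messaging_rabbit]
# AMQP heartbeat interval that RabbitMQ reports as "timeout: 60s" (default 60)
heartbeat_timeout_threshold = 60
# how many times per heartbeat interval the connection is checked (default 2)
heartbeat_rate = 2

[DEFAULT]
# how long an RPC call waits for a reply before giving up (default 60)
rpc_response_timeout = 60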
On 6/5/19 1:01 PM, Jean-Philippe Méthot wrote:
Hi,
We had a Pike OpenStack setup that we updated to Queens earlier this week. It's a 30-compute-node infrastructure with 2 controller nodes and 2 network nodes, using Open vSwitch for networking. Since we upgraded to Queens, neutron-server on the controller nodes has been unable to contact the openvswitch-agents through RabbitMQ. RabbitMQ is clustered on both controller nodes and has been giving us the following error when neutron-server connections fail:
=ERROR REPORT==== 5-Jun-2019::18:50:08 === closing AMQP connection <0.23859.0> (10.30.0.11:53198 -> 10.30.0.11:5672 - neutron-server:1170:ccf11f31-2b3b-414e-ab19-5ee2cf5dd15d): missed heartbeats from client, timeout: 60s
The neutron-server logs show this error:
2019-06-05 18:50:33.132 1169 ERROR oslo.messaging._drivers.impl_rabbit [req-17167988-c6f2-475e-8b6a-90b92777e03a - - - - -] [b7684919-c98b-402e-90c3-59a0b5eccd1f] AMQP server on controller1:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: error: [Errno 104] Connection reset by peer
2019-06-05 18:50:33.217 1169 ERROR oslo.messaging._drivers.impl_rabbit [-] [bd6900e0-ab7b-4139-920c-a456d7df023b] AMQP server on controller1:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: RecoverableConnectionError: <RecoverableConnectionError: unknown error>
Are there possibly any firewall rules getting in the way? Connection reset by peer usually means the other end has sent a TCP Reset, which wouldn't happen if the permissions were wrong.

As a test, does this connect?

$ telnet controller1 5672
Trying $IP...
Connected to controller1.
Escape character is '^]'.

-Brian
Hi,

Thank you for your reply. There's no firewall. However, we ended up figuring out that we were running out of TCP sockets.

On a related note, we are still having issues, but only with metadata served through Neutron. It seems nova-api is refusing the connection with an HTTP 500 error when the metadata agent tries to reach it. This is a completely different issue and may be more related to Nova than Neutron, so this may well not be the right mail thread to discuss it.

Best regards,

Jean-Philippe Méthot
OpenStack system administrator
PlanetHoster inc.
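(A few quick checks for that kind of socket exhaustion, as a rough sketch; the thread doesn't say which commands were actually used:)

$ ss -s                                   # socket summary: totals, TIME-WAIT count, etc.
$ ss -tan state time-wait | wc -l         # sockets stuck in TIME-WAIT on the controller
$ sysctl net.ipv4.ip_local_port_range     # ephemeral port range available to clients
$ rabbitmqctl status | grep -A 4 file_descriptors   # sockets_used vs sockets_limit on the broker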
Hi Jean-Philippe,
We had a very similar issue after upgrading to Neutron Queens. In fact, all Neutron agents were "down" according to the status API and messages weren't getting through. IIRC, this only happened in regions which had more load than the others.

We applied a bunch of fixes which I suspect are only a bunch of band-aids. Here are the changes we made:

* Split neutron-api from neutron-server. Create a whole new controller running neutron-api with mod_wsgi.
* Increase [database]/max_overflow = 200
* Disable the RabbitMQ heartbeat in oslo.messaging: [oslo_messaging_rabbit]/heartbeat_timeout_threshold = 0
* Increase [agent]/report_interval = 120
* Increase [DEFAULT]/agent_down_time = 600

We also have these sysctl settings because a firewall was dropping sessions, but those have been on the servers forever:

net.ipv4.tcp_keepalive_time = 30
net.ipv4.tcp_keepalive_intvl = 1
net.ipv4.tcp_keepalive_probes = 5

We never figured out why a service that worked before the upgrade no longer does. This is kind of frustrating, as it caused us all sorts of intermittent issues and stress during our upgrade.

Hope this helps.

--
Mathieu
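For reference, this is roughly what Mathieu's option list looks like when collected into neutron.conf; the section and option names and the values are taken from his message, and the comments only note which side of the deployment reads each option:

[database]
# allow more overflow connections in the SQLAlchemy pool (read by neutron-server)
max_overflow = 200

[oslo_messaging_rabbit]
# 0 disables the oslo.messaging AMQP heartbeat entirely
heartbeat_timeout_threshold = 0

[agent]
# how often each agent reports its state to neutron-server, in seconds (read by the agents)
report_interval = 120

[DEFAULT]
# how long neutron-server waits without a report before marking an agent dead, in seconds
agent_down_time = 600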
We have experienced similar issues when upgrading from Mitaka to Rocky. Distributing the RabbitMQ connections between the RabbitMQ nodes helps a lot, at least with larger deployments, since not all reconnecting services will then establish their connections against a single RabbitMQ server.
oslo_messaging_rabbit/kombu_failover_strategy = shuffle
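A minimal sketch of how that looks in neutron.conf, assuming the two clustered brokers from this thread; the host names, user, and password are placeholders, and transport_url simply lists every cluster member so reconnecting clients have more than one candidate:

[DEFAULT]
# list both RabbitMQ cluster members; "shuffle" below then spreads reconnects across them
transport_url = rabbit://openstack:RABBIT_PASS@controller1:5672,openstack:RABBIT_PASS@controller2:5672/

[oslo_messaging_rabbit]
# pick the next broker at random on failover instead of always starting from the first host
kombu_failover_strategy = shuffle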
An alternative is to increase the SSL (and/or TCP) acceptors on RabbitMQ to allow it to process new connections faster.
num_tcp_acceptors / num_ssl_acceptors
https://github.com/rabbitmq/rabbitmq-server/blob/master/docs/rabbitmq.config...
https://groups.google.com/forum/#!topic/rabbitmq-users/0ApuN2ES0Ks
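In the Erlang-terms rabbitmq.config format that 3.6.x uses, that would look something like the sketch below; the value 20 is only an example (the defaults are lower), and the ssl entry only matters if TLS listeners are enabled. RabbitMQ needs a restart for the change to take effect.

[
  {rabbit, [
    %% more acceptor processes for new plain-TCP AMQP connections
    {num_tcp_acceptors, 20},
    %% same for TLS listeners, if ssl_listeners are configured
    {num_ssl_acceptors, 20}
  ]}
].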
We had a very similar issue after upgrading to Neutron Queens. In fact, all Neutron agents were "down" according to status API and messages weren't getting through. IIRC, this only happened in regions which had more load than the others.
We haven't quite figured this one out yet, but just after the upgrade Neutron handles only about 1-2 of these messages per second. If we restart Neutron it consumes messages very fast for a few minutes and then slows down again, and a few hours after the upgrade it consumes them without issue. We ended up making similar tuning changes:
report_interval = 60
agent_down_time = 150
The most problematic thing for us so far has been Neutron's memory usage. We see it peak at 8.2 GB per neutron-server (RPC) instance, which means we can only run ~10 neutron-rpc workers on a 128 GB machine.

Best Regards, Erik Olof Gunnar Andersson
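(The worker count Erik refers to is normally set with the rpc_workers option in neutron.conf; a minimal illustration, with 10 being the figure from his message:)

[DEFAULT]
# number of RPC worker processes forked by neutron-server; each is a separate Python process
rpc_workers = 10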
Hi,

Can you give me an idea of how you split the API from the server part? I'm guessing it has to do with pointing the API endpoint to a specific server, but keeping the Neutron info in the config files pointing to the controller?

Contrary to what I said on this thread last week, we've been plagued by this issue every 24 hours or so, needing to restart the controller nodes to restore stability. We did implement several of the tweaks suggested in this thread's previous emails, but we are only now considering splitting the API from the main servers, as you did.

Jean-Philippe Méthot
OpenStack system administrator
PlanetHoster inc.
Hi,

On Mon, Jun 10, 2019 at 12:53 PM Jean-Philippe Méthot <jp.methot@planethoster.info> wrote:
Hi,
Can you give me an idea of how you split the API from the server part? I’m guessing it has to do with pointing the API endpoint to a specific server, but keeping the neutron info in config files pointing to the controller?
Contrary to what I said on this thread last week, we’ve been plagued with this issue every 24 hours or so, needing to restart the controller nodes to restore stability. We did implement several of the tweaks that were suggested in this thread’s previous emails, but we are only now considering splitting the API from the main servers, as you did.
I followed this procedure to use mod_wsgi and updated DNS to point to the new machine/IP:
https://docs.openstack.org/neutron/rocky/admin/config-wsgi.html#neutron-api-...

You can run neutron-rpc-server if you want to remove the API part from neutron-server.

Mathieu
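For anyone following along, a minimal mod_wsgi vhost along the lines of that guide might look like the sketch below. The /usr/bin/neutron-api WSGI script and the log paths are assumptions based on typical packaging, so treat the linked document as authoritative; neutron-rpc-server then runs as a separate service to keep handling agent RPC.

Listen 9696
<VirtualHost *:9696>
    # a pool of WSGI processes serving only the Neutron REST API
    WSGIDaemonProcess neutron-api processes=4 threads=1 user=neutron group=neutron display-name=%{GROUP}
    WSGIProcessGroup neutron-api
    # neutron-api is the WSGI entry-point script shipped with Neutron (path may differ per distro)
    WSGIScriptAlias / /usr/bin/neutron-api
    WSGIApplicationGroup %{GLOBAL}
    WSGIPassAuthorization On
    ErrorLog /var/log/httpd/neutron_api_error.log
    CustomLog /var/log/httpd/neutron_api_access.log combined
</VirtualHost>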
participants (4)
- Brian Haley
- Erik Olof Gunnar Andersson
- Jean-Philippe Méthot
- Mathieu Gagné