[ops][neutron] After an upgrade to OpenStack Queens, Neutron is unable to communicate properly with RabbitMQ

Mathieu Gagné mgagne at calavera.ca
Mon Jun 10 17:25:43 UTC 2019


Hi,

On Mon, Jun 10, 2019 at 12:53 PM Jean-Philippe Méthot
<jp.methot at planethoster.info> wrote:
>
> Hi,
>
> Can you give me an idea of how you split the API from the server part? I’m guessing it has to do with pointing the API endpoint to a specific server, while keeping the Neutron configuration files pointing to the controller?
>
> Contrary to what I said on this thread last week, we’ve been plagued with this issue every 24 hours or so, needing to restart the controller nodes to restore stability. We did implement several of the tweaks that were suggested in this thread’s previous emails, but we are only now considering splitting the API from the main servers, as you did.
>

I followed this procedure to use mod_wsgi and updated DNS to point to
the new machine/IP:
https://docs.openstack.org/neutron/rocky/admin/config-wsgi.html#neutron-api-behind-mod-wsgi

You can run neutron-rpc-server if you want to remove the API part from
neutron-server.
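
For reference, a vhost along the lines of the sample in that document
looks roughly like the sketch below. Treat it as a sketch only; the WSGI
script path, worker counts, user and log locations are assumptions that
depend on your distro packaging:

  Listen 9696
  <VirtualHost *:9696>
      # neutron-api is the WSGI entry point shipped with neutron;
      # adjust the path if your packages install it elsewhere
      WSGIScriptAlias / /usr/bin/neutron-api
      WSGIDaemonProcess neutron-api processes=4 threads=1 user=neutron group=neutron
      WSGIProcessGroup neutron-api
      WSGIApplicationGroup %{GLOBAL}
      WSGIPassAuthorization On
      ErrorLog /var/log/neutron/neutron-api.log
  </VirtualHost>

With the API behind Apache, the RPC side can then be started on its own,
for example:

  neutron-rpc-server --config-file /etc/neutron/neutron.conf \
      --config-file /etc/neutron/plugins/ml2/ml2_conf.ini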

Mathieu

>
> Jean-Philippe Méthot
> Openstack system administrator
> Administrateur système Openstack
> PlanetHoster inc.
>
>
>
>
> On Jun 5, 2019, at 3:31 PM, Mathieu Gagné <mgagne at calavera.ca> wrote:
>
> Hi Jean-Philippe,
>
> On Wed, Jun 5, 2019 at 1:01 PM Jean-Philippe Méthot
> <jp.methot at planethoster.info> wrote:
>
>
> We had a Pike OpenStack setup that we updated to Queens earlier this week. It’s a 30-compute-node infrastructure with 2 controller nodes and 2 network nodes, using Open vSwitch for networking. Since we upgraded to Queens, neutron-server on the controller nodes has been unable to contact the openvswitch agents through RabbitMQ. RabbitMQ is clustered across both controller nodes and has been giving us the following error when neutron-server connections fail:
>
> =ERROR REPORT==== 5-Jun-2019::18:50:08 ===
> closing AMQP connection <0.23859.0> (10.30.0.11:53198 -> 10.30.0.11:5672 - neutron-server:1170:ccf11f31-2b3b-414e-ab19-5ee2cf5dd15d):
> missed heartbeats from client, timeout: 60s
>
> The neutron-server logs show this error:
>
> 2019-06-05 18:50:33.132 1169 ERROR oslo.messaging._drivers.impl_rabbit [req-17167988-c6f2-475e-8b6a-90b92777e03a - - - - -] [b7684919-c98b-402e-90c3-59a0b5eccd1f] AMQP server on controller1:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: error: [Errno 104] Connection reset by peer
> 2019-06-05 18:50:33.217 1169 ERROR oslo.messaging._drivers.impl_rabbit [-] [bd6900e0-ab7b-4139-920c-a456d7df023b] AMQP server on controller1:5672 is unreachable: <RecoverableConnectionError: unknown error>. Trying again in 1 seconds.: RecoverableConnectionError: <RecoverableConnectionError: unknown error>
>
> The relevant service version numbers are as follows:
> rabbitmq-server-3.6.5-1.el7.noarch
> openstack-neutron-12.0.6-1.el7.noarch
> python2-oslo-messaging-5.35.4-1.el7.noarch
>
> RabbitMQ does not show any alerts. It also has plenty of memory and a high enough file descriptor limit. The login user and credentials are fine, as they are used by other OpenStack services which can contact RabbitMQ without issues.
>
> I’ve tried optimizing RabbitMQ, upgrading, downgrading, increasing timeouts in Neutron services, etc., to no avail. I find myself at a loss and would appreciate it if anyone has any idea as to where to go from here.
>
>
> We had a very similar issue after upgrading to Neutron Queens. In
> fact, all Neutron agents were "down" according to the status API and
> messages weren't getting through. IIRC, this only happened in regions
> that had more load than the others.
>
> We applied a bunch of fixes which I suspect are really just band-aids.
>
> Here are the changes we made (a combined config sketch follows the list):
> * Split neutron-api from neutron-server. Create a whole new controller
> running neutron-api with mod_wsgi.
> * Increase [database]/max_overflow = 200
> * Disable RabbitMQ heartbeat in oslo.messaging:
> [oslo_messaging_rabbit]/heartbeat_timeout_threshold = 0
> * Increase [agent]/report_interval = 120
> * Increase [DEFAULT]/agent_down_time = 600
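>
> Put together, that translates to a neutron.conf snippet roughly like the
> one below. The values are the ones listed above; note that
> report_interval is honored by the agents, while agent_down_time is read
> by neutron-server:
>
>   [DEFAULT]
>   agent_down_time = 600
>
>   [agent]
>   report_interval = 120
>
>   [database]
>   max_overflow = 200
>
>   [oslo_messaging_rabbit]
>   # 0 disables the oslo.messaging heartbeat entirely
>   heartbeat_timeout_threshold = 0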
>
> We also have these sysctl settings because a firewall was dropping idle
> sessions, but those have been on the servers forever:
> net.ipv4.tcp_keepalive_time = 30
> net.ipv4.tcp_keepalive_intvl = 1
> net.ipv4.tcp_keepalive_probes = 5
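>
> For what it's worth, values like these usually live in a drop-in file
> under /etc/sysctl.d/ (the filename below is just an example) and can be
> applied without a reboot:
>
>   sysctl -p /etc/sysctl.d/99-tcp-keepalive.conf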
>
> We never figured out why a service that was working fine before the
> upgrade no longer does.
> This is kind of frustrating, as it caused us all sorts of intermittent
> issues and stress during our upgrade.
>
> Hope this helps.
>
> --
> Mathieu
>
>


