[openstack-dev] [nova] expected behaviour of _report_state() on rabbitmq failover

Chris Friesen chris.friesen at windriver.com
Wed Sep 10 20:13:55 UTC 2014


Hi,

I'm running Havana and I'm seeing some less-than-ideal behaviour during rabbitmq failover.  I'd like to figure out whether this is expected behaviour or whether something is going wrong.

We're running rabbitmq in active/standby mode with DRBD storage.  On the controller side, the failover timeline looks like this:

07:25:38 rabbit starts going inactive on current active controller
07:25:47 rabbit fully disabled
07:25:50 rabbit starts going active on other controller
07:25:53 rabbit fully enabled

On the compute node, I've included the nova-compute logs below.  We can see that the "model server went away" log doesn't come out until long after the connection was reset, and then the "Recovered model server connection!" log comes out almost immediately afterwards.  In a controlled shutdown case like this, would we expect the RPC "call" response to come back successfully?  If so, why didn't it?  If not, wouldn't it make sense to fail the call immediately rather than wait for it to time out?
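
For reference, here's my rough reading of the db servicegroup driver's _report_state(), based on the traceback further down (a simplified standalone sketch, not the actual nova code; the real service_update() is an RPC "call" to nova-conductor).  The point is that the driver only notices the outage when that blocking call finally raises, and only logs the recovery on the next successful call:

import logging

logging.basicConfig()
LOG = logging.getLogger(__name__)


class Timeout(Exception):
    """Stand-in for the RPC Timeout raised after rpc_response_timeout."""


def service_update(service_ref, state):
    # Placeholder for conductor_api.service_update().  During the failover
    # window the real RPC "call" blocks for the full timeout, then raises.
    raise Timeout('Timeout while waiting on RPC response')


model_disconnected = False


def _report_state(service_ref):
    global model_disconnected
    try:
        service_update(service_ref,
                       {'report_count': service_ref['report_count'] + 1})
        if model_disconnected:
            model_disconnected = False
            LOG.error('Recovered model server connection!')
    except Exception:
        if not model_disconnected:
            model_disconnected = True
            LOG.exception('model server went away')


_report_state({'report_count': 0})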

As it stands, waiting for the RPC call to time out blocks _report_state() from running again after report_interval seconds, which delays the next service update until the RPC timeout expires (hence the "task run outlasted interval by 50.016114 sec" warning below).
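
To illustrate (a toy standalone sketch, not the actual loopingcall code; I'm assuming the default report_interval=10 and rpc_response_timeout=60, which lines up with the 50-second overrun in the warning below):

import time

REPORT_INTERVAL = 10   # assumed default report_interval
RPC_TIMEOUT = 60       # assumed default rpc_response_timeout


def report_state(rabbit_up):
    if not rabbit_up:
        # No reply ever arrives, so the RPC "call" sits here until the
        # timeout fires and then raises.
        time.sleep(RPC_TIMEOUT)
        raise Exception('Timeout while waiting on RPC response')


def fixed_interval_loop(iterations=2):
    # Mimics how a fixed-interval periodic task can't be rescheduled
    # until the blocked task function returns.
    for _ in range(iterations):
        start = time.time()
        try:
            report_state(rabbit_up=False)
        except Exception:
            pass
        elapsed = time.time() - start
        if elapsed > REPORT_INTERVAL:
            print('task run outlasted interval by %.6f sec'
                  % (elapsed - REPORT_INTERVAL))
        time.sleep(max(0, REPORT_INTERVAL - elapsed))


fixed_interval_loop()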



2014-08-18 07:25:46.091 16126 ERROR nova.openstack.common.rpc.common [-] Failed to consume message from queue: [Errno 104] Connection reset by peer
2014-08-18 07:25:46.092 16126 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-08-18 07:25:46.113 16126 ERROR nova.openstack.common.rpc.common [-] Failed to consume message from queue: [Errno 104] Connection reset by peer
2014-08-18 07:25:46.113 16126 TRACE nova.openstack.common.rpc.common error: [Errno 104] Connection reset by peer
2014-08-18 07:25:46.113 16126 TRACE nova.openstack.common.rpc.common 
2014-08-18 07:25:46.114 16126 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-08-18 07:25:46.132 16126 ERROR nova.openstack.common.rpc.common [-] AMQP server on 192.168.204.2:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2014-08-18 07:25:46.132 16126 ERROR nova.openstack.common.rpc.common [-] AMQP server on 192.168.204.2:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2014-08-18 07:25:47.134 16126 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-08-18 07:25:47.140 16126 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-08-18 07:25:47.146 16126 ERROR nova.openstack.common.rpc.common [-] AMQP server on 192.168.204.2:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 3 seconds.
2014-08-18 07:25:47.147 16126 ERROR nova.openstack.common.rpc.common [-] AMQP server on 192.168.204.2:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 3 seconds.
2014-08-18 07:25:50.148 16126 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-08-18 07:25:50.154 16126 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-08-18 07:25:50.160 16126 ERROR nova.openstack.common.rpc.common [-] AMQP server on 192.168.204.2:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 5 seconds.
2014-08-18 07:25:50.161 16126 ERROR nova.openstack.common.rpc.common [-] AMQP server on 192.168.204.2:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 5 seconds.
2014-08-18 07:25:55.161 16126 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-08-18 07:25:55.167 16126 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-08-18 07:25:55.186 16126 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-08-18 07:25:55.190 16126 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-08-18 07:26:15.793 16126 ERROR nova.openstack.common.rpc.common [-] Failed to publish message to topic 'conductor': [Errno 104] Connection reset by peer
2014-08-18 07:26:15.793 16126 TRACE nova.openstack.common.rpc.common error: [Errno 104] Connection reset by peer
2014-08-18 07:26:15.793 16126 TRACE nova.openstack.common.rpc.common 
2014-08-18 07:26:15.795 16126 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2014-08-18 07:26:15.803 16126 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-08-18 07:26:15.812 16126 AUDIT nova.compute.resource_tracker [-] Auditing locally available compute resources
2014-08-18 07:26:16.101 16126 AUDIT nova.compute.resource_tracker [-] Free ram (MB): 25812, per-node: [11318, 14794], numa nodes:2
2014-08-18 07:26:16.101 16126 AUDIT nova.compute.resource_tracker [-] Free disk (GB): 36
2014-08-18 07:26:16.101 16126 AUDIT nova.compute.resource_tracker [-] Free vcpus: 289, free per-node float vcpus: [112, 160], free per-node pinned vcpus: [7, 10]
2014-08-18 07:26:16.102 16126 INFO nova.compute.resource_tracker [-] DETAIL: pcpus:17, Free vcpus:289, 16.0x overcommit, per-cpu float cpulist: [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
2014-08-18 07:26:16.125 16126 INFO nova.compute.resource_tracker [-] Compute_service record updated for compute-0:compute-0
2014-08-18 07:26:41.764 16126 ERROR nova.servicegroup.drivers.db [-] model server went away
2014-08-18 07:26:41.764 16126 TRACE nova.servicegroup.drivers.db Traceback (most recent call last):
2014-08-18 07:26:41.764 16126 TRACE nova.servicegroup.drivers.db   File "./usr/lib64/python2.7/site-packages/nova/servicegroup/drivers/db.py", line 107, in _report_state
2014-08-18 07:26:41.764 16126 TRACE nova.servicegroup.drivers.db   File "./usr/lib64/python2.7/site-packages/nova/conductor/api.py", line 269, in service_update
2014-08-18 07:26:41.764 16126 TRACE nova.servicegroup.drivers.db   File "./usr/lib64/python2.7/site-packages/nova/conductor/rpcapi.py", line 397, in service_update
2014-08-18 07:26:41.764 16126 TRACE nova.servicegroup.drivers.db   File "./usr/lib64/python2.7/site-packages/nova/rpcclient.py", line 85, in call
2014-08-18 07:26:41.764 16126 TRACE nova.servicegroup.drivers.db   File "./usr/lib64/python2.7/site-packages/nova/rpcclient.py", line 63, in _invoke
2014-08-18 07:26:41.764 16126 TRACE nova.servicegroup.drivers.db   File "./usr/lib64/python2.7/site-packages/nova/openstack/common/rpc/proxy.py", line 131, in call
2014-08-18 07:26:41.764 16126 TRACE nova.servicegroup.drivers.db Timeout: Timeout while waiting on RPC response - topic: "conductor", RPC method: "service_update" info: "<unknown>"
2014-08-18 07:26:41.764 16126 TRACE nova.servicegroup.drivers.db 
2014-08-18 07:26:41.766 16126 WARNING nova.openstack.common.loopingcall [-] task run outlasted interval by 50.016114 sec
2014-08-18 07:26:41.774 16126 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-08-18 07:26:41.784 16126 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.204.2:5672
2014-08-18 07:26:41.794 16126 ERROR nova.servicegroup.drivers.db [-] Recovered model server connection!


Thanks,
Chris


