[openstack-dev] [Oslo] [Oslo.messaging] RPC failover handling in rabbitmq driver

Gordon Sim gsim at redhat.com
Mon Jul 28 09:58:02 UTC 2014


On 07/28/2014 09:20 AM, Bogdan Dobrelya wrote:
> Hello.
> I'd like to bring your attention to a major RPC failover issue in
> impl_rabbit.py [0]. There are several *related* patches and a number of
> concerns that should be considered as well:
> - Passive exchanges fix [1] (looks like the problem is much deeper than
> it seems though).
> - The first version of the fix [2], which makes the producer declare a
> queue and bind it to the exchange, just as the consumer does.
> - Making all reply_* queues involved in RPC durable in order to preserve
> them in RabbitMQ after failover (there could be a TTL for such queues
> as well).
> - RPC throughput tuning patch [3]
>
> I believe the issue [0] should be at least prioritized and assigned to
> some milestone.

I think the real issue is the lack of clarity around what guarantees are 
made by the API.

Is it the case that an RPC call should never fail (i.e. never time out) 
due to failover? Either way, the answer to this should be very clear.

If failures may occur, then the calling code needs to handle that. If 
eliminating failures is part of the 'contract' then the library should 
have a clear strategy for ensuring (and testing) this.

Another possible scenario is that the connection is lost immediately 
after writing the request message to the socket (but before it is 
processed by the rabbit broker). In this case the issue is that the 
request is not confirmed by the broker, so the send can complete before 
the message is 'safe'. In other words, requests are unreliable.

My own view is that if you want to avoid timeouts on failover, the best 
approach is to have oslo.messaging retry the entire request regardless 
of the point it had reached in the previous attempt. I.e. rather than 
trying to make delivery of responses reliable, assume that both requests 
and responses are unreliable and re-issue the request immediately on 
failover. (The retry logic could even be made independent of any driver 
if desired.)
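
As a rough illustration of driver-independent retry, here is a minimal 
sketch. It is not oslo.messaging code: RequestTimeout, call_with_retry 
and send_request are hypothetical names invented for this example, and 
the sketch assumes the transport signals a lost request or reply by 
raising a timeout error.

```python
import time


class RequestTimeout(Exception):
    """Hypothetical stand-in for the transport's timeout error."""


def call_with_retry(send_request, max_attempts=3, delay=0.0, sleep=time.sleep):
    """Re-issue the whole request until a reply arrives or attempts run out.

    send_request is a hypothetical zero-argument callable that sends the
    request and returns the reply, or raises RequestTimeout when no reply
    arrives (for example because the broker failed over mid-call).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Each attempt starts from scratch; no state from the previous
            # attempt (queues, partial replies) is assumed to survive.
            return send_request()
        except RequestTimeout as exc:
            last_error = exc
            # Optionally pause before retrying, but not after the last try.
            if attempt < max_attempts and delay:
                sleep(delay)
    # Every attempt timed out: surface the failure to the caller.
    raise last_error
```

Note that with this approach a request may be processed more than once 
(when the request got through but the reply was lost), so it only suits 
operations that tolerate being repeated.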

This is perhaps a bigger change, but I think it is easier to get 
right and will also be more scalable and performant, since it doesn't 
require replication of every queue and every message.


>
> [0] https://bugs.launchpad.net/oslo.messaging/+bug/1338732
> [1] https://review.openstack.org/#/c/109373/
> [2]
> https://github.com/noelbk/oslo.messaging/commit/960fc26ff050ca3073ad90eccbef1ca95712e82e
> [3] https://review.openstack.org/#/c/109143/
>



