[openstack-dev] [Oslo] [Oslo.messaging] RPC failover handling in rabbitmq driver

Ken Giusti kgiusti at gmail.com
Mon Jul 28 15:49:22 UTC 2014


On Mon, 28 Jul 2014 10:58:02 +0100, Gordon Sim wrote:
> On 07/28/2014 09:20 AM, Bogdan Dobrelya wrote:
> > Hello.
> > I'd like to bring your attention to major RPC failover issue in
> > impl_rabbit.py [0]. There are several *related* patches and a number of
> > concerns should be considered as well:
> > - Passive exchanges fix [1] (looks like the problem is much deeper than
> > it seems though).
> > - the first version of the fix [2], which makes the producer declare a
> > queue and bind it to the exchange, just as the consumer does.
> > - Making all reply_* queues involved in RPC durable in order to
> > preserve them in RabbitMQ after failover (there could be a TTL for
> > such queues as well)
> > - RPC throughput tuning patch [3]
> >
> > I believe the issue [0] should be at least prioritized and assigned to
> > some milestone.
>
> I think the real issue is the lack of clarity around what guarantees are
> made by the API.
>

Wholeheartedly agree!  This lack of explicitness makes it very
difficult to add new messaging backends (drivers) to oslo.messaging
and expect the API to function uniformly from the application's point
of view.  The end result is that oslo.messaging's API behavior is
somewhat implicitly defined by the characteristics of the RPC backend
(broker), rather than by oslo.messaging itself.

In other words: we need to solve this problem in general, not just for
the rabbit driver.
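
For reference, the producer-side declare/bind and durable reply_* queue
ideas from the list above would look roughly like the kombu sketch
below.  The exchange/queue names and the expiry value are invented for
illustration; this is not the actual impl_rabbit code.

from kombu import Connection, Exchange, Queue

# Names and expiry below are made up; just a sketch of "the producer
# also declares and binds a durable reply queue, with a queue TTL".
reply_exchange = Exchange('demo_reply_exchange', type='direct', durable=True)
reply_queue = Queue(
    'reply_demo_1234',
    exchange=reply_exchange,
    routing_key='reply_demo_1234',
    durable=True,                           # survives a broker restart
    queue_arguments={'x-expires': 600000},  # reap the queue after 10 min idle
)

with Connection('amqp://guest:guest@localhost//') as conn:
    # Declaring from the producer side as well means the reply queue
    # exists even if the consumer has not re-established its
    # declarations after a failover.
    reply_queue(conn.channel()).declare()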

> Is it the case that an RPC call should never fail (i.e. never time out)
> due to failover? Either way, the answer to this should be very clear.
>
> If failures may occur, then the calling code needs to handle that. If
> eliminating failures is part of the 'contract' then the library should
> have a clear strategy for ensuring (and testing) this.
>
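
If failures can surface to the caller, it is also worth spelling out
what "handle that" looks like in application code.  A minimal sketch
using the existing public API (the topic, method, and argument names
are invented; RPCClient, call() and MessagingTimeout are real
oslo.messaging interfaces):

from oslo.config import cfg
from oslo import messaging

transport = messaging.get_transport(cfg.CONF)
target = messaging.Target(topic='demo_topic')   # hypothetical topic
client = messaging.RPCClient(transport, target, timeout=30)

def get_status(ctxt, server_id):
    try:
        return client.call(ctxt, 'get_status', server_id=server_id)
    except messaging.MessagingTimeout:
        # After a failover the caller cannot tell whether the request
        # was processed before the connection was lost, so only retry
        # operations that are safe to repeat.
        return client.call(ctxt, 'get_status', server_id=server_id)
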
> Another possible scenario is that the connection is lost immediately
> after writing the request message to the socket (but before it is
> processed by the rabbit broker). In this case the issue is that the
> request is not confirmed, so it can complete before it is 'safe'. In
> other words requests are unreliable.
>
> My own view is that if you want to avoid time outs on failover, the best
> approach is to have oslo.messaging retry the entire request regardless
> of the point it had reached in the previous attempt. I.e. rather than
> trying to make delivery of responses reliable, assume that both requests
> and responses are unreliable and re-issue the request immediately on
> failover.

I like this suggestion. By assuming limited reliability from the
underlying messaging system, we reduce oslo.messaging's reliance on
features provided by any particular messaging implementation
(driver/broker).

> (The retry logic could even be made independent of any driver
> if desired).

Exactly!  Having all QoS-related code outside of the drivers would
guarantee that the behavior of the API is _uniform_ across all
drivers.
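
As a rough sketch of what such driver-independent retry logic could
look like (the send callable and the exception are stand-ins here, not
the actual oslo.messaging internals):

import time

class MessagingTimeout(Exception):
    pass  # stand-in for the RPC layer's timeout exception

def call_with_retry(send, request, timeout=30, attempts=3):
    # Re-issue the entire request if an attempt fails or times out.
    # This assumes requests are idempotent (or carry a request id the
    # server can use to de-duplicate), since after a failover the first
    # attempt may in fact have been processed.
    last_exc = None
    for attempt in range(attempts):
        try:
            return send(request, timeout=timeout)
        except MessagingTimeout as exc:
            last_exc = exc
            time.sleep(min(2 ** attempt, 10))  # back off before re-issuing
    raise last_exc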

>
> This is perhaps a bigger change, but I think it is easier to get
> right and will also be more scalable and performant, since it doesn't
> require replication of every queue and every message.
>
>
> >
> > [0] https://bugs.launchpad.net/oslo.messaging/+bug/1338732
> > [1] https://review.openstack.org/#/c/109373/
> > [2] https://github.com/noelbk/oslo.messaging/commit/960fc26ff050ca3073ad90eccbef1ca95712e82e
> > [3] https://review.openstack.org/#/c/109143/
>


-- 
Ken Giusti  (kgiusti at gmail.com)


