[openstack-dev] [oslo.messaging][zeromq] Next step

Bogdan Dobrelya bdobrelia at mirantis.com
Wed Jul 8 15:23:18 UTC 2015

>> On 6/12/15, 3:55 PM, "Clint Byrum" <cl... at fewbar.com> wrote:
>> >
>> >I think you missed "it is not tested in the gate" as a root cause for
>> >some of the ambiguity. Anecdotes and bug reports are super important for
>> >knowing where to invest next, but a test suite would at least establish a
>> >base line and prevent the sort of thrashing and confusion that comes from
>> >such a diverse community of users feeding bug reports into the system.
>> I agree with you that zmq needs to pass whatever oslo messaging tests are
>> currently available; however, this won't remove all the
>> semantic/behavioral ambiguities.
>> These kinds of ambiguities could be fixed by enhancing the API documentation
>> - always good to do even if a bit late - and by developing the associated
>> test cases (although they tend to be harder to write).
>> Another (ugly) strategy could be to simply say that the intended behavior
>> is the one exposed by the rabbitMQ based implementation (by means of
>> seniority and/or actual deployment mileage).
>> For example, what happens if a recipient of a CALL or CAST message dies
>> before the message is sent?
>> Is the API supposed to return an error and, if yes, how quickly? The
>> RabbitMQ based implementation will
>> likely return a success (since the message will sit in a queue in the
>> broker until the consumer reconnects - which could be a long time) while
>> the ZMQ based one will depend on the type of pattern used. Which is the behavior
>> desired by apps and which is the behavior "advertised" by the oslo
>> messaging API?
>> Another example relates to flow control conditions (sender sends lots of
>> CAST, receiver very slow to consume). Should the sender
>> - always receive success and all messages will be queued without limit,
>> - always receive success and all messages will be queued up to a certain
>> point and new messages will be dropped silently
>> - or receive an EAGAIN error (socket behavior)?
>> In these unclear conditions, switching to a different transport driver is
>> going to be tricky because apps may have been written/patched to assume a
>> certain behavior and might no longer behave properly if the expected
>> behavior changes (even if it is for the better) and may require adjusting
>> existing apps (to support a different behavior of the API).
>> Note that "switching to a different transport" is not just about testing
>> it in devstack but also about deploying it at scale in a real production
>> environment and testing at scale.
> Alec, you bring up fantastic and important points above.
> However, let's stay on track. We're not even testing to see if nova-api
> can talk to nova-conductor via the current zmq driver. It's entirely
> possible it simply does not work for any number of reasons.
> A devstack-gate job is the _minimum_ needed.
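
To make the three flow-control outcomes listed in the quoted message
concrete, here is a minimal sketch using only the Python standard library.
The Sender class and its policy names are hypothetical and invented for
illustration; this is not oslo.messaging code and does not describe how any
existing driver behaves.

```python
import errno
import queue


class Sender:
    """Toy sender modelling three possible CAST flow-control policies."""

    POLICIES = ("unbounded", "drop", "eagain")

    def __init__(self, policy, limit=3):
        if policy not in self.POLICIES:
            raise ValueError("unknown policy: %s" % policy)
        self.policy = policy
        # maxsize=0 means "no limit" for queue.Queue.
        self.backlog = queue.Queue(maxsize=0 if policy == "unbounded" else limit)
        self.dropped = 0

    def cast(self, msg):
        """Queue msg for a slow receiver; return True on (apparent) success."""
        try:
            self.backlog.put_nowait(msg)
        except queue.Full:
            if self.policy == "drop":
                # The caller still sees success; the message is silently lost.
                self.dropped += 1
            else:
                # "eagain": surface back-pressure to the caller.
                raise BlockingIOError(errno.EAGAIN, "receiver is not keeping up")
        return True
```

An application written against the "unbounded" or "drop" behavior never has
to handle an exception from cast(), which is why moving it to a transport
that behaves like "eagain" would need code changes on the application side.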

I believe the next steps can be summarized as the following:

1) Get the existing zeromq driver tested in the gate in order to save it
from deprecation and removal.

2) Pin down the architecture decisions for the new driver more precisely, like:
- synchronous blocking REQ/REP or async DEALER/ROUTER for CALLs
- at-least-once delivery (confirms after processing) or at-most-once
(confirms before processing)
- do we want fault tolerant CALL and/or CAST, NOTIFY (AFAIK, zeromq
supports HA only for REQ/REP, ROUTER/DEALER)
- idempotent/commutative operations or ordered and non-idempotent
- event based notifications on changes in the numbers of clients and
servers (affects the discovery service very much)
- routing proxies and discovery service backends
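
The at-least-once versus at-most-once choice in the list above can be
illustrated with a small simulation. This is only a sketch under stated
assumptions: deliver(), flaky() and Crash are hypothetical names made up for
this example, the "server" is a plain function, and nothing here uses zeromq
or oslo.messaging.

```python
class Crash(Exception):
    """Simulated server failure while handling a request."""


def deliver(requests, handler, ack_before_processing):
    """Feed requests to handler and return (processed, lost, attempts).

    ack_before_processing=True models at-most-once delivery: the request is
    confirmed first, so a crash during processing loses it.
    ack_before_processing=False models at-least-once delivery: a request
    that was never confirmed is redelivered later.
    """
    pending = list(requests)
    processed, lost, attempts = [], [], {}
    while pending:
        msg = pending.pop(0)
        attempts[msg] = attempts.get(msg, 0) + 1
        try:
            result = handler(msg, attempts[msg])
        except Crash:
            if ack_before_processing:
                lost.append(msg)      # already confirmed, nobody retries
            else:
                pending.append(msg)   # never confirmed, so redeliver
            continue
        processed.append(result)
    return processed, lost, attempts


def flaky(msg, attempt):
    """Example handler that crashes the first time it sees "b"."""
    if msg == "b" and attempt == 1:
        raise Crash(msg)
    return msg
```

With these definitions, deliver(["a", "b", "c"], flaky, True) processes "a"
and "c" and loses "b", while deliver(["a", "b", "c"], flaky, False)
processes all three but handles "b" twice and finishes it after "c". That
reordering and repetition is why the idempotent/commutative question in the
same list cannot be decided separately from the delivery guarantee.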

3) Address all of the ambiguities of the API documentation in order to
keep messaging library developers and apps developers "on the same
page". This is a vital step, as the new driver, as well as the existing
ones, has to rely on well known and clearly described expectations and
behave appropriately.

To quote:
"which is the behavior desired by apps" and "which is the behavior
advertised by the oslo messaging API". For example:
"what happens if a recipient of a CALL or CAST message dies
before the message is sent. Is the API supposed to return an error and
if yes how quickly?"
I believe this also applies to:
- what happens if a server sent a receive confirmation to a client and
crashed later while processing the request (at-most-once delivery is
assumed here)
- what happens if a server received duplicate requests from client(s)
- what happens if a client has never received a reply from the server
- what happens if a client died right after it received a reply from a
server
- what happens if a request or a reply failed to be delivered by the
underlying AMQP driver
- what happens if AMQP utilization is too high at either the client or
server side
...and probably to many of the other tricky cases as well.
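
As one possible answer to the "duplicate requests" case above, a server
could remember the replies it has already produced, keyed by a request
identifier. The sketch below assumes every request carries a unique
request_id chosen by the client; DedupServer and on_request() are
hypothetical names, not part of oslo.messaging, and the reply cache is left
unbounded for brevity.

```python
class DedupServer:
    """Toy server that runs the handler at most once per request id."""

    def __init__(self, handler):
        self.handler = handler
        self.replies = {}   # request_id -> reply already sent
        self.calls = 0      # how many times the handler really ran

    def on_request(self, request_id, payload):
        # A retransmitted request gets the cached reply, so a
        # non-idempotent handler is not executed a second time.
        if request_id in self.replies:
            return self.replies[request_id]
        self.calls += 1
        reply = self.handler(payload)
        self.replies[request_id] = reply
        return reply
```

This only covers duplicates that reach the same server process; a crash
between running the handler and storing the reply would still need one of
the delivery guarantees discussed in item 2.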

Let's brainstorm and update the driver specification and the API
documentation.

Best regards,
Bogdan Dobrelya,
Irc #bogdando
