[openstack-dev] [oslo.messaging][zeromq] Next step

Alec Hothan (ahothan) ahothan at cisco.com
Tue Jul 14 16:59:21 UTC 2015


On 7/8/15, 8:23 AM, "Bogdan Dobrelya" <bdobrelia at mirantis.com> wrote:

>>> On 6/12/15, 3:55 PM, "Clint Byrum" <cl... at fewbar.com> wrote:
>>> >
>>> >> 
>>> >
>>> >I think you missed "it is not tested in the gate" as a root cause for
>>> >some of the ambiguity. Anecdotes and bug reports are super important for
>>> >knowing where to invest next, but a test suite would at least establish a
>>> >base line and prevent the sort of thrashing and confusion that comes from
>>> >such a diverse community of users feeding bug reports into the system.
>>> I agree with you that zmq needs to pass whatever oslo messaging test is
>>> currently available; however, this won't remove all the
>>> semantic/behavioral ambiguities.
>>> These kinds of ambiguities could be fixed by enhancing the API
>>> documentation - always good to do even if a bit late - and by developing
>>> the test cases (although they tend to be harder to write).
>>> Another (ugly) strategy could be to simply say that the intended behavior
>>> is the one exposed by the rabbitMQ based implementation (by means of
>>> seniority and/or actual deployment mileage).
>>> For example, what happens if a recipient of a CALL or CAST message dies
>>> before the message is sent?
>>> Is the API supposed to return an error and if yes how quickly? The
>>> RabbitMQ based implementation will
>>> likely return a success (since the message will sit in a queue in the
>>> broker until the consumer reconnects - which could be a long time).
>>> The ZMQ based one will depend on the type of pattern used. Which is the
>>> behavior desired by apps and which is the behavior "advertised" by the
>>> oslo messaging API?
>>> Another example relates to flow control conditions (sender sends lots of
>>> CAST, receiver very slow to consume). Should the sender
>>> - always receive success and all messages will be queued without limit,
>>> - always receive success and all messages will be queued up to a
>>> point and new messages will be dropped silently
>>> - or receive an EAGAIN error (socket behavior)?
>>> In these unclear conditions, switching to a different transport driver is
>>> going to be tricky because apps may have been written/patched to assume a
>>> certain behavior and might no longer behave properly if the expected
>>> behavior changes (even if it is for the better) and may require changes to
>>> existing apps (to support a different behavior of the API).
>>> Note that "switching to a different transport" is not just about testing
>>> it in devstack but also about deploying it at scale on real production
>>> environment and testing at scale.
>> Alec, you bring up fantastic and important points above.
>> However, let's stay on track. We're not even testing to see if nova-api
>> can talk to nova-conductor via the current zmq driver. It's entirely
>> possible it simply does not work for any number of reasons.
>> A devstack-gate job is the _minimum_ needed.
>I believe the next steps can be summarized as the following:
>1) Make existing zeromq driver tested in the gate in order to save it
>from deprecation and removal.

I believe Oleksii is already working on it.

>2) Think of the new driver architecture decisions more precisely, like:
>- synchronous blocking REQ/REP or async DEALER/ROUTER for CALLs
>- at-least-once delivery (confirms after processing) or at-most-once
>(confirms before processing)
>- do we want fault tolerant CALL and/or CAST, NOTIFY (AFAIK, zeromq
>supports HA only for REQ/REP, ROUTER/DEALER)
>- idempotent/commutative operations or ordered and non-idempotent

On all above I believe it is best to keep oslo messaging simple and
predictable, then have apps deal with any retry logic as it is really app
specific.
Auto retries in oslo messaging can cause confusion with possible
duplicates which could be really bad if the messages are not idempotent.
I think trying to make oslo messaging a complex communication API is not
realistic with the few resources available.
It is much better to have something simple that works well (even that is
not easy as we can see) than something complex that has lots of issues.
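
To make the duplicate concern concrete, here is a minimal sketch in plain
Python (no oslo.messaging involved). All names (Account, LossyReplyTransport,
call_with_auto_retry) are hypothetical and only illustrate the failure mode:
the request is processed, the reply is lost, and a transparent retry applies a
non-idempotent operation twice.

```python
class Account:
    """Toy receiver whose 'debit' handler is not idempotent."""

    def __init__(self, balance):
        self.balance = balance
        self.calls = 0

    def debit(self, amount):
        self.calls += 1
        self.balance -= amount
        return self.balance


class LossyReplyTransport:
    """Hypothetical transport: delivers every request, drops the first reply."""

    def __init__(self, handler):
        self.handler = handler
        self.replies_to_drop = 1

    def call(self, *args):
        result = self.handler(*args)      # the request IS processed
        if self.replies_to_drop > 0:
            self.replies_to_drop -= 1
            raise TimeoutError("reply lost")
        return result


def call_with_auto_retry(transport, retries, *args):
    """What a transparent retry inside the messaging layer would do."""
    for attempt in range(retries + 1):
        try:
            return transport.call(*args)
        except TimeoutError:
            if attempt == retries:
                raise


account = Account(balance=100)
transport = LossyReplyTransport(account.debit)
final_balance = call_with_auto_retry(transport, 2, 30)
# The caller asked for one debit of 30, but the handler ran twice.
```

The caller sees a single successful call while the receiver has been debited
twice, which is why retry decisions are arguably better left to the app.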

>- event based notifications on changes in the numbers of clients and
>servers (affects the discovery service very much)
>- routing proxies and discovery service backends

Yes I'd like to help on that part.

>3) Address all of the ambiguities of the API documentation in order to
>keep messaging library developers and apps developers "on the same
>page". This is a vital step as the new driver -as well as the existing
>ones- have to rely on well known and clearly described expectations and
>behave appropriately.

I'm glad to see more people converging on this shortcoming and the need to
do something.

As I said above, I would keep the oslo messaging API straight and simple
and predictable.
The issue with that is it may make the AMQP driver non compliant as it may
be doing too much already but we can try to work it out.
We should avoid app code having to behave differently (with if/else
based on the driver or driver specific plugins) but maybe that will not be
entirely avoidable.

I'll give a short answer to all those great questions below, in the event
we decide to go the simple API behavior route:

>A cite:
>"which is the behavior desired by apps" and "which is the behavior
>advertised by the oslo messaging API". Like:
>"what happens if a recipient of a CALL or CAST message dies
>before the message is sent. Is the API supposed to return an error and
>if yes how quickly?"

Yes, the sender is notified about the error condition (asynchronously in the
case of async APIs) as quickly as possible, and the app is in charge of
remediating any possible loss of messages (this basically reflects how
tcp or zmq unicast behaves).

RabbitMQ would not comply because it would try to deliver the message
regardless without telling the sender (until at some point it may give up
entirely and drop the message silently or it may try to resend forever) or
there exist some use cases where the message is lost (and the sender not
notified).
The ZMQ driver is simpler because it would just reflect what the ZMQ/unicast
library does.
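
The contrast between the two behaviors can be sketched as follows. This is a
toy model, not the actual drivers: BrokeredTransport, DirectTransport,
ReceiverUnavailable and send_or_remediate are hypothetical names chosen for
illustration. The brokered variant always reports success and holds the
message; the direct variant fails fast and leaves remediation to the app.

```python
from collections import deque


class ReceiverUnavailable(Exception):
    """Raised when no receiver is reachable at send time."""


class BrokeredTransport:
    """Hypothetical broker-style transport: cast() always succeeds."""

    def __init__(self):
        self.queue = deque()

    def cast(self, message):
        self.queue.append(message)   # held until some consumer shows up
        return True


class DirectTransport:
    """Hypothetical unicast-style transport: cast() fails fast."""

    def __init__(self):
        self.receiver = None         # callable, or None when the peer is down

    def cast(self, message):
        if self.receiver is None:
            raise ReceiverUnavailable(message)
        self.receiver(message)
        return True


def send_or_remediate(transport, message, failed):
    """Application-side policy: record what could not be sent."""
    try:
        return transport.cast(message)
    except ReceiverUnavailable:
        failed.append(message)
        return False


failed = []
brokered_ok = send_or_remediate(BrokeredTransport(), "reboot", failed)
direct_ok = send_or_remediate(DirectTransport(), "reboot", failed)
```

With the receiver down in both cases, only the direct variant tells the
sender, so only there can the app react right away.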

>I believe this also applies to:
>- what happens if a server sent a receive confirmation to a client and
>crashed later while processing the request (at-most-once delivery is
>assumed here)

That would be an app bug. You don't want to send an ack before the work is
done and committed.
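
A small sketch of why the ack ordering matters, under the assumption that
unacknowledged messages stay available for redelivery. The deque stands in for
whatever holds pending messages, and consume_one and crashing_handler are
hypothetical helpers, not part of any library.

```python
from collections import deque


def consume_one(pending, handler, ack_first):
    """Take one message from `pending`; return True if it was handled.

    Removing a message from `pending` plays the role of the ack.
    """
    message = pending[0]
    if ack_first:
        pending.popleft()            # ack before the work (at-most-once)
    try:
        handler(message)
    except RuntimeError:
        return False                 # simulated crash while processing
    if not ack_first:
        pending.popleft()            # ack after the work (at-least-once)
    return True


def crashing_handler(message):
    raise RuntimeError("server died while processing")


early = deque(["resize-vm"])
consume_one(early, crashing_handler, ack_first=True)

late = deque(["resize-vm"])
consume_one(late, crashing_handler, ack_first=False)
```

After the simulated crash, `early` is empty (the request is gone) while `late`
still holds the request for another attempt.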

>- what happens if a server received duplicating requests from client(s)

That could be an oslo messaging bug if we make sure we never allow
duplicates (that is, never retry unless you are sure the recipient has not
already received the message, or make sure filtering is done properly on the
receiving end to weed out duplicates).
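
Receiver-side filtering could look roughly like the sketch below. It assumes
every message carries a unique id, which is an assumption made here for
illustration; DuplicateFilter and deliver are hypothetical names. The memory
of seen ids is bounded, so a duplicate that arrives after its id has been
evicted would still get through.

```python
from collections import OrderedDict


class DuplicateFilter:
    """Remembers the most recent message ids and rejects repeats."""

    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.seen = OrderedDict()

    def is_new(self, msg_id):
        if msg_id in self.seen:
            return False
        self.seen[msg_id] = True
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)   # forget the oldest id
        return True


def deliver(messages, dedup, handler):
    """Pass each (msg_id, payload) pair to handler at most once."""
    for msg_id, payload in messages:
        if dedup.is_new(msg_id):
            handler(payload)


handled = []
incoming = [("a1", "start"), ("a2", "stop"), ("a1", "start")]
deliver(incoming, DuplicateFilter(), handled.append)
```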

>- what happens if a client has never received a reply from server

For CALL: use a timeout and let the app remediate
For CAST: let the app remediate
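
For the CALL case, a minimal sketch of "timeout in the library, policy in the
app" using only the standard library. The functions call, call_or_default and
the two toy servers are hypothetical, and a thread plus a queue stand in for
the real transport.

```python
import queue
import threading


class CallTimeout(Exception):
    """Raised when no reply arrives within the caller's deadline."""


def call(server, request, timeout):
    """Send `request` to `server` and wait at most `timeout` seconds."""
    reply_box = queue.Queue(maxsize=1)
    worker = threading.Thread(
        target=lambda: server(request, reply_box), daemon=True)
    worker.start()
    try:
        return reply_box.get(timeout=timeout)
    except queue.Empty:
        raise CallTimeout(request) from None


def echo_server(request, reply_box):
    reply_box.put("ok:" + request)


def silent_server(request, reply_box):
    pass                              # never replies


def call_or_default(server, request, default):
    """Application policy: the app, not the library, decides what to do."""
    try:
        return call(server, request, timeout=0.5)
    except CallTimeout:
        return default


answered = call_or_default(echo_server, "ping", "unreachable")
unanswered = call_or_default(silent_server, "ping", "unreachable")
```

The messaging layer only reports the timeout; whether to retry, fall back or
give up is decided in call_or_default, that is, in app code.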

>- what happens if a client died right after it received a reply from a
>server

(assuming CALL) in this case let the app handle this (in general apps will
have to do some sort of resync with the recipients on restart if they

>- what happens if a request or a reply failed to be delivered by the
>underlying AMQP driver
>- what happens if AMQP utilization is too high at either the client or
>server side

I'll leave that to AMQP experts. At the oslo messaging layer I'd try to
make it behave the same as using a tcp connection (if possible).
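
The three flow-control outcomes raised earlier in this thread (queue without
limit, drop silently past a limit, or report an EAGAIN-like error) can be
compared with a toy sketch. The function names are hypothetical and plain
Python containers stand in for the transport buffers; the third policy is the
one closest to socket behavior.

```python
import queue


def cast_unbounded(buffer, message):
    """Policy 1: always succeed, queue without limit."""
    buffer.append(message)
    return True


def cast_drop_silently(buffer, message, limit):
    """Policy 2: always report success, drop once the limit is reached."""
    if len(buffer) < limit:
        buffer.append(message)
    return True


def cast_fail_when_full(bounded, message):
    """Policy 3: report back-pressure, similar to EAGAIN on a socket."""
    try:
        bounded.put_nowait(message)
        return True
    except queue.Full:
        return False


unbounded, lossy, bounded = [], [], queue.Queue(maxsize=2)
results = []
for i in range(4):
    results.append((
        cast_unbounded(unbounded, i),
        cast_drop_silently(lossy, i, limit=2),
        cast_fail_when_full(bounded, i),
    ))
```

Only the third policy lets the sender notice that messages 2 and 3 were not
accepted; the second loses them while still reporting success.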

>...and probably to many of the other tricky cases as well.

For me the tricky part is the fanout case because it is not trivial to
implement properly in a way that scales to thousands of nodes and in a way
that users can actually code over it properly without unexpected missing
messages for joining subscribers. From what I have seen this part is
completely overlooked today by existing fanout users (we might be lucky that
fanout messages sort of work today, but that might be problematic as we
scale out on larger deployments).
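
The joining-subscriber problem can be shown with a toy fanout that keeps no
history. The Fanout class is hypothetical and much simpler than any real
driver; the point is only that a subscriber that joins late silently misses
what was published before it joined.

```python
class Fanout:
    """Toy fanout exchange with no history: late joiners see nothing old."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self):
        inbox = []
        self.subscribers.append(inbox)
        return inbox

    def publish(self, message):
        for inbox in self.subscribers:
            inbox.append(message)


fanout = Fanout()
early = fanout.subscribe()
fanout.publish("update-1")
late = fanout.subscribe()          # joins after the first publish
fanout.publish("update-2")
```

Here `late` never sees "update-1" and gets no indication that anything was
missed, so the app would need its own resync step on join.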

>Let's brainstorm and update the driver specification and the API

Best would be to have some working document that everybody can contribute
to. I think dims was proposing to create a new launchpad bug to track and use
an rst spec file with gerrit?



>Best regards,
>Bogdan Dobrelya,
>Irc #bogdando
