[openstack-dev] [oslo][mistral] Saga of process than ack and where can we go from here...

Deja, Dawid dawid.deja at intel.com
Fri Jun 3 11:25:20 UTC 2016

On Thu, 2016-05-05 at 11:08 +0700, Renat Akhmerov wrote:

On 05 May 2016, at 01:49, Mehdi Abaakouk <sileht at sileht.net> wrote:

On 2016-05-04 10:04, Renat Akhmerov wrote:
No problem. Let’s not call it RPC (btw, I completely agree with that).
But it’s one of the messaging patterns and hence should be under
oslo.messaging I guess, no?

Yes and no. We currently have two APIs (RPC and notification), and
personally I regret having the notification part in oslo.messaging.

RPC and Notification are different beasts, and both are limited today
in terms of features because they share the same driver implementation.

Our RPC error handling is really poor. For example, Nova just puts the
instance in ERROR when something bad occurs in the oslo.messaging layer.
This forces the deployer/user to fix the issue manually.

Our Notification system doesn't allow fine-grained routing of messages;
everything goes into one configured topic/queue.

And now we want to add a new one... I'm not against this idea,
but I'm not a huge fan.

Thoughts from folks (mistral and oslo)?
Also, I was not at the Summit. Should I conclude that the Tooz+taskflow approach (which ensures the idempotency of the application within the library API) has not been accepted by the Mistral folks?
Speaking about idempotency, IMO it’s not a central question that we
should be discussing here. Mistral users should have a choice: if they
manage to make their actions idempotent, that’s excellent, and in many
cases idempotency is certainly possible, btw. If not, then they know
about the potential consequences.

You shouldn't mix up the idempotency of the user task and the idempotency
of a Mistral action (which will in the end run the user task).
You can make your Mistral task runner implementation idempotent and just
make the workflow to use configurable for the case where the user task is
interrupted or finishes badly, whether or not the user task is idempotent.
This makes the thing very predictable. You will know for example:
* if the user task has started or not,
* if the error is due to a node power cut when the user task runs,
* if you can safely retry a non-idempotent user task on another node,
* you will not be impacted by rabbitmq restart or TCP connection issues,
* ...
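
As a rough illustration of the list above (this is not Mistral or taskflow code; the `State` names and `should_retry` are hypothetical), a runner that records how far a task got can make the retry decision explicit instead of guessing from a timeout:

```python
from enum import Enum


class State(Enum):
    """Hypothetical journal states recorded by the task runner."""
    QUEUED = "queued"      # accepted, user task not started yet
    STARTED = "started"    # user task started, outcome unknown
    FINISHED = "finished"  # user task completed


def should_retry(state, idempotent):
    """Decide whether a task can safely be re-run on another node."""
    if state is State.FINISHED:
        return False       # nothing left to do
    if state is State.QUEUED:
        return True        # never started, so a retry cannot duplicate work
    # STARTED: the node may have died mid-task; only idempotent tasks
    # can be re-run without risking a duplicated side effect.
    return idempotent
```

With this kind of record, a non-idempotent task that was interrupted after it started is reported as such rather than silently retried.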

With the oslo.messaging approach, everything will just end up in a
generic MessageTimeout error.

The RPC API already has this kind of issue. Applications have unfortunately
had to deal with that (and I think they want something better now).
I'm just not convinced we should add a new "working queue" API in
oslo.messaging for task scheduling that has the same issue we already
have with RPC.

Anyway, that's your choice. If you want to rely on this poor structure, I
will not be against it; I'm not involved in Mistral. I just want everybody
to be aware of this.

And even in this case there’s usually a number
of measures that can be taken to mitigate those consequences (rerunning
workflows from certain points after manually fixing problems, rollback
scenarios etc.).

taskflow lets you describe and automate this kind of workflow really easily.

What I’m saying is: let’s not make that crucial decision now about
what a messaging framework should support or not, let’s make it more
flexible to account for variety of different usage scenarios.

I think the confusion is in the "messaging" keyword. Currently oslo.messaging
is an "RPC" framework and a "Notification" framework on top of 'messaging'.

The messaging frameworks we use are 'kombu', 'pika', 'zmq' and 'pyngus'.

It’s normal for frameworks to give more rather than less.

I disagree. Here we mix different concepts into one library, and every
concept has to be implemented on each of the different 'messaging frameworks'.
So we fortunately give less, to make things just work the same way with all
drivers for all APIs.

One more thing: at the summit we were discussing the possibility of
defining at-most-once/at-least-once individually for Mistral tasks. This
is needed because there are cases where we need to do it; advanced users
may choose one or the other depending on the task/action semantics.
However, it won’t be possible to implement without changes in the
underlying messaging framework.

If we go that way, oslo.messaging users and Mistral users have to be aware
that their job/task/action/whatever will perhaps not be called (at-most-once)
or perhaps be called twice (at-least-once).
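
To make the two failure modes concrete, here is a toy simulation. It is only a sketch: `ToyBroker`, `Crash`, `make_handler` and `run` are hypothetical names, and nothing here is oslo.messaging API. The only assumption is a broker that redelivers messages that were fetched but never acknowledged.

```python
from collections import deque


class Crash(Exception):
    """Simulates a worker dying while handling a message."""


class ToyBroker:
    """Hypothetical in-memory broker that redelivers unacked messages."""

    def __init__(self, messages):
        self.ready = deque(messages)
        self.unacked = []

    def get(self):
        msg = self.ready.popleft()
        self.unacked.append(msg)
        return msg

    def ack(self, msg):
        self.unacked.remove(msg)

    def recover(self):
        # Roughly what a broker does when a consumer connection drops.
        self.ready.extend(self.unacked)
        self.unacked.clear()


def make_handler(crash_before_work):
    """Handler that crashes once on message 'b', before or after the work."""
    crashed = set()

    def handler(msg, calls):
        first = msg == "b" and msg not in crashed
        if first and crash_before_work:
            crashed.add(msg)
            raise Crash()
        calls[msg] = calls.get(msg, 0) + 1   # the actual "work"
        if first and not crash_before_work:
            crashed.add(msg)
            raise Crash()

    return handler


def run(broker, handler, ack_first):
    """Drain the broker; return how many times each message was executed."""
    calls = {}
    while broker.ready:
        msg = broker.get()
        try:
            if ack_first:
                broker.ack(msg)    # at-most-once: ack, then process
            handler(msg, calls)
            if not ack_first:
                broker.ack(msg)    # at-least-once: process, then ack
        except Crash:
            broker.recover()
    return calls


at_most = run(ToyBroker("abc"), make_handler(True), ack_first=True)
at_least = run(ToyBroker("abc"), make_handler(False), ack_first=False)
```

In this run `at_most` has no entry for `"b"` (the message was acked and then lost with the worker), while `at_least` records `"b"` twice (the work was done, the ack was lost, and the broker redelivered).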

The oslo.messaging/Mistral API and docs must be clear about this behavior, so
that bugs are not opened against oslo.messaging because a script written via
the Mistral API is "sometimes" not executed as expected.
"sometimes" == when deployers have trouble with their rabbitmq (or whatever)
broker, or even just when a deployer restarts a broker node or when a TCP
issue occurs. In the end, the backtrace in these cases always shows only an
oslo.messaging trace (the well-known MessageTimeout...).

Also, oslo.messaging is already a fragile brick that is used by everybody and maintained by a very small subset of people (thanks to them).

I'm afraid that adding such a new API will increase the maintenance needed for this lib, while currently not many people care about it (the whole lib, not the new API).

I also wonder if other projects have the same needs (that always helps when designing a new API).


What are you proposing? Can you confirm that we should just deal with this problem on our own in Mistral? If so, that works well for us. Initially we didn’t want to switch to oslo.messaging from direct access to RabbitMQ, for this and also other reasons. But we got strong feedback from the community that said “you guys need to reuse technologies from the community and hence switch to oslo.messaging”. So we did, assuming that we would fix all needed issues in oslo.messaging relatively soon. Now it’s been ~2 years since then and we keep struggling with all that stuff.

When I see these discussions again and again, where people try to convince others that at-least-once delivery is a bad thing, I can’t participate in them anymore. We spent a lot of time thinking about it and experimenting with it, and we know all the pros and cons.

Renat Akhmerov

Maybe this could be resolved in oslo.messaging by following one of Python's slogans: "we are all responsible users here" [1].

What I'm proposing is to let the consumer of the message decide when to send the ACK, because it knows best when to do so. I can think of scenarios where the ACK has to be sent in the middle of processing a message, e.g. after receiving a message I want to store it in the DB first and send the ACK only once the message is safely stored. With that in place, we could implement whatever delivery model we want in Mistral on top of oslo.messaging.
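
A minimal sketch of the store-then-ACK scenario, to show the ordering I mean. The `Message` class with an explicit `ack()`, the `tasks` table and `on_message` are all made up for illustration; this is not an existing oslo.messaging interface, and sqlite3 just stands in for the real DB.

```python
import sqlite3


class Message:
    """Hypothetical incoming message that exposes an explicit ack()."""

    def __init__(self, msg_id, payload):
        self.msg_id = msg_id
        self.payload = payload
        self.acked = False

    def ack(self):
        self.acked = True


def open_store():
    """Create an in-memory task table standing in for the service DB."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, payload TEXT, state TEXT)"
    )
    return db


def on_message(db, message, run_task):
    # 1. Persist first. INSERT OR IGNORE makes a redelivery harmless.
    with db:
        db.execute(
            "INSERT OR IGNORE INTO tasks VALUES (?, ?, 'RECEIVED')",
            (message.msg_id, message.payload),
        )
    # 2. Only now tell the broker that it may forget the message.
    message.ack()
    # 3. Run the task. A crash here leaves a 'RECEIVED' row to recover from.
    run_task(message.payload)
    with db:
        db.execute(
            "UPDATE tasks SET state = 'DONE' WHERE id = ?", (message.msg_id,)
        )
```

If `run_task` fails after step 2, the broker no longer holds the message, but the row is still in state 'RECEIVED', so the service itself can decide whether to rerun it or report it.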

[1] https://en.wikipedia.org/wiki/Python_syntax_and_semantics#Objects

Dawid Deja