[openstack-dev] [oslo][mistral] Saga of process than ack and where can we go from here...

Mehdi Abaakouk sileht at sileht.net
Wed May 4 18:49:10 UTC 2016


On 2016-05-04 10:04, Renat Akhmerov wrote:
> No problem. Let’s not call it RPC (btw, I completely agree with that).
> But it’s one of the messaging patterns and hence should be under
> oslo.messaging I guess, no?

Yes and no. We currently have two APIs (RPC and notification), and
personally I regret having the notification part in oslo.messaging.

RPC and Notification are different beasts, and both are limited today
in terms of features because they share the same driver implementation.

Our RPC error handling is really poor. For example, Nova just puts an
instance in ERROR when something bad occurs in the oslo.messaging layer.
This forces the deployer/user to fix the issue manually.

Our Notification system doesn't allow fine-grained routing of messages;
everything goes into one configured topic/queue.

And now we want to add a new one... I'm not against this idea,
but I'm not a huge fan.

>>>> Thoughts from folks (mistral and oslo)?
>> 
>> Also, I was not at the Summit. Should I conclude that the
>> Tooz+taskflow approach (which ensures the idempotency of the
>> application within the library API) has not been accepted by the
>> Mistral folks?
> 
> 
> Speaking about idempotency, IMO it’s not a central question that we
> should be discussing here. Mistral users should have a choice: if they
> manage to make their actions idempotent it’s excellent, in many cases
> idempotency is certainly possible, btw. If no, then they know about
> potential consequences.

You shouldn't mix up the idempotency of the user task and the
idempotency of a Mistral action (which will in the end run the user
task).
You can make your Mistral task runner implementation idempotent and
simply make configurable which workflow to use when the user task is
interrupted or finishes badly, whether the user task is idempotent or
not.
This makes things very predictable. You will know, for example:
* if the user task has started or not,
* if the error is due to a node power cut while the user task was running,
* if you can safely retry a non-idempotent user task on another node,
* that you will not be impacted by a RabbitMQ restart or TCP connection
  issues,
* ...
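
To illustrate the distinction, here is a minimal stdlib-only sketch of a
runner that is idempotent even when the user task is not. It is not the
Tooz or taskflow API; Journal, State, run_task and retry_if_interrupted
are hypothetical names, and the dict stands in for a durable backend.

```python
import enum


class State(enum.Enum):
    PENDING = "pending"   # accepted, user task not started yet
    STARTED = "started"   # user task was launched, outcome unknown
    DONE = "done"         # user task finished and this was recorded


class Journal:
    """Hypothetical durable store; a dict stands in for a real backend."""

    def __init__(self):
        self._states = {}

    def get(self, task_id):
        return self._states.get(task_id, State.PENDING)

    def set(self, task_id, state):
        self._states[task_id] = state


def run_task(journal, task_id, user_task, retry_if_interrupted=False):
    """Idempotent runner: calling it again for the same task_id is safe.

    retry_if_interrupted is the per-workflow policy. Enable it only when
    the user task is idempotent or a second execution is acceptable.
    """
    state = journal.get(task_id)
    if state is State.DONE:
        return "already-done"
    if state is State.STARTED and not retry_if_interrupted:
        # A previous runner died mid-task (power cut, crash, ...).
        return "needs-operator"
    journal.set(task_id, State.STARTED)
    user_task()
    journal.set(task_id, State.DONE)
    return "executed"
```

Because the state is recorded before and after the user task, a second
runner can tell "never started" from "started, outcome unknown" and
apply the configured policy instead of guessing.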

With the oslo.messaging approach, everything will just end up in a
generic MessagingTimeout error.

The RPC API already has this kind of issue. Applications have
unfortunately had to deal with it (and I think they want something
better now).
I'm just not convinced we should add a new "working queue" API in
oslo.messaging for task scheduling that has the same issues we already
have with RPC.

Anyway, that's your choice. If you want to rely on this poor structure,
I will not oppose it; I'm not involved in Mistral. I just want everybody
to be aware of this.

> And even in this case there’s usually a number
> of measures that can be taken to mitigate those consequences (rerunning
> workflows from certain points after manually fixing problems, rollback
> scenarios etc.).

taskflow lets you describe and automate this kind of workflow really
easily.

> What I’m saying is: let’s not make that crucial decision now about
> what a messaging framework should support or not, let’s make it more
> flexible to account for variety of different usage scenarios.

I think the confusion is in the "messaging" keyword. Currently
oslo.messaging is an "RPC" framework and a "Notification" framework
built on top of 'messaging' frameworks.

The messaging frameworks we use are 'kombu', 'pika', 'zmq' and 'pyngus'.

> It’s normal for frameworks to give more rather than less.

I disagree. Here we mix different concepts into one library, and every
concept has to be implemented on top of each of the different
'messaging frameworks'.
So we fortunately give less, to make things just work the same way with
all drivers for all APIs.

> One more thing, at the summit we were discussing the possibility to
> define at-most-once/at-least-once individually for Mistral tasks. This
> is demanded because there are cases where we need to do it, advanced users
> may choose one or another depending on a task/action semantics.
> However, it won’t be possible to implement w/o changes in the
> underlying messaging framework.

If we go that way, oslo.messaging users and Mistral users have to be
aware that their job/task/action/whatever will perhaps not be called
(at-most-once) or will perhaps be called twice (at-least-once).
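
The two outcomes come down to where the ack sits relative to the
processing. A rough sketch, assuming a hypothetical in-memory FakeQueue
that redelivers unacked messages (this is not the oslo.messaging API;
consume and crash_between are made-up names for illustration):

```python
class FakeQueue:
    """Hypothetical in-memory broker: unacked messages stay deliverable."""

    def __init__(self, messages):
        self.pending = list(messages)

    def get(self):
        return self.pending[0] if self.pending else None

    def ack(self, message):
        self.pending.remove(message)


def consume(queue, handler, ack_first, crash_between=False):
    """Handle one message; crash_between simulates a failure between
    the two steps (worker death, broker restart, TCP issue)."""
    message = queue.get()
    if message is None:
        return False
    steps = [lambda: queue.ack(message), lambda: handler(message)]
    if not ack_first:
        steps.reverse()  # process, then ack: at-least-once
    steps[0]()
    if crash_between:
        raise ConnectionError("worker or broker connection lost")
    steps[1]()
    return True
```

With ack_first=True, a failure between the steps loses the message and
the handler never runs. With ack_first=False, the handler has already
run but the message is still pending, so the next consumer runs it a
second time.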

The oslo.messaging/Mistral API and docs must be clear about this
behavior, so that bugs are not opened against oslo.messaging because a
script written via the Mistral API is "sometimes" not executed as
expected.
"sometimes" == when deployers have trouble with their RabbitMQ (or
whatever) broker, or even just when a deployer restarts a broker node or
when a TCP issue occurs. In the end, the backtrace in these cases always
shows only an oslo.messaging trace (the well-known MessagingTimeout...).


Also, oslo.messaging is already a fragile brick, used by everybody but
maintained by a very small subset of people (thanks to them).

I'm afraid that adding such a new API will increase the maintenance
needed for this lib, while currently not many people care about it (the
whole lib, not the new API).

I also wonder if other projects have the same needs (that always helps
when designing a new API).

Cheers,

-- 
Mehdi Abaakouk
mail: sileht at sileht.net
irc: sileh


