[openstack-dev] [oslo][mistral] Saga of process than ack and where can we go from here...

Renat Akhmerov renat.akhmerov at gmail.com
Thu May 5 04:08:44 UTC 2016

> On 05 May 2016, at 01:49, Mehdi Abaakouk <sileht at sileht.net> wrote:
> On 2016-05-04 10:04, Renat Akhmerov wrote:
>> No problem. Let’s not call it RPC (btw, I completely agree with that).
>> But it’s one of the messaging patterns and hence should be under
>> oslo.messaging I guess, no?
> Yes and no. We currently have two APIs (RPC and notification), and
> personally I regret having the notification part in oslo.messaging.
> RPC and Notification are different beasts, and both are limited today
> in terms of features because they share the same driver implementation.
> Our RPC error handling is really poor: for example, Nova just puts an
> instance in ERROR when something bad occurs in the oslo.messaging layer.
> This forces the deployer/user to fix the issue manually.
> Our Notification system doesn't allow fine-grained routing of messages;
> everything goes into one configured topic/queue.
> And now we want to add a new one... I'm not against this idea,
> but I'm not a huge fan.
>>>>> Thoughts from folks (mistral and oslo)?
>>> Also, I was not at the Summit. Should I conclude that the Tooz+taskflow approach (which ensures the idempotency of the application within the library API) has not been accepted by the Mistral folks?
>> Speaking about idempotency, IMO it’s not a central question that we
>> should be discussing here. Mistral users should have a choice: if they
>> manage to make their actions idempotent it’s excellent, in many cases
>> idempotency is certainly possible, btw. If not, then they know about
>> potential consequences.
> You shouldn't mix up the idempotency of the user task and the idempotency
> of a Mistral action (which will in the end run the user task).
> You can make your Mistral task runner implementation idempotent and just
> make the workflow to run configurable for the case where the user task is
> interrupted or finishes badly, whether or not the user task is idempotent.
> This makes things very predictable. You will know, for example:
> * if the user task has started or not,
> * if the error is due to a node power cut while the user task was running,
> * if you can safely retry a non-idempotent user task on another node,
> * that you will not be impacted by a rabbitmq restart or TCP connection issues,
> * ...
> With the oslo.messaging approach, everything will just end up in a
> generic MessageTimeout error.
> The RPC API already has this kind of issue. Applications have unfortunately
> had to deal with that (and I think they want something better now).
> I'm just not convinced we should add a new "working queue" API in
> oslo.messaging for task scheduling that has the same issue we already
> have with RPC.
> Anyway, that's your choice. If you want to rely on this poor structure, I will
> not be against it; I'm not involved in Mistral. I just want everybody to be
> aware of this.
>> And even in this case there’s usually a number
>> of measures that can be taken to mitigate those consequences (rerunning
>> workflows from certain points after manually fixing problems, rollback
>> scenarios etc.).
> taskflow lets you describe and automate this kind of workflow really easily.
>> What I’m saying is: let’s not make that crucial decision now about
>> what a messaging framework should support or not, let’s make it more
>> flexible to account for variety of different usage scenarios.
> I think the confusion is in the "messaging" keyword: currently oslo.messaging
> is an "RPC" framework and a "Notification" framework on top of 'messaging'
> frameworks.
> The messaging frameworks we use are 'kombu', 'pika', 'zmq' and 'pyngus'.
>> It’s normal for frameworks to give more rather than less.
> I disagree. Here we mix different concepts into one library, and all concepts
> have to be implemented by the different 'messaging frameworks'.
> So we fortunately give less, to make things just work the same way with all
> drivers for all APIs.
>> One more thing, at the summit we were discussing the possibility to
>> define at-most-once/at-least-once individually for Mistral tasks. This
>> is demanded because there are cases where we need to do it, advanced users
>> may choose one or another depending on a task/action semantics.
>> However, it won’t be possible to implement w/o changes in the
>> underlying messaging framework.
> If we go that way, oslo.messaging users and Mistral users have to be aware
> that their job/task/action/whatever will perhaps not be called (at-most-once)
> or perhaps be called twice (at-least-once).
> The oslo.messaging/Mistral API and docs must be clear about this behavior, to
> avoid bugs being opened against oslo.messaging because a script written via the
> Mistral API is not executed as expected "sometimes".
> "sometimes" == when deployers have trouble with their rabbitmq (or whatever)
> broker, or even just when a deployer restarts a broker node or when a TCP
> issue occurs. In the end, the backtrace in these cases always shows only an
> oslo.messaging trace (the well-known MessageTimeout...).
> Also, oslo.messaging is already a fragile brick used by everybody and maintained by a very small subset of people (thanks to them).
> I'm afraid that adding such a new API will increase the maintenance needed for this lib, while currently not many people care about it (the whole lib, not the new API).
> I also wonder if other projects have the same needs (that always helps to design a new API).


What are you proposing? Can you confirm that we should just be dealing with this problem on our own in Mistral? If so, that works well for us. Initially we didn’t want to switch to oslo.messaging from direct access to RabbitMQ, for this and also other reasons. But we got strong feedback from the community that said “you guys need to reuse technologies from the community and hence switch to oslo.messaging”. So we did, assuming that we would fix all the needed issues in oslo.messaging relatively soon. Now it’s been ~2 years since then and we keep struggling with all that stuff.

When I see these discussions again and again, where people try to convince others that at-least-once delivery is a bad thing, I can’t participate in them anymore. We spent a lot of time thinking about it and experimenting with it, and we know all the pros and cons.
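
To make the at-most-once/at-least-once trade-off from this thread concrete, here is a toy sketch in plain Python. It does not use oslo.messaging or Mistral code: `ToyQueue`, `Crash` and `consume` are hypothetical names, and the in-memory queue only imitates a broker that redelivers unacknowledged messages after a consumer dies. The only thing it tries to show is how the order of "ack" and "process" decides whether a crash loses a task or runs it twice.

```python
from collections import deque


class Crash(Exception):
    """Stands in for a worker dying (power cut, lost TCP connection, ...)."""


class ToyQueue:
    """Hypothetical in-memory broker that redelivers unacked messages."""

    def __init__(self, messages):
        self._ready = deque(messages)
        self._unacked = []

    def empty(self):
        return not self._ready

    def get(self):
        # Hand a message to the consumer and remember it until it is acked.
        msg = self._ready.popleft()
        self._unacked.append(msg)
        return msg

    def ack(self, msg):
        self._unacked.remove(msg)

    def consumer_died(self):
        # Put unacked messages back at the front of the queue, in order.
        self._ready.extendleft(reversed(self._unacked))
        self._unacked.clear()


def consume(queue, ack_first, crash_point):
    """Drain the queue and return the tasks whose side effects happened.

    ack_first=True  -> ack, then process (at-most-once)
    ack_first=False -> process, then ack (at-least-once)
    crash_point is "before_work", "after_work" or None; the simulated
    worker crashes once, at that point, on the first message it handles.
    """
    executed = []
    crashed = False
    while not queue.empty():
        msg = queue.get()
        try:
            if ack_first:
                queue.ack(msg)
            if not crashed and crash_point == "before_work":
                crashed = True
                raise Crash()
            executed.append(msg)  # the user task's side effect
            if not crashed and crash_point == "after_work":
                crashed = True
                raise Crash()
            if not ack_first:
                queue.ack(msg)
        except Crash:
            queue.consumer_died()
    return executed
```

In this toy model, `consume(ToyQueue(["a", "b"]), ack_first=True, crash_point="before_work")` returns `["b"]` (task "a" is silently lost), while `consume(ToyQueue(["a", "b"]), ack_first=False, crash_point="after_work")` returns `["a", "a", "b"]` (task "a" runs twice, which is only harmless if it is idempotent).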

Renat Akhmerov
