[Openstack] [openstack-dev][mistral] failover/recovery and performance

Vitalii Solodilov mcdkr at yandex.ru
Tue Apr 18 15:28:45 UTC 2017


Hi all.
Do I understand correctly that mistral/ocata doesn't have failover and recovery?

I found two problems.
First is in mistral-executor.
1. Run an action that executes for a long time, for example std.sleep.
2. Kill the mistral-executor on which the action is running.
3. Restart mistral-executor.
4. The workflow and the task will stay in the RUNNING state forever.

Workaround:
Set the timeout and retry attributes on every task.
But this may have side effects in some cases.
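
To make the workaround concrete, here is a minimal sketch that adds the policies to every task of an already parsed workflow definition. The helper name add_policies, the sample workflow and the default values are only illustrative assumptions, not anything shipped with Mistral.

```python
import copy


def add_policies(workflow, timeout=60, retry_count=3, retry_delay=5):
    """Return a copy of a parsed workflow with timeout/retry set on every task.

    Hypothetical helper: the default values are placeholders, not recommendations.
    """
    patched = copy.deepcopy(workflow)
    for task in patched.get("tasks", {}).values():
        # Keep policies that a task already defines; only fill in missing ones.
        task.setdefault("timeout", timeout)
        task.setdefault("retry", {"count": retry_count, "delay": retry_delay})
    return patched


# Sample definition as it might look after loading the YAML (assumed shape).
workflow = {
    "type": "direct",
    "tasks": {
        "wait": {"action": "std.sleep seconds=600", "on-success": ["done"]},
        "done": {"action": "std.noop", "timeout": 10},
    },
}

patched = add_policies(workflow)
print(patched["tasks"]["wait"])
```

One case where this can hurt: a retry re-runs the action, so an action that is not idempotent may be executed twice if the first attempt actually finished on the killed executor.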

Second is in mistral-engine.
1. Run many workflows, each consisting of many tasks, at the same time.
 For example, 20 workflows of 20 tasks each with the std.noop action.
The tasks are linked sequentially.
2. Kill mistral-engine.
3. Restart mistral-engine.
4. Some workflows and tasks will stay in the RUNNING state forever.

I think the problem is here: https://github.com/openstack/mistral/blob/master/mistral/services/scheduler.py#L115
It happens when mistral-engine dies between the two transactions.
The delayed calls are marked as processing and are never called by the scheduler.
Do you have a workaround for this case?
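
To illustrate what I mean, here is a small in-memory sketch of the failure and of one possible recovery: releasing calls whose processing flag is older than some threshold. This is not Mistral's code. The names DelayedCall, claim and release_stale and the 60 second threshold are all my assumptions.

```python
from dataclasses import dataclass


@dataclass
class DelayedCall:
    id: int
    processing: bool = False
    claimed_at: float = 0.0  # time at which the processing flag was set
    done: bool = False


def claim(calls, now):
    """First transaction: mark pending calls as processing and return them."""
    claimed = [c for c in calls if not c.processing and not c.done]
    for c in claimed:
        c.processing = True
        c.claimed_at = now
    return claimed


def release_stale(calls, now, max_age):
    """Recovery step: clear the flag on calls claimed more than max_age seconds ago."""
    stale = [c for c in calls
             if c.processing and not c.done and now - c.claimed_at > max_age]
    for c in stale:
        c.processing = False
    return stale


calls = [DelayedCall(1), DelayedCall(2)]

claim(calls, now=0.0)                    # engine claims both calls, then is killed
after_restart = claim(calls, now=10.0)   # restarted engine finds nothing to run
released = release_stale(calls, now=100.0, max_age=60.0)
retried = claim(calls, now=100.0)        # the calls can be picked up again

print(len(after_restart), len(released), [c.id for c in retried])
```

The threshold would have to be longer than any legitimate call, otherwise a call that is still being processed could be executed twice.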

Is this the expected behavior?
Are there other similar cases where a workflow stays in the RUNNING state forever?
When will you implement failover for the executor? Will it be in the new release https://blueprints.launchpad.net/mistral/+spec/mistral-fault-tolerance ?
Will you merge it into the Ocata release?

Performance.

I found two problems too :)

First, when we start many workflows, for example 200, they take a long time to complete.
Maybe throttling could help here?
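
By throttling I mean something like the following client-side sketch, which keeps at most a fixed number of workflows in flight. run_workflow is a hypothetical stand-in for "start a workflow and wait for it to finish", and the limit of 10 is an arbitrary assumption.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 10  # assumed limit; would need tuning for a real deployment

_lock = threading.Lock()
_running = 0
peak = 0  # highest number of workflows observed running at once


def run_workflow(n):
    """Hypothetical placeholder: start workflow n and block until it completes."""
    global _running, peak
    with _lock:
        _running += 1
        peak = max(peak, _running)
    time.sleep(0.01)  # stands in for the real start-and-wait logic
    with _lock:
        _running -= 1
    return n


# The pool size caps how many workflows are in flight at any moment.
with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
    results = list(pool.map(run_workflow, range(200)))

print(len(results), peak)
```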

Second, can you help with scaling Mistral? What settings should I change in mistral.conf or in the RabbitMQ config?

Speedup relative to 1 mistral-engine and 1 mistral-executor. Each workflow consists of 20 tasks with the std.noop action.

mistral-engine  mistral-executor  speedup
2               2                 1.70
3               3                 2.08
5               5                 2.22
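
To show how quickly the scaling flattens, the figures above can be turned into a parallel efficiency (speedup divided by the number of engine/executor pairs). This is only arithmetic on my measurements.

```python
# Measured speedup for n mistral-engine + n mistral-executor instances.
measurements = {2: 1.70, 3: 2.08, 5: 2.22}

# Parallel efficiency: 1.0 would mean perfect linear scaling.
efficiency = {n: round(speedup / n, 2) for n, speedup in measurements.items()}

for n, value in efficiency.items():
    print(f"{n} instances: efficiency {value}")
```

The efficiency drops from 0.85 with 2 instances to 0.69 with 3 and 0.44 with 5.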

Third.
A workflow of 20 tasks takes longer than one second to run.
The last step, the completion of the workflow, takes 200 milliseconds just for scheduling.
Can I reduce the value at https://github.com/openstack/mistral/blob/master/mistral/engine/workflow_handler.py#L115 ?
Would that make things worse?

Best regards,

Vitalii Solodilov



