[openstack-dev] [Mistral] Refine engine <-> executor protocol
rakhmerov at mirantis.com
Wed Jun 18 05:58:11 UTC 2014
On 13 Jun 2014, at 07:03, W Chan <m4d.coder at gmail.com> wrote:
> Design proposal for blueprint https://blueprints.launchpad.net/mistral/+spec/mistral-engine-executor-protocol
> Rename Executor to Worker.
I’d be OK with Worker but would prefer ActionRunner, since it reflects the purpose a little better, although it is more verbose.
> Continue to use RPC server-client model in oslo.messaging for Engine and Worker protocol.
> Use asynchronous call (cast) between Worker and Engine where appropriate.
I would even emphasize: only async calls make sense.
> Remove any DB access from Worker. DB IO will only be done by Engine.
I still have doubts that it’s actually possible. This is part of the issue I mentioned in the previous email. I’ll post a more detailed email on that separately.
> Worker updates Engine that it's going to start running action now. If execution is not RUNNING and task is not IDLE, Engine tells Worker to halt at this point. Worker cannot assume execution state is RUNNING and task state is IDLE because the handle_task message could have been sitting in the message queue for awhile. This call between Worker and Engine is synchronous, meaning Worker will wait for a response from the Engine. Currently, Executor checks state and updates task state directly to the DB before running the action.
Yes, that’s how it works now. First of all, as I said before, we can’t afford to make any sync calls between engine and executor because that would lead to problems with scalability and fault tolerance. For that reason we make DB calls directly to make sure that the execution and the task itself are in a suitable state. This only works reliably if both engine and executor use READ_COMMITTED transactions, which I believe currently isn’t true since we use sqlite (it doesn’t seem to support them, right?). With mysql it should be fine.
So the whole initial idea was to use the DB whenever we need to make sure that something is in the right state. That’s why all reads should see only committed data. We use the queue just to notify executors about new tasks. Basically, we could even have skipped the queue and polled the DB instead, but the queue looked more elegant.
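To make the "DB is the source of truth, queue is only a notification" idea concrete, here is a minimal sketch. It is not the current Mistral code: the two-table schema and the helper name try_claim_task are assumptions for illustration, and sqlite3 is used only to keep the example self-contained. The point is that the state check and the state change happen in a single statement, so a stale queue message cannot start a task.

```python
import sqlite3


def make_db():
    # In-memory DB with a hypothetical, simplified schema.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE executions (id INTEGER PRIMARY KEY, state TEXT)")
    conn.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, execution_id INTEGER, state TEXT)"
    )
    conn.execute("INSERT INTO executions VALUES (1, 'RUNNING')")
    conn.execute("INSERT INTO tasks VALUES (10, 1, 'IDLE')")
    conn.commit()
    return conn


def try_claim_task(conn, task_id):
    """Move the task IDLE -> RUNNING only if its execution is RUNNING.

    The check and the update are one statement, so there is no gap
    between reading the state and acting on it.
    """
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "UPDATE tasks SET state = 'RUNNING' "
            "WHERE id = ? AND state = 'IDLE' AND execution_id IN "
            "(SELECT id FROM executions WHERE state = 'RUNNING')",
            (task_id,),
        )
    return cur.rowcount == 1


conn = make_db()
first = try_claim_task(conn, 10)   # the first delivery claims the task
second = try_claim_task(conn, 10)  # a stale or duplicate message is rejected
```

With this shape the executor only needs the task id from the queue message; everything else is decided against committed DB state.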
It’s all part of one problem. Let’s try to figure out options to simplify the protocol and make it more reliable.
> Worker communicates result (success or failure) to Engine. Currently, Executor is inconsistent and calls Engine.convey_task_result on success and write directly to DB on failure.
Yes, that probably needs to be fixed.
> Engine -> Worker.handle_task
> Worker converts action spec to Action instance
Yes, it uses the action spec if it’s an ad-hoc action. If not, it just gets the action class from the factory and instantiates it.
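Roughly, the two paths could look like the sketch below. The registry, the class EchoAction, the function create_action and the spec keys "base" and "base_parameters" are all hypothetical names chosen for the example, not the real Mistral factory or DSL.

```python
class EchoAction:
    """Hypothetical built-in action that returns its parameters."""

    def __init__(self, **params):
        self.params = params

    def run(self):
        return self.params


# Hypothetical factory registry: action name -> action class.
ACTION_CLASSES = {"std.echo": EchoAction}


def create_action(name, params, action_spec=None):
    """Build an action instance for a task.

    action_spec is passed only for ad-hoc actions; in this sketch it
    names the base action to wrap and supplies default parameters.
    """
    if action_spec is not None:
        base_cls = ACTION_CLASSES[action_spec["base"]]
        merged = dict(action_spec.get("base_parameters", {}))
        merged.update(params)  # task parameters override spec defaults
        return base_cls(**merged)
    # Regular action: look the class up and instantiate it.
    return ACTION_CLASSES[name](**params)


regular = create_action("std.echo", {"output": "hi"})
adhoc = create_action(
    "my_echo",
    {"output": "hi"},
    action_spec={"base": "std.echo", "base_parameters": {"prefix": ">"}},
)
```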
> Worker -> Engine.confirm_task_execution. Engine returns an exception if execution state is not RUNNING or task state is not IDLE.
Maybe I don’t entirely follow your thought, but I don’t think it’s going to work. After the engine confirms everything’s OK we’ll have a concurrency window again, after which we’d have to confirm the states again. That’s why I was talking about READ_COMMITTED DB transactions: we need to eliminate concurrency windows.
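A toy illustration of the window I mean, with the interleaving written out by hand so it is deterministic. The Engine class here is an in-memory stand-in, and stop_execution is an assumed name; only confirm_task_execution comes from the proposal.

```python
class Engine:
    """Toy in-memory engine; stands in for the real one."""

    def __init__(self):
        self.execution_state = "RUNNING"
        self.task_state = "IDLE"

    def confirm_task_execution(self):
        # A check only: nothing stops the state from changing right after.
        return self.execution_state == "RUNNING" and self.task_state == "IDLE"

    def stop_execution(self):
        self.execution_state = "STOPPED"


engine = Engine()
confirmed = engine.confirm_task_execution()  # step 1: worker asks, gets OK
engine.stop_execution()                      # step 2: execution is stopped
# Step 3: the worker would now run the action based on a stale answer.
action_ran_after_stop = confirmed and engine.execution_state != "RUNNING"
```

However many times the worker re-confirms, the same gap exists between the last confirmation and the action actually starting.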
> Worker runs action
> Worker -> Engine.convey_task_result
That looks fine (it’s as it is now). Maybe the only thing we need to pay attention to is how we communicate errors back to the engine. It seems logical that “convey_task_result()” can also be used to pass information about errors, so that an error is considered a special case of a regular result. Need to think it over though...
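One possible shape for "an error is a special case of a result", as a sketch only. TaskResult, run_action and the simplified convey_task_result signature below are assumptions for illustration, not the existing engine API.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TaskResult:
    """Hypothetical result envelope: data on success, error on failure."""

    data: Any = None
    error: Optional[str] = None

    def is_error(self):
        return self.error is not None


def run_action(action):
    # Worker side: never raise, always produce a result object.
    try:
        return TaskResult(data=action())
    except Exception as exc:
        return TaskResult(error=f"{type(exc).__name__}: {exc}")


def convey_task_result(task_states, task_id, result):
    """Engine side: one entry point handles both outcomes."""
    task_states[task_id] = "ERROR" if result.is_error() else "SUCCESS"
    return task_states[task_id]


states = {}
ok = convey_task_result(states, 1, run_action(lambda: 42))
failed = convey_task_result(states, 2, run_action(lambda: 1 / 0))
```

This way the worker has a single path back to the engine and never needs to write a failure to the DB itself.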
@ Mirantis Inc.