[openstack-dev] [Heat] Using Job Queues for timeout ops

Murugan, Visnusaran visnusaran.murugan at hp.com
Thu Nov 13 09:27:42 UTC 2014


The intention is not to transfer the workload of a failed engine onto an active one. The convergence implementation we are working on will be able to recover from a failure, provided a timeout notification reaches heat-engine. All I want is a safe holding area for my timeout tasks. A timeout can be a stack timeout or a resource timeout.

By code change :) I meant that posting to a job queue would be a matter of decorating the timeout method and firing it for delayed execution. I felt that we need not use TaskFlow just for posting a delayed execution (a timer, in our case).
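
To make the "decorate and fire later" idea concrete, here is a minimal in-process sketch using only the standard library. It is not Heat or celery code: `DelayQueue`, `delayed`, `fire_later` and `stack_timeout` are hypothetical names, and the clock is passed in explicitly so the behaviour is easy to follow. A real holding area would have to live outside the engine process (for example in a broker) to survive an engine failure.

```python
import heapq
import itertools


class DelayQueue:
    """Toy in-memory holding area for delayed calls (hypothetical, not Heat code)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal due times

    def post(self, due, func, args, kwargs):
        # Entries are ordered by due time, then by insertion order.
        heapq.heappush(self._heap, (due, next(self._counter), func, args, kwargs))

    def run_due(self, now):
        """Run every call whose due time has passed and return the results."""
        results = []
        while self._heap and self._heap[0][0] <= now:
            _, _, func, args, kwargs = heapq.heappop(self._heap)
            results.append(func(*args, **kwargs))
        return results


def delayed(queue):
    """Decorator that adds a fire_later(now, delay, ...) helper to a function."""
    def wrap(func):
        def fire_later(now, delay, *args, **kwargs):
            queue.post(now + delay, func, args, kwargs)
        func.fire_later = fire_later
        return func
    return wrap


timeouts = DelayQueue()


@delayed(timeouts)
def stack_timeout(stack_id):
    # Placeholder for whatever the engine does when a stack times out.
    return "timeout:%s" % stack_id


stack_timeout.fire_later(now=0, delay=3600, stack_id="stack-1")
early = timeouts.run_due(now=10)    # nothing is due yet
late = timeouts.run_due(now=3600)   # the stack timeout fires
```

With celery, the rough equivalent would be decorating the function with `@app.task` and calling `apply_async(..., countdown=3600)`, with the broker playing the role of the holding area.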

Correct me if I'm wrong.


From: Joshua Harlow [mailto:harlowja at outlook.com]
Sent: Thursday, November 13, 2014 2:15 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Heat] Using Job Queues for timeout ops

A question;

How is using something like celery in heat vs taskflow in heat (or at least concept [1]) 'too many code changes'?

Both seem like change of similar levels ;-)

What was your metric for determining the code change either would have (out of curiosity)?

Perhaps u should look at [2], although I'm unclear on what the desired functionality is here.

Do u want the single engine to transfer its work to another engine when it 'goes down'? If so then the jobboard model + zookeeper inherently does this.
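
A toy illustration of that jobboard idea, in plain Python rather than the TaskFlow API: a job is posted, one engine claims it, and the claim is released when that engine is lost so another engine can pick the job up. `ToyJobBoard` and its methods are hypothetical names. In a ZooKeeper-backed setup the release would typically come from an ephemeral node disappearing when the engine's session ends; here it is an explicit `engine_lost` call.

```python
class ToyJobBoard:
    """In-memory stand-in for a jobboard (hypothetical, not the TaskFlow API)."""

    def __init__(self):
        self._owners = {}  # job name -> owning engine id, or None if unclaimed

    def post(self, job):
        self._owners[job] = None

    def claim(self, job, engine):
        """Claim a posted job if nobody owns it; return True on success."""
        if job in self._owners and self._owners[job] is None:
            self._owners[job] = engine
            return True
        return False

    def engine_lost(self, engine):
        """Release every claim held by a dead engine and return the freed jobs."""
        freed = [job for job, owner in self._owners.items() if owner == engine]
        for job in freed:
            self._owners[job] = None
        return freed


board = ToyJobBoard()
board.post("stack-1-timeout")
first = board.claim("stack-1-timeout", "engine-a")    # engine-a owns the job
second = board.claim("stack-1-timeout", "engine-b")   # rejected, already owned
freed = board.engine_lost("engine-a")                 # engine-a goes down
third = board.claim("stack-1-timeout", "engine-b")    # engine-b takes over
```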

Or maybe u want something else? I'm probably confused because u seem to be asking for resource timeouts + recover from engine failure (which seems like a liveness issue and not a resource timeout one), those 2 things seem separable.

[1] http://docs.openstack.org/developer/taskflow/jobs.html

[2] http://docs.openstack.org/developer/taskflow/examples.html#jobboard-producer-consumer-simple

On Nov 13, 2014, at 12:29 AM, Murugan, Visnusaran <visnusaran.murugan at hp.com> wrote:

Hi all,

Convergence-POC distributes stack operations by sending resource actions over RPC for any heat-engine to execute. The entire stack lifecycle will be controlled by worker/observer notifications. This distributed model has its own advantages and disadvantages.

Any stack operation has a timeout, and a single engine is responsible for it. If that engine goes down, the timeout is lost along with it. The traditional approach is for other engines to recreate the timeout from scratch. Also, a missed resource action notification will be detected only when the stack operation timeout fires.

To overcome this, we will need the following capabilities:
1. Resource timeout (can be used for retry)
2. Recover from engine failure (loss of stack timeout, resource action notification)

Possible approaches:
1. Use a task queue like celery to host timeouts for both stacks and resources.
2. Poll the database for engine failures and restart timers / retrigger resource retries (IMHO: this would be the traditional approach and weighs heavy).
3. Migrate heat to use TaskFlow. (Too many code changes)
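
For comparison, the database-polling approach could look roughly like the sketch below. It is only an illustration under assumptions: the heartbeat and timeout records are plain Python structures standing in for database rows, and `find_orphaned_timeouts` and `heartbeat_grace` are hypothetical names, not existing Heat code.

```python
def find_orphaned_timeouts(engines, timeouts, now, heartbeat_grace):
    """Return (stack_id, remaining_seconds) for timeouts owned by stale engines.

    engines:  {engine_id: last_heartbeat_time}
    timeouts: list of (stack_id, owner_engine_id, deadline)
    """
    # An engine is considered dead if its heartbeat is older than the grace period.
    dead = {eid for eid, beat in engines.items() if now - beat > heartbeat_grace}
    orphaned = []
    for stack_id, owner, deadline in timeouts:
        if owner in dead:
            # A surviving engine would re-arm the timer with the time that is left.
            orphaned.append((stack_id, max(0, deadline - now)))
    return orphaned


engines = {"engine-a": 100, "engine-b": 40}
timeouts = [
    ("stack-1", "engine-a", 500),
    ("stack-2", "engine-b", 300),
    ("stack-3", "engine-b", 90),
]
orphans = find_orphaned_timeouts(engines, timeouts, now=110, heartbeat_grace=30)
```

Every engine would have to run a check like this periodically, which is part of why this approach weighs heavier than handing the timers to an external queue.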

I am not suggesting we use TaskFlow. Using celery would need only minimal code changes (decorating the appropriate functions).

Your thoughts.

IRC: ckmvishnu