[openstack-dev] [Heat] Using Job Queues for timeout ops

Murugan, Visnusaran visnusaran.murugan at hp.com
Thu Nov 13 08:29:49 UTC 2014


Hi all,

The Convergence POC distributes stack operations by sending resource actions over RPC for any heat-engine to execute. The entire stack lifecycle will be controlled by worker/observer notifications. This distributed model has its own advantages and disadvantages.

Every stack operation has a timeout, and a single engine is responsible for it. If that engine goes down, the timeout is lost along with it. The traditional fix is for another engine to recreate the timeout from scratch. In addition, a missed resource action notification is detected only when the stack operation timeout fires.

To overcome this, we will need the following capabilities:

1. Resource timeout (can be used for retry)

2. Recovery from engine failure (loss of the stack timeout or of a resource action notification)
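
A minimal sketch of the two capabilities above, assuming timeouts are recorded as absolute deadlines in a store that every engine can reach. `Deadline`, `DeadlineStore`, `arm`, `adopt_orphans` and `expired` are hypothetical names used only for illustration (they are not Heat or Celery APIs), and the in-memory dict stands in for a database table or a durable queue:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Deadline:
    key: str           # e.g. "stack:s1" or "resource:r1" (hypothetical naming)
    expires_at: float  # absolute time, so any engine can evaluate it
    owner: str         # engine currently responsible for this timeout


@dataclass
class DeadlineStore:
    """Hypothetical stand-in for a shared, durable store."""
    rows: Dict[str, Deadline] = field(default_factory=dict)

    def arm(self, key: str, timeout: float, owner: str, now: float) -> None:
        # Store the expiry time, not the remaining time, so it survives a handover.
        self.rows[key] = Deadline(key, now + timeout, owner)

    def adopt_orphans(self, dead_engine: str, new_owner: str) -> List[str]:
        # Hand a failed engine's deadlines to a live engine, keeping the expiry.
        adopted = []
        for deadline in self.rows.values():
            if deadline.owner == dead_engine:
                deadline.owner = new_owner
                adopted.append(deadline.key)
        return sorted(adopted)

    def expired(self, owner: str, now: float) -> List[str]:
        # Keys owned by this engine whose deadline has passed.
        return sorted(
            key for key, d in self.rows.items()
            if d.owner == owner and d.expires_at <= now
        )


store = DeadlineStore()
store.arm("stack:s1", timeout=3600, owner="engine-1", now=0)
store.arm("resource:r1", timeout=60, owner="engine-1", now=0)

# engine-1 fails at t=30; engine-2 takes over without resetting the clocks.
adopted = store.adopt_orphans("engine-1", "engine-2")
expired_at_90 = store.expired("engine-2", now=90)
```

In this sketch the resource deadline still fires 60 seconds after it was armed even though its original owner is gone, which is the point at which a retry could be triggered. How a dead engine is detected is left out.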


Suggestions:

1. Use a task queue such as Celery to host timeouts for both stacks and resources.

2. Poll the database for engine failures and restart timers / retrigger resource retries (IMHO this is the traditional approach and is heavyweight).

3. Migrate Heat to use TaskFlow (too many code changes).

I am not suggesting we use TaskFlow. Using Celery would require only minimal code changes (decorating the appropriate functions).
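
To illustrate what "decorating the appropriate functions" could look like, here is a standard-library-only sketch of the pattern. It is not Celery: `timeout_task`, `schedule`, `run_due`, `stack_timed_out` and `resource_timed_out` are hypothetical names, and the in-memory heap stands in for a real broker. The idea it shows is that only the task name and its arguments are queued, so whichever worker is alive when the deadline passes can run the handler:

```python
import heapq
import itertools
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical in-memory stand-ins for a task registry and a broker queue.
_REGISTRY: Dict[str, Callable[..., Any]] = {}
_QUEUE: List[Tuple[float, int, str, tuple]] = []
_SEQ = itertools.count()  # tie-breaker so equal due times never compare args


def timeout_task(func):
    """Register func by name and attach a schedule(delay, now, *args) helper."""
    _REGISTRY[func.__name__] = func

    def schedule(delay, now, *args):
        # Queue only the name and arguments, never the function object itself.
        heapq.heappush(_QUEUE, (now + delay, next(_SEQ), func.__name__, args))

    func.schedule = schedule
    return func


def run_due(now):
    """What any worker would do: pop and run every task that is due."""
    results = []
    while _QUEUE and _QUEUE[0][0] <= now:
        _, _, name, args = heapq.heappop(_QUEUE)
        results.append(_REGISTRY[name](*args))
    return results


@timeout_task
def stack_timed_out(stack_id):
    return f"stack {stack_id} timed out"


@timeout_task
def resource_timed_out(resource_id):
    return f"retry resource {resource_id}"


stack_timed_out.schedule(3600, 0, "s1")
resource_timed_out.schedule(60, 0, "r1")

early = run_due(now=30)  # nothing is due yet
later = run_due(now=90)  # only the resource timeout is due
```

With Celery itself, the decorator would be `@app.task` on a configured `Celery` application and the delayed call would be `apply_async(args=..., countdown=...)`, with the broker rather than a single engine holding the pending timeout.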


Your thoughts?

-Vishnu
IRC: ckmvishnu