[openstack-dev] [Heat] Using Job Queues for timeout ops

Joshua Harlow harlowja at outlook.com
Thu Nov 13 22:01:14 UTC 2014

On Nov 13, 2014, at 7:10 AM, Clint Byrum <clint at fewbar.com> wrote:

> Excerpts from Joshua Harlow's message of 2014-11-13 00:45:07 -0800:
>> A question;
>> How is using something like celery in heat vs taskflow in heat (or at least concept [1]) 'too many code changes'?
>> Both seem like changes of a similar level ;-)
> I've tried a few times to dive into refactoring some things to use
> TaskFlow at a shallow level, and have always gotten confused and
> frustrated.
> The number of lines that are changed probably is the same. But the
> massive shift in thinking is not an easy one to make. It may be worth some
> thinking on providing a shorter bridge to TaskFlow adoption, because I'm
> a huge fan of the idea and would _start_ something with it in a heartbeat,
> but refactoring things to use it feels really weird to me.

I wonder how I can make that better...

Were the concepts that new/different? Maybe I just have more of a functional programming background, and the way taskflow gets you to create tasks, order them ahead of time, and then *later* run them is still a foreign concept for folks who have not worked with non-procedural languages. What were the confusion points, and how can I help address them? More docs maybe, more examples, something else?
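
To make the "declare now, run later" idea concrete, here is a toy sketch in plain Python. It is deliberately NOT the taskflow API; the class names, the `run` helper and the volume example are all made up for illustration. It only shows the shape of the shift: building the flow has no side effects, and execution is a separate step handed to a runner.

```python
# Toy illustration only: this is NOT the taskflow API, just the shape of the idea.


class Task:
    """A named unit of work; nothing runs when it is constructed."""

    def __init__(self, name, func):
        self.name = name
        self.func = func


class LinearFlow:
    """Holds tasks in the order they were added; running is a separate step."""

    def __init__(self, name):
        self.name = name
        self.tasks = []

    def add(self, *tasks):
        self.tasks.extend(tasks)
        return self  # allow chaining


def run(flow, store):
    """Execute the tasks in declared order, sharing one dict between them."""
    executed = []
    for t in flow.tasks:
        t.func(store)
        executed.append(t.name)
    return executed


# Hypothetical work items, invented for this example.
def create_volume(store):
    store["volume"] = "vol-1"


def attach_volume(store):
    store["attached"] = store["volume"]


# 1. Declare and order the work (no side effects yet).
flow = LinearFlow("provision").add(
    Task("create_volume", create_volume),
    Task("attach_volume", attach_volume),
)

# 2. Later, hand the description to a runner.
store = {}
order = run(flow, store)
```

Because the flow is just data until `run` is called, something other than the code that built it (another process, after a crash, say) could in principle be the one to execute or resume it.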

I would agree that the jobboard[0] concept is different than the other parts of taskflow, but it could be useful here:

Basically at its core it's an application of zookeeper where 'jobs' are posted to a directory (using sequenced nodes in zookeeper, so that ordering is retained). Entities then acquire ephemeral locks on those 'jobs' (these locks will be released if the owner process disconnects, or fails...) and then work on the contents of that job (where contents can be pretty much arbitrary). This creates a highly available job queue (queue-like due to the node sequencing[1]), and it sounds pretty similar to what zaqar could provide in theory (except the zookeeper one is proven, battle-hardened, works and exists...). But we should of course continue being scared of zookeeper, because u know, who wants to use a tool where it would fit, haha (this is a joke).

[0] https://github.com/openstack/taskflow/blob/master/taskflow/jobs/jobboard.py#L25 

[1] http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#Sequence+Nodes+--+Unique+Naming
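
The jobboard model described above can be mimicked in a few lines of in-memory Python. This is only a sketch of the semantics (sequenced posting, claims that vanish when their owner goes away); it uses neither zookeeper nor the taskflow jobboard API, and `ToyJobBoard`, its method names and the engine/stack names are hypothetical.

```python
import itertools


class ToyJobBoard:
    """In-memory stand-in for the jobboard idea; not zookeeper, not taskflow."""

    def __init__(self):
        self._seq = itertools.count()  # mimics sequenced node numbering
        self._jobs = {}                # seq -> job contents
        self._owners = {}              # seq -> owner currently holding the claim

    def post(self, contents):
        """Post a job; the returned sequence number preserves posting order."""
        seq = next(self._seq)
        self._jobs[seq] = contents
        return seq

    def claim_next(self, owner):
        """Claim the oldest unclaimed job, or return None if there is none."""
        for seq in sorted(self._jobs):
            if seq not in self._owners:
                self._owners[seq] = owner
                return seq
        return None

    def consume(self, seq, owner):
        """Remove a finished job; only the current owner may do this."""
        if self._owners.get(seq) != owner:
            raise RuntimeError("job %s is not claimed by %s" % (seq, owner))
        del self._jobs[seq]
        del self._owners[seq]

    def owner_lost(self, owner):
        """Mimic ephemeral locks vanishing when an owner disconnects or dies."""
        for seq in [s for s, o in self._owners.items() if o == owner]:
            del self._owners[seq]


board = ToyJobBoard()
first = board.post({"action": "update", "stack": "stack-a"})
second = board.post({"action": "update", "stack": "stack-b"})

claimed_by_1 = board.claim_next("engine-1")  # engine-1 takes the oldest job
claimed_by_2 = board.claim_next("engine-2")  # engine-2 takes the next one

board.owner_lost("engine-1")                 # engine-1 dies mid-job
reclaimed = board.claim_next("engine-2")     # its job becomes claimable again
```

In the real thing the `owner_lost` step is not something anyone calls; the release happens because the owner's ephemeral lock goes away with its session.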

>> What was your metric for determining the code change either would have (out of curiosity)?
>> Perhaps u should look at [2], although I'm unclear on what the desired functionality is here.
>> Do u want the single engine to transfer its work to another engine when it 'goes down'? If so then the jobboard model + zookeeper inherently does this.
>> Or maybe u want something else? I'm probably confused because u seem to be asking for resource timeouts + recover from engine failure (which seems like a liveness issue and not a resource timeout one), those 2 things seem separable.
> I agree with you on this. It is definitely a liveness problem. The
> resource timeout isn't something I've seen discussed before. We do have
> a stack timeout, and we need to keep on honoring that, but we can do
> that with a job that sleeps for the stack timeout if we have a liveness
> guarantee that will resurrect the job (with the sleep shortened by the
> time since stack-update-time) somewhere else if the original engine
> can't complete the job.
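
The shortened sleep mentioned above is simple arithmetic; a rough sketch, assuming the stack timeout is kept in seconds and stack-update-time as a Unix timestamp (the function and parameter names here are hypothetical, not heat's):

```python
import time


def remaining_timeout(stack_timeout, stack_update_time, now=None):
    """Seconds a resurrected timeout job still has to sleep (never negative)."""
    if now is None:
        now = time.time()
    elapsed = now - stack_update_time
    return max(0.0, stack_timeout - elapsed)


# Hypothetical numbers: a 3600 s stack timeout, picked up by another engine
# 1000 s after the stack update started.
left = remaining_timeout(3600, stack_update_time=5000, now=6000)

# If the job is resurrected after the deadline, there is nothing left to wait.
expired = remaining_timeout(3600, stack_update_time=5000, now=10000)
```

A result of zero means the engine that picks the job up should treat the stack operation as already timed out instead of sleeping.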
>> [1] http://docs.openstack.org/developer/taskflow/jobs.html
>> [2] http://docs.openstack.org/developer/taskflow/examples.html#jobboard-producer-consumer-simple
>> On Nov 13, 2014, at 12:29 AM, Murugan, Visnusaran <visnusaran.murugan at hp.com> wrote:
>>> Hi all,
>>> Convergence-POC distributes stack operations by sending resource actions over RPC for any heat-engine to execute. Entire stack lifecycle will be controlled by worker/observer notifications. This distributed model has its own advantages and disadvantages.
>>> Any stack operation has a timeout, and a single engine is responsible for it. If that engine goes down, the timeout is lost along with it. The traditional fix is for other engines to recreate the timeout from scratch. Also, a missed resource action notification will be detected only when the stack operation timeout fires.
>>> To overcome this, we will need the following capability:
>>> 1. Resource timeout (can be used for retry)
>>> 2. Recover from engine failure (loss of stack timeout, resource action notification)
>>> Suggestion:
>>> 1. Use a task queue like celery to host timeouts for both stack and resource.
>>> 2. Poll the database for engine failures and restart timers / retrigger resource retries (IMHO: this would be the traditional approach, and it is heavyweight).
>>> 3. Migrate heat to use TaskFlow (too many code changes).
>>> I am not suggesting we use TaskFlow. Using celery will need only minimal code changes (decorating the appropriate functions).
>>> Your thoughts.
>>> -Vishnu
>>> IRC: ckmvishnu
>>> _______________________________________________
>>> OpenStack-dev mailing list
>>> OpenStack-dev at lists.openstack.org
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev