[openstack-dev] [Heat] Using Job Queues for timeout ops

Clint Byrum clint at fewbar.com
Thu Nov 13 15:10:34 UTC 2014


Excerpts from Joshua Harlow's message of 2014-11-13 00:45:07 -0800:
> A question;
> 
> How is using something like celery in heat vs taskflow in heat (or at least concept [1]) 'too many code changes'?
> 
> Both seem like change of similar levels ;-)
> 

I've tried a few times to dive into refactoring some things to use
TaskFlow at a shallow level, and have always gotten confused and
frustrated.

The number of lines changed is probably the same. But the
massive shift in thinking is not an easy one to make. It may be worth some
thinking on providing a shorter bridge to TaskFlow adoption, because I'm
a huge fan of the idea and would _start_ something with it in a heartbeat,
but refactoring things to use it feels really weird to me.

> What was your metric for determining the code change either would have (out of curiosity)?
> 
> Perhaps u should look at [2], although I'm unclear on what the desired functionality is here.
> 
> Do u want the single engine to transfer its work to another engine when it 'goes down'? If so then the jobboard model + ZooKeeper inherently does this.
> 
> Or maybe u want something else? I'm probably confused because u seem to be asking for resource timeouts + recover from engine failure (which seems like a liveness issue and not a resource timeout one), those 2 things seem separable.
> 

I agree with you on this. It is definitely a liveness problem. The
resource timeout isn't something I've seen discussed before. We do have
a stack timeout, and we need to keep on honoring that, but we can do
that with a job that sleeps for the stack timeout if we have a liveness
guarantee that will resurrect the job (with the sleep shortened by the
time since stack-update-time) somewhere else if the original engine
can't complete the job.
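
To make the shortened-sleep idea concrete, here is a minimal sketch in
plain Python. It is not Heat code: the function names, the callback, and
the use of epoch seconds for stack-update-time are all assumptions made
for illustration. It only shows how a resurrected job could work out how
long it still has to sleep.

```python
import time


def remaining_timeout(stack_timeout, update_time, now=None):
    """Return the seconds left before the stack operation times out.

    stack_timeout: the stack timeout in seconds (hypothetical parameter).
    update_time:   when the stack update started, in epoch seconds
                   (an assumed representation of stack-update-time).
    """
    if now is None:
        now = time.time()
    elapsed = max(0.0, now - update_time)
    # Never return a negative sleep; an expired timeout fires immediately.
    return max(0.0, stack_timeout - elapsed)


def run_timeout_job(stack_timeout, update_time, on_timeout,
                    sleep=time.sleep, now=None):
    """Sleep for whatever is left of the stack timeout, then fire.

    Whichever engine picks the job up (the original one or a replacement)
    calls this with the same update_time, so a resurrected job sleeps
    only for the time that has not already passed.
    """
    sleep(remaining_timeout(stack_timeout, update_time, now=now))
    on_timeout()
```

The liveness guarantee itself (noticing that the original engine is gone
and handing the job to another one) is outside this sketch; that part
would have to come from something like the jobboard model mentioned
above.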

> [1] http://docs.openstack.org/developer/taskflow/jobs.html
> 
> [2] http://docs.openstack.org/developer/taskflow/examples.html#jobboard-producer-consumer-simple
> 
> On Nov 13, 2014, at 12:29 AM, Murugan, Visnusaran <visnusaran.murugan at hp.com> wrote:
> 
> > Hi all,
> >  
> > Convergence-POC distributes stack operations by sending resource actions over RPC for any heat-engine to execute. The entire stack lifecycle will be controlled by worker/observer notifications. This distributed model has its own advantages and disadvantages.
> >  
> > Any stack operation has a timeout, and a single engine is responsible for it. If that engine goes down, the timeout is lost along with it. The traditional fix is for other engines to recreate the timeout from scratch. Also, a missed resource action notification will be detected only when the stack operation timeout fires.
> >  
> > To overcome this, we will need the following capability:
> > 1.       Resource timeout (can be used for retry)
> > 2.       Recover from engine failure (loss of stack timeout, resource action notification)
> >  
> >  
> > Suggestion:
> > 1.       Use a task queue like celery to host timeouts for both stack and resource.
> > 2.       Poll the database for engine failures and restart timers / retrigger resource retries (IMHO: this would be the traditional approach, and it is heavyweight)
> > 3.       Migrate heat to use TaskFlow. (Too many code changes)
> >  
> > I am not suggesting we use TaskFlow. Using celery would require very minimal code changes (decorating the appropriate functions).
> >  
> >  
> > Your thoughts.
> >  
> > -Vishnu
> > IRC: ckmvishnu
> > _______________________________________________
> > OpenStack-dev mailing list
> > OpenStack-dev at lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


