[openstack-dev] [Heat] Using Job Queues for timeout ops

Clint Byrum clint at fewbar.com
Fri Nov 14 00:08:55 UTC 2014


Excerpts from Joshua Harlow's message of 2014-11-13 14:01:14 -0800:
> On Nov 13, 2014, at 7:10 AM, Clint Byrum <clint at fewbar.com> wrote:
> 
> > Excerpts from Joshua Harlow's message of 2014-11-13 00:45:07 -0800:
> >> A question;
> >> 
> >> How is using something like celery in Heat vs. taskflow in Heat (or at least the concept [1]) 'too many code changes'?
> >> 
> >> Both seem like changes of a similar size ;-)
> >> 
> > 
> > I've tried a few times to dive into refactoring some things to use
> > TaskFlow at a shallow level, and have always gotten confused and
> > frustrated.
> > 
> > The number of lines changed is probably the same. But the
> > massive shift in thinking is not an easy one to make. It may be worth some
> > thought on providing a shorter bridge to TaskFlow adoption, because I'm
> > a huge fan of the idea and would _start_ something with it in a heartbeat,
> > but refactoring existing code to use it feels really weird to me.
> 
> I wonder how I can make that better...
> 
> Were the concepts that new/different? Maybe I just have more of a functional programming background, and the way taskflow gets you to create tasks, order them ahead of time, and then *later* run them is still a foreign concept for folks who have not done things in non-procedural languages. What were the confusion points, and how can I help address them? More docs, more examples, something else?

My feeling is that it is hard to let go of the language constructs that
_seem_ to solve the problems TaskFlow does, even though in fact they are
the problem because they're using the stack for control-flow where we
want that control-flow to yield to TaskFlow.
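
To make that concrete, here is a minimal sketch of the TaskFlow style,
where the work and its ordering are declared up front and only later
handed to an engine to run (the task bodies are placeholders, not real
Heat code):

    import taskflow.engines
    from taskflow import task
    from taskflow.patterns import linear_flow

    class CreateServer(task.Task):
        def execute(self):
            print('creating server')   # would call nova here

    class AttachVolume(task.Task):
        def execute(self):
            print('attaching volume')  # would call cinder here

    # Declare the work and its ordering ahead of time...
    flow = linear_flow.Flow('boot-server').add(CreateServer(),
                                               AttachVolume())

    # ...and only *later* ask an engine to actually run it. Nothing
    # above executes until this call.
    taskflow.engines.run(flow)

The control-flow lives in the flow object rather than on the
interpreter stack, which is exactly the shift that feels weird at
first.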

I also kind of feel like the Twisted folks answered a similar question
with inline callbacks and made things "easier" but more complex in
doing so. If I had a good answer I would give it to you though. :)
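
For anyone who hasn't seen it, a tiny sketch of that inlineCallbacks
style -- it puts the control-flow back on the interpreter stack, which
is exactly the comfort blanket I mean (fetch/store here are
hypothetical Deferred-returning callables):

    from twisted.internet import defer

    @defer.inlineCallbacks
    def fetch_and_store(fetch, store):
        # Each yield suspends the generator until the Deferred fires,
        # so this *reads* sequentially even though it is event-driven.
        data = yield fetch()
        yield store(data)
        defer.returnValue(len(data))

Easier to read, but the actual execution model is now hidden behind
the generator.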

> 
> I would agree that the jobboard[0] concept is different from the other parts of taskflow, but it could be useful here:
> 
> Basically, at its core it's an application of zookeeper where 'jobs' are posted to a directory (using sequenced nodes in zookeeper, so that ordering is retained). Entities then acquire ephemeral locks on those 'jobs' (locks that are released if the owner process disconnects or fails...) and work on the contents of that job (where the contents can be pretty much arbitrary). This creates a highly available job queue (queue-like due to the node sequencing[1]), and it sounds pretty similar to what zaqar could provide in theory (except the zookeeper one is proven, battle-hardened, works and exists...). But we should of course continue being scared of zookeeper, because, you know, who wants to use a tool where it would fit, haha (this is a joke).
> 
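
For reference, a rough sketch of that jobboard flow using taskflow's
documented jobs API (the ZooKeeper connection details and identifiers
here are assumptions):

    import contextlib

    from taskflow.jobs import backends as job_backends

    conf = {
        'board': 'zookeeper',
        'hosts': ['localhost:2181'],  # assumed local ensemble
        'path': '/heat/jobs',
    }

    with contextlib.closing(job_backends.fetch('heat-board', conf)) as board:
        board.connect()

        # Posting creates a sequenced znode, so ordering is retained.
        board.post('stack-1234-timeout', details={'stack_id': '1234'})

        # A worker claims a job (an ephemeral lock that evaporates if
        # the owner disconnects or dies), works it, then consumes it.
        for job in board.iterjobs(ensure_fresh=True, only_unclaimed=True):
            board.claim(job, 'engine-1')
            # ... act on job.details ...
            board.consume(job, 'engine-1')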

So ordering is a distraction from the task at hand. But the locks that
indicate liveness of the workers are very interesting to me. Since we
don't actually have ordering requirements on the front-end of the task
(we do on the completion of certain tasks, but we can use a DB for that),
I wonder if we can get the same effect with a durable queue that uses
a reliable messaging pattern where we don't ack until we're done. That
would achieve the goal of liveness.
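
Something like the following, with a plain AMQP client -- the queue
name, host, and work function are hypothetical; the point is
durable=True plus acking only after the work is done, so a dead
worker's message is redelivered to a live one:

    import pika

    def do_the_work(body):
        print('working on %r' % body)  # placeholder resource action

    conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = conn.channel()
    channel.queue_declare(queue='heat-tasks', durable=True)
    channel.basic_qos(prefetch_count=1)  # one unacked message at a time

    def on_message(ch, method, properties, body):
        do_the_work(body)
        # Ack only when done; if we die before this line, the broker
        # requeues the message for another worker -- that's liveness.
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue='heat-tasks', on_message_callback=on_message)
    channel.start_consuming()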

> [0] https://github.com/openstack/taskflow/blob/master/taskflow/jobs/jobboard.py#L25 
> 
> [1] http://zookeeper.apache.org/doc/trunk/zookeeperProgrammers.html#Sequence+Nodes+--+Unique+Naming
> 
> > 
> >> What was your metric for determining how much code change either approach would need (out of curiosity)?
> >> 
> >> Perhaps you should look at [2], although I'm unclear on what the desired functionality is here.
> >> 
> >> Do you want a single engine to transfer its work to another engine when it 'goes down'? If so, then the jobboard model + zookeeper inherently does this.
> >> 
> >> Or maybe you want something else? I'm probably confused, because you seem to be asking for resource timeouts + recovery from engine failure (which seems like a liveness issue and not a resource timeout one); those two things seem separable.
> >> 
> > 
> > I agree with you on this. It is definitely a liveness problem. The
> > resource timeout isn't something I've seen discussed before. We do have
> > a stack timeout, and we need to keep on honoring that, but we can do
> > that with a job that sleeps for the stack timeout if we have a liveness
> > guarantee that will resurrect the job (with the sleep shortened by the
> > time since stack-update-time) somewhere else if the original engine
> > can't complete the job.
> > 
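
The arithmetic for that resurrected job is straightforward; a sketch
with illustrative names:

    import time

    def remaining_timeout(stack_timeout, stack_updated_at, now=None):
        # A resurrected timeout job must not sleep the full stack
        # timeout again; it only sleeps whatever is left of it.
        now = time.time() if now is None else now
        return max(0.0, stack_timeout - (now - stack_updated_at))

    # e.g. a 3600s stack timeout whose operation started 1000s ago
    # leaves ~2600s for the resurrected job to sleep.
    print(remaining_timeout(3600, time.time() - 1000))
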
> >> [1] http://docs.openstack.org/developer/taskflow/jobs.html
> >> 
> >> [2] http://docs.openstack.org/developer/taskflow/examples.html#jobboard-producer-consumer-simple
> >> 
> >> On Nov 13, 2014, at 12:29 AM, Murugan, Visnusaran <visnusaran.murugan at hp.com> wrote:
> >> 
> >>> Hi all,
> >>> 
> >>> The Convergence-POC distributes stack operations by sending resource actions over RPC for any heat-engine to execute. The entire stack lifecycle will be controlled by worker/observer notifications. This distributed model has its own advantages and disadvantages.
> >>> 
> >>> Any stack operation has a timeout, and a single engine is responsible for it. If that engine goes down, the timeout is lost along with it, so the traditional fix is for the other engines to recreate the timeout from scratch. Also, a missed resource-action notification will be detected only when the stack operation's timeout fires.
> >>> 
> >>> To overcome this, we will need the following capabilities:
> >>> 1. Resource timeout (can be used for retry)
> >>> 2. Recovery from engine failure (loss of stack timeout, missed resource action notification)
> >>> 
> >>> 
> >>> Suggestions:
> >>> 1. Use a task queue like celery to host timeouts for both stacks and resources.
> >>> 2. Poll the database for engine failures and restart timers / retrigger resource retries. (IMHO this is the traditional approach, and it weighs heavy.)
> >>> 3. Migrate Heat to use TaskFlow. (Too much code change.)
> >>> 
> >>> I am not suggesting we use TaskFlow. Using celery would require very minimal code change (decorate the appropriate functions; see the sketch after this message).
> >>> 
> >>> 
> >>> Your thoughts.
> >>> 
> >>> -Vishnu
> >>> IRC: ckmvishnu
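
For concreteness, a minimal sketch of the celery approach suggested
above -- the broker URL and the timeout handler are assumptions, not
actual Heat code:

    from celery import Celery

    app = Celery('heat_timeouts', broker='amqp://guest@localhost//')

    def mark_stack_timed_out(stack_id):
        print('stack %s timed out' % stack_id)  # placeholder

    # acks_late means the message is acked only after the task body
    # finishes, so a dead worker's timeout is redelivered elsewhere.
    @app.task(acks_late=True)
    def stack_timeout(stack_id):
        mark_stack_timed_out(stack_id)

    # When a stack operation starts, schedule its timeout:
    # stack_timeout.apply_async(args=['stack-1234'], countdown=3600)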