[openstack-dev] [Heat] Using Job Queues for timeout ops
clint at fewbar.com
Thu Nov 13 14:58:54 UTC 2014
Excerpts from Zane Bitter's message of 2014-11-13 05:54:03 -0800:
> On 13/11/14 03:29, Murugan, Visnusaran wrote:
> > Hi all,
> > Convergence-POC distributes stack operations by sending resource actions
> > over RPC for any heat-engine to execute. Entire stack lifecycle will be
> > controlled by worker/observer notifications. This distributed model has
> > its own advantages and disadvantages.
> > Any stack operation has a timeout, and a single engine will be
> > responsible for it. If that engine goes down, the timeout is lost along
> > with it. The traditional fix is for other engines to recreate the
> > timeout from scratch. Also, a missed resource-action notification will
> > be detected only when the stack operation timeout fires.
> > To overcome this, we will need the following capability:
> > 1. Resource timeout (can be used for retry)
> I don't believe this is strictly needed for phase 1 (essentially we
> don't have it now, so nothing gets worse).
We do have a stack timeout, and it stands to reason that we won't have a
single box with a timeout greenthread after this, so a strategy is needed
for handling that timeout in the distributed model.
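For the sake of discussion, one shape such a strategy could take is to
persist the deadline in the database rather than in one engine's
greenthread, so that any surviving engine can notice an expired stack.
This is only a sketch; the table and function names here are made up, not
Heat's actual schema:

```python
import sqlite3
import time

# Hypothetical sketch: store each stack's timeout deadline in the DB so
# that no single engine process owns the timer.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE stack_timeout (
    stack_id TEXT PRIMARY KEY,
    deadline REAL NOT NULL)""")

def set_timeout(stack_id, seconds):
    """Record when this stack operation should be considered timed out."""
    db.execute("INSERT OR REPLACE INTO stack_timeout VALUES (?, ?)",
               (stack_id, time.time() + seconds))

def expired_stacks(now=None):
    """Any engine can poll this periodically; if the engine that started
    the operation dies, another engine still sees the expired deadline."""
    now = time.time() if now is None else now
    rows = db.execute("SELECT stack_id FROM stack_timeout WHERE deadline <= ?",
                      (now,)).fetchall()
    return [r[0] for r in rows]

set_timeout("stack-a", 60)
set_timeout("stack-b", -1)   # already past due
print(expired_stacks())      # -> ['stack-b']
```

The cost is a periodic DB poll, which is essentially option #2 from the
original mail.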
> For phase 2, yes, we'll want it. One thing we haven't discussed much is
> that if we used Zaqar for this then the observer could claim a message
> but not acknowledge it until it had processed it, so we could have
> guaranteed delivery.
Frankly, if oslo.messaging doesn't support reliable delivery then we
need to add it. Zaqar should have nothing to do with this and is, IMO, a
poor choice at this stage, though I like the idea of using it in the
future so that we can make Heat more of an outside-the-cloud app.
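Whatever transport ends up carrying these notifications, the
claim-then-acknowledge pattern Zane describes is roughly the following.
This is a toy in-memory illustration, not the Zaqar or oslo.messaging
API; all names are invented:

```python
import time

# Toy queue illustrating claim/ack at-least-once delivery: a claim hands
# out a message without deleting it, and only a post-processing ack
# removes it. If the worker dies, the claim lapses and the message is
# re-delivered to another worker.
class ClaimQueue:
    def __init__(self):
        self._messages = {}   # msg_id -> [body, claimed_until]
        self._next_id = 0

    def post(self, body):
        self._messages[self._next_id] = [body, 0.0]
        self._next_id += 1

    def claim(self, ttl):
        """Return messages that are unclaimed or whose claim expired."""
        now = time.time()
        claimed = []
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + ttl
                claimed.append((msg_id, entry[0]))
        return claimed

    def ack(self, msg_id):
        """Delete the message only after it was successfully processed."""
        self._messages.pop(msg_id, None)

q = ClaimQueue()
q.post("resource-action-complete")
for msg_id, body in q.claim(ttl=30):
    # ... process the notification, then acknowledge ...
    q.ack(msg_id)
print(len(q._messages))   # -> 0
```

The key property is that a crash between claim and ack loses nothing:
the observer notification survives until some engine finishes processing
it.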
> > 2. Recover from engine failure (loss of stack timeout, resource-action
> > notification)
> > Suggestion:
> > 1. Use a task queue like Celery to host timeouts for both stack and resource.
> I believe Celery is more or less a non-starter as an OpenStack
> dependency because it uses Kombu directly to talk to the queue, vs.
> oslo.messaging which is an abstraction layer over Kombu, Qpid, ZeroMQ
> and maybe others in the future. i.e. requiring Celery means that some
> users would be forced to install Rabbit for the first time.
> One option would be to fork Celery and replace Kombu with oslo.messaging
> as its abstraction layer. Good luck getting that maintained though,
> since Celery _invented_ Kombu to be its abstraction layer.
A slight side point here: Kombu supports Qpid and ZeroMQ. Oslo.messaging
is more about having a unified API than a set of magic backends. It
actually boggles my mind why we didn't just use kombu (cue 20 reactions
with people saying it wasn't EXACTLY right), but I think we're committed
to oslo.messaging now. Anyway, celery would need no such refactor, as
kombu would be able to access the same bus as everything else just fine.
> > 2. Poll the database for engine failures and restart timers / retrigger
> > resource retry (IMHO: this is the traditional approach, and it weighs heavy)
> > 3. Migrate Heat to use TaskFlow. (Too many code changes)
> If it's just handling timed triggers (maybe this is closer to #2) and
> not migrating the whole code base, then I don't see why it would be a
> big change (or even a change at all - it's basically new functionality).
> I'm not sure if TaskFlow has something like this already. If not we
> could also look at what Mistral is doing with timed tasks and see if we
> could spin some of it out into an Oslo library.
I feel like it boils down to something running periodically checking for
scheduled tasks that are due to run but have not run yet. I wonder if we
can actually look at Ironic for how they do this, because Ironic polls
power state of machines constantly, and uses a hash ring to make sure
only one conductor is polling any one machine at a time. If we broke
stacks up into a hash ring like that for the purpose of singleton tasks
like timeout checking, that might work out nicely.