[openstack-dev] [Heat][Summit] Input wanted - real world heat spec

Robert Collins robertc at robertcollins.net
Fri Apr 25 05:08:41 UTC 2014


On 25 April 2014 09:23, Zane Bitter <zbitter at redhat.com> wrote:

>>   - take a holistic view and fix the system's emergent properties by
>> using a different baseline architecture within it
>>   - ???
>>   - profit!
>
>
> Thanks for writing this up Rob. This is certainly a more ambitious scale of
> application to deploy than we ever envisioned in the early days of Heat ;)
> But I firmly believe that what is good for TripleO will be great for the
> rest of our users too. All of the observed issues mentioned are things we
> definitely want to address.
>
> I have a few questions about the specific architecture being proposed. It's
> not clear to me what you mean by "call-stack style" in referring to the
> current paradigm. Maybe you could elaborate on how the current style and the
> "convergence style" differ.

So, the call-stack style: we have an in-process data structure in the
heat engine which contains the traversal of the DAG. It's a bit awkward
to visualise because of the coroutine-style layer in there - but if
you squash that back it starts to look like a regular call stack:

frame  resource
0      root
1      root-A
2      root-A-B
3      root-A-B-C

(representing that we're bringing up C, which is a dep of B, which is a
dep of A, which hangs off the root).

The concurrency allowed by coroutines means this really is a tree of
call stacks - but as a style it has all the same characteristics:
 - code is called top-down
 - the thing being executed is live data in memory, and thus largely
untouchable from outside
 - the entire structure has to run to completion, or fail - it acts as
a single large 'procedure call'.
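
A rough sketch of that shape, with illustrative names only (and
ignoring the coroutine scheduler):

    def create_resource(resource, graph):
        # Recurse into dependencies first, so the frames stack up
        # root -> root-A -> root-A-B -> root-A-B-C as in the table above.
        for dep in graph.get(resource, []):
            create_resource(dep, graph)
        provision(resource)  # blocks (or yields) until this resource is ready

    def create_stack(graph):
        # The whole traversal is live Python frames: nothing outside this
        # call can inspect or redirect it, and it must finish or fail.
        create_resource('root', graph)

    def provision(resource):
        print('provisioning %s' % resource)

    create_stack({'root': ['A'], 'A': ['B'], 'B': ['C']})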

The style I'm proposing we use is one where:
 - code is called in response to events
 - we exit after taking the 'next step' in response to an event, so we
can be very responsive to changes in intent without requiring every
routine to support some form of early exit
 - we can stop executing at any arbitrary point, because we're running
small units at a time.
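
Very roughly, per node, something like this (illustrative names only,
not a real interface):

    def handle_event(db, event):
        # One small unit of work per event; all state lives in the
        # database, not in Python frames, so we can stop anywhere and
        # resume on any engine.
        node = db.load_node(event.node_id)      # desired + observed state
        if node.matches_desired_state():
            db.mark_complete(node)
            db.enqueue_ready_dependents(node)   # events for unblocked nodes
        elif node.timed_out():
            db.mark_failed(node)
        else:
            node.take_next_step()               # e.g. one API call or one poll
            db.save(node)                       # then exit; a later event resumes us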

> Specifically, I am not clear on whether 'convergence' means:
>  (a) Heat continues to respect the dependency graph but does not stop after
> one traversal, instead repeatedly processing it until (and even after) the
> stack is complete; or
>  (b) Heat ignores the dependency graph and just throws everything against
> the wall, repeating until it has all stuck.

Clint used (c), so I'll use (d).

d) Heat stops evaluating the whole graph and instead only evaluates
one node at a time before exiting. Further events (such as timeouts,
resources changing state, or the user requesting a change) trigger
Heat to evaluate a node.

> I also have doubts about the principle "Users should only need to intervene
> with a stack when there is no right action that Heat can take to deliver the
> current template+parameters". That sounds good in theory, but in practice
> it's very hard to know when there is a right action Heat can take and when
> there isn't. e.g. There are innumerable ways to create a template that can
> _never_ actually converge, and I don't believe there's a general way we can
> detect that, only the hard way: one error type at a time, for every single
> resource type. Offering users a way to control how and when that happens

I agree with the innumerable ways - that's a hard truth. For instance,
if Nova is sick, instances may never come up, and trying forever to
spawn something that can't spawn is pointless.

However, Nova instance spawn success rates in many clouds (e.g.
Rackspace and HP) are much less than 100% - treating a failed instance
spawn as an error is totally unrealistic. I contend that it's Heat's
job to 'do what needs to be done' to get that Nova instance, and if it
decides it cannot, then and only then to signal an error higher up
(which for e.g. a scaling group might be to not error *at all* but just
to try another one).

Hmm, let's try this another way:
 - 'failed-but-retryable' at a local scope is well defined but hard to
code for (because, as you say, we have to add types to catch one at a
time, per resource type).
 - 'failed' at a local scope is well defined - any exception we don't catch :)

BUT 'failed' at a higher level is not well defined: what does 'failed'
mean for a scaling group? I don't think it's reasonable that a single
non-retryable API error in one of the nested stacks should invalidate
the scaling group as a whole. Now, let's go back to considering the
local scope of a single resource - if we ask Nova for an instance, and
it goes BUILDING->SPAWNING->ERROR, is that 'retryable'? I don't
actually think that 'retry' on a per-error-code basis makes sense here:
what makes sense is 'did the resource become usable? No -> try harder
until the timeout. Still no? -> look holistically (e.g.
DELETION_POLICY, is it in a scaling group) to decide if it's
recoverable.'
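
To make that concrete, here's roughly the per-resource policy I have in
mind, written as a plain loop just to show the decision (names,
parameters and the timeout are illustrative, and the real engine would
take these steps one event at a time rather than looping in-process):

    import time

    def converge_server(nova, desired, timeout=3600, poll=10):
        # 'Usable or not' is the test, not the error code: keep replacing
        # failed builds until the deadline, then hand the failure to the
        # wider scope (scaling group, deletion policy) to decide what it
        # means.
        deadline = time.time() + timeout
        server = nova.servers.create(**desired)
        while time.time() < deadline:
            server = nova.servers.get(server.id)
            if server.status == 'ACTIVE':
                return server                        # usable -> converged
            if server.status == 'ERROR':
                nova.servers.delete(server.id)       # don't diagnose the error,
                server = nova.servers.create(**desired)  # just try again
            time.sleep(poll)
        raise RuntimeError('did not converge within %ss' % timeout)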

So generally speaking we can detect 'failed to converge within X hours'
- and if you examine existing prior art that works in production with
Nova - things like 'nodepool' - that's exactly what they do (the
timeout in nodepool is infinity - it just keeps trying).

> allows them to make the best decisions for their particular circumstances -
> and hopefully a future WFaaS like Mistral will make it easy to set up
> continuous monitoring for those who require it. (Not incidentally, it also
> gives cloud operators an opportunity to charge their users in proportion to
> their actual requirements.)

I agree that users need ways to control how and when things fail. I
don't believe that constrains our internals much, if at all - though
rich controls will be much easier to implement on top of some internal
designs than others.

>
>> This can be contrasted with many other existing attempts to design
>> solutions which relied on keeping the basic internals of heat as-is
>> and just tweaking things - an approach we don't believe will work -
>> the issues arise from the current architecture, not the quality of the
>> code (which is fine).
>
>
> Some of the ideas that have been proposed in the past:
>
> - Moving execution of operations on individual resources to a distributed
> execution system using taskflow. (This should address the scalability
> issue.)

That's included in the proposal we've written up. Totally agree that
it's a good, necessary and useful thing to do.

> - Updating the stored template in real time during stack updates - this is
> happening in Juno btw. (This will solve the problem of inability to ever
> recover from an update failure. In theory, it would also make it possible to
> interrupt a running update and make changes.)

Interrupting an update and then submitting a new change is disruptive
with the current paradigm - because the interrupt has to, well,
interrupt running code, and any actions pending the success of a poll
on the backing API for a resource will be cancelled - and we have no
systemic way today to come back idempotently and finish the remaining
work. Such a thing is part of the programming model we're proposing :).
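
Concretely, the shape I mean is a per-resource step that is safe to run
again at any time - something like this (illustrative names, not a
concrete design):

    def converge_step(client, db, resource_id):
        # Idempotent 'do the next bit': look at what actually exists
        # before acting, so re-running after an interrupt, an engine
        # crash or a new stack-update is harmless.
        desired = db.load_desired_state(resource_id)
        observed = (client.show(desired.external_id)
                    if desired.external_id else None)

        if observed is None:
            external_id = client.create(desired.properties)
            db.record_external_id(resource_id, external_id)  # remember what we made
        elif observed.properties != desired.properties:
            client.update(desired.external_id, desired.properties)
        # else: already converged - calling this again is a no-op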

Specific things the stored-template-update approach doesn't address:
 - dealing with a heat engine failure (users shouldn't need to know
that happened)
 - making it easy for the user to update the stack at high frequency -
'interrupt + stack-update again' is not a pleasant UX IMO. How can the
user tell it's safe to interrupt? Or, if it's always safe, why require
them to interrupt at all?

> - Implementing a 'stack converge' operation that the user can trigger to
> compare the actual state of the stack with the model and bring it back into
> spec.

As a user, this proposal horrified me. It told me I was going to be
woken up by monitoring at 4am to fix a down cloud when all I would be
doing is running 'heat stack converge'... or I could put that in cron.
It was the genesis of thinking more deeply about this and indirectly
led to the alternative proposal we have :). I don't have a user story
for 'I want my running stack to be different to the one I uploaded'. I
have plenty of stories of 'I want the stack I uploaded, always' -
recovering from STONITH, for instance.

> It would be interesting to see some analysis on exactly how these existing
> attempts fall down in trying to fulfil the goals, as well as the specific
> points at which the proposed implementation differs.

> Depending on the answers to the above questions, this proposal could be
> anything between a modest reworking of those existing ideas and a complete
> re-imagining of the entire concept of Heat. I'd very much like to find out
> where along that spectrum it lies :)

It's a modest reworking and synthesis of existing ideas, with some
real-world operator experience, healthy paranoia about system
reliability, and scaling concerns mixed in.

> BTW, it appears that the schedule you're suggesting involves assigning a
> bunch of people unfamiliar with the current code base and having them
> complete a ground-up rearchitecting of the whole engine, all within the Juno
> development cycle (about 3.5 months). This is simply not consistent with
> reality as I have observed it up to this point.

Juno is probably too aggressive to finish it all - but if we can kick
it off hard, we should see incremental benefits by J, and the whole
thing for K.

-Rob


-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud


