[openstack-dev] [Heat] Where to keep data about stack breakpoints?

Steven Hardy shardy at redhat.com
Mon Jan 12 22:39:05 UTC 2015


On Mon, Jan 12, 2015 at 05:10:47PM -0500, Zane Bitter wrote:
> On 12/01/15 13:05, Steven Hardy wrote:
> >>>I also had a chat with Steve Hardy and he suggested adding a STOPPED state
> >>>to the stack (this isn't in the spec). While not strictly necessary to
> >>>implement the spec, this would help people figure out that the stack has
> >>>reached a breakpoint instead of just waiting on a resource that takes a long
> >>>time to finish (the heat-engine log and event-list still show that a
> >>>breakpoint was reached but I'd like to have it in stack-list and
> >>>resource-list, too).
> >>>
> >>>It makes more sense to me to call it PAUSED (we're not completely stopping
> >>>the stack creation after all, just pausing it for a bit), I'll let Steve
> >>>explain why that's not the right choice :-).
> >So, I've not got strong opinions on the name, it's more the workflow:
> >
> >1. User triggers a stack create/update
> >2. Heat walks the graph, hits a breakpoint and stops.
> >3. Heat explicitly triggers continuation of the create/update
> 
> Did you mean the user rather than Heat for (3)?

Oops, yes I did.

> >My argument is that (3) is always a stack update, either a PUT or PATCH
> >update, e.g. we _are_ completely stopping stack creation, then a user can
> >choose to re-start it (either with the same or a different definition).
> 
> Hmmm, ok that's interesting. I have not been thinking of it that way. I've
> always thought of it like this:
> 
> http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/adding-lifecycle-hooks.html
> 
> (Incidentally, this suggests an implementation where the lifecycle hook is
> actually a resource - with its own API, naturally.)
> 
> So, if it's requested, before each operation we send out a notification
> (hopefully via Zaqar), and if a breakpoint is set that operation is not
> carried out until the user makes an API call acknowledging it.

I guess I was trying to keep it initially simpler than that, given that we
don't have any integration with a heat-user messaging system at present.

> >So, it _is_ really an end state, as a user might never choose to update
> >from the stopped state, in which case *_STOPPED makes more sense.
> 
> That makes a bit more sense now.
> 
> I think this is going to be really hard to implement though. Because while
> one branch of the graph stops, other branches have to continue as far as
> they can. At what point do you change the state of the stack?

True, this is a disadvantage of specifying a single breakpoint when there
may be parallel paths through the graph.

However, I was thinking we could just reuse our existing error path
implementation, so it needn't be hard to implement at all, e.g.

1. Stack action started where a resource has a breakpoint set
2. Stack.stack_task.resource_action checks if resource is a breakpoint
3. If a breakpoint is set, we raise an exception.ResourceFailure subclass
4. The normal error_wait_time is respected, e.g. currently in-progress
actions are given a chance to complete.

Basically, the only implementation work would be raising a special new type of
exception, which would enable a suitable message (and event) to be shown to
the user "Stack create aborted due to breakpoint on resource foo".
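
As a rough sketch of the error-path idea above (stand-in classes only, not
the real Heat code: ResourceFailure here is a simplified placeholder for
heat's exception class, and BreakpointReached, Resource.breakpoint and
create_stack are hypothetical names used for illustration). It walks the
resources sequentially, so it does not model parallel graph branches or
error_wait_time:

```python
class ResourceFailure(Exception):
    """Simplified stand-in for heat's exception.ResourceFailure."""

    def __init__(self, message, resource_name, action):
        super().__init__(message)
        self.resource_name = resource_name
        self.action = action


class BreakpointReached(ResourceFailure):
    """Hypothetical subclass raised when a resource has a breakpoint set."""

    def __init__(self, resource_name, action):
        msg = "Stack %s aborted due to breakpoint on resource %s" % (
            action.lower(), resource_name)
        super().__init__(msg, resource_name, action)


class Resource:
    """Toy resource; the 'breakpoint' attribute is an assumption."""

    def __init__(self, name, breakpoint=False):
        self.name = name
        self.breakpoint = breakpoint
        self.state = None

    def create(self):
        self.state = "CREATE_COMPLETE"


def resource_action(resource, action="CREATE"):
    # Check for a breakpoint before doing any work on the resource.
    if resource.breakpoint:
        raise BreakpointReached(resource.name, action)
    resource.create()


def create_stack(resources):
    """Return (stack_state, reason), reusing the ordinary failure path."""
    try:
        for res in resources:
            resource_action(res)
    except ResourceFailure as exc:
        # A breakpoint is handled exactly like any other resource failure.
        return "CREATE_FAILED", str(exc)
    return "CREATE_COMPLETE", ""
```

The point of the sketch is that the caller needs no breakpoint-specific
branch: catching the base exception is enough, and only the message differs.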

Pre/post breakpoint actions/messaging could be added later via a similar
method to the stack-level lifecycle plugin hooks.

If folks are happy with e.g. CREATE_FAILED as a post-breakpoint state, this
could simplify things a lot, as we'd not need any new state or much new
code at all?

> >Paused implies the same action as the PATCH update, only we trigger
> >continuation of the operation from the point we reached via some sort of
> >user signal.
> >
> >If we actually pause an in-progress action via the scheduler, we'd have to
> >start worrying about stuff like token expiry, hitting timeouts, resilience
> >to engine restarts, etc, etc.  So forcing an explicit update seems simpler
> >to me.
> 
> Yes, token expiry and stack timeouts are annoying things we'd have to deal
> with. (Resilience to engine restarts is not affected though.) However, I'm
> not sure your model is simpler, and in particular it sounds much harder to
> implement in the convergence architecture.

So you're advocating keeping the scheduler spinning, until a user sends a
signal to the resource to clear the breakpoint?

I don't see why we couldn't do both, with an "abort_on_breakpoint" flag or
something, but I'd be interested in further understanding how the
error-path approach outlined above would be incompatible with convergence.
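
To make the "do both" idea concrete, here is a minimal sketch of how such a
flag might select between the two behaviours. Everything in it is an
assumption for illustration: Breakpoint, BreakpointAbort and
handle_breakpoint are hypothetical names, and threading.Event is only a
stand-in for however the scheduler would actually wait for a user signal:

```python
import threading


class BreakpointAbort(Exception):
    """Hypothetical: raised when the stack should abort at a breakpoint."""


class Breakpoint:
    """Toy breakpoint that can be cleared by a user signal."""

    def __init__(self):
        self._cleared = threading.Event()

    def clear(self):
        # Would be called when the user signals the resource to continue.
        self._cleared.set()

    def wait(self, timeout=None):
        # Returns True if cleared, False if the timeout expired first.
        return self._cleared.wait(timeout)


def handle_breakpoint(bp, abort_on_breakpoint, timeout=None):
    if abort_on_breakpoint:
        # Error-path model: stop now, user restarts via a stack update.
        raise BreakpointAbort("aborted at breakpoint")
    # Pause model: keep waiting until the breakpoint is cleared.
    if not bp.wait(timeout):
        raise TimeoutError("breakpoint not cleared before timeout")
    return "continued"
```

The timeout branch is where the stack-timeout and token-expiry concerns
mentioned earlier would surface in the pause model.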

Thanks,

Steve


