[openstack-dev] [Heat] Where to keep data about stack breakpoints?

Ton Ngo ton at us.ibm.com
Tue Jan 13 00:15:59 UTC 2015

    I was also thinking of using the environment to hold the breakpoint,
similarly to parameters.  The CLI and API would process it just like

   As for the state of a stack hitting the breakpoint, leveraging the
FAILED state seems to be sufficient; we just need to add enough information
to differentiate between a failed resource and a resource at a breakpoint.
Something like emitting an event or a message should be enough to make that
distinction.   Debuggers for native programs typically do the same thing,
leveraging the exception handling in the OS by inserting an artificial
error at the breakpoint to force a program to stop.  The debugger then
just remembers the addresses of these artificial errors to decode the
state of the stopped program.
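
A minimal sketch of how an event could carry that distinction, assuming a
simplified event record. The field names and the "breakpoint" reason marker
are hypothetical and are not Heat's actual event schema:

```python
# Hypothetical marker stored in the event reason; not a Heat constant.
BREAKPOINT_MARKER = "breakpoint"


def make_event(resource_name, status, reason):
    """Build a simplified event record (illustrative fields only)."""
    return {
        "resource_name": resource_name,
        "resource_status": status,
        "reason": reason,
    }


def classify_failures(events):
    """Split FAILED events into breakpoint stops and genuine failures."""
    at_breakpoint, failed = [], []
    for event in events:
        if event["resource_status"] != "FAILED":
            continue
        if event["reason"] == BREAKPOINT_MARKER:
            at_breakpoint.append(event["resource_name"])
        else:
            failed.append(event["resource_name"])
    return at_breakpoint, failed


events = [
    make_event("network", "COMPLETE", ""),
    make_event("server", "FAILED", BREAKPOINT_MARKER),
    make_event("volume", "FAILED", "quota exceeded"),
]
stopped, broken = classify_failures(events)
```

Here both resources share the FAILED status, and only the recorded reason
tells the user which one is waiting at a breakpoint.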

   As for the workflow, instead of spinning in the scheduler waiting for a
signal, I was thinking of moving the stack off the engine as a failed
stack. So this would be an end-state for the stack as Steve suggested, but
without adding a new stack state.   Again, this is similar to how a program
being debugged is handled:  it is moved off the ready queue and its
context is preserved for examination.  This seems to keep the
implementation simple and we don't have to worry about timeout,
performance, etc.  Continuing from the breakpoint should then be similar to
stack-update on a failed stack.  We do need some additional handling, such
as allowing in-progress resources to run to completion instead of aborting.

    For the parallel paths in a template, I am thinking about these options:
1. Stop after all the current in-progress resources complete, but do not
start any new resources even if there is no dependency.  This should be
easier to implement, but the state of the stack would be non-deterministic.
2. Stop only the paths with the breakpoint, continue all other parallel
paths to completion.  This seems harder to implement, but the stack would
be in a deterministic state and easier for the user to reason about.
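
To illustrate why option 2 gives a deterministic result, here is a rough
sketch over a toy dependency graph. It assumes a breakpoint stops a resource
before it is created; the resource names and helper functions are made up
for the example and are not part of Heat:

```python
from collections import deque


def blocked_by_breakpoints(deps, breakpoints):
    """Return the breakpoint resources plus everything that depends on them.

    deps maps each resource name to the set of resources it requires.
    """
    # Invert the graph: resource -> resources that depend on it.
    dependents = {name: set() for name in deps}
    for name, requires in deps.items():
        for required in requires:
            dependents[required].add(name)

    blocked = set(breakpoints)
    queue = deque(breakpoints)
    while queue:
        current = queue.popleft()
        for child in dependents[current]:
            if child not in blocked:
                blocked.add(child)
                queue.append(child)
    return blocked


def completed_under_option_2(deps, breakpoints):
    """Resources that still run to completion when only blocked paths stop."""
    return set(deps) - blocked_by_breakpoints(deps, breakpoints)


deps = {
    "network": set(),
    "volume": set(),
    "server": {"network"},
    "attachment": {"server", "volume"},
    "floating_ip": {"network"},
}
done = completed_under_option_2(deps, {"server"})
```

With a breakpoint on "server", the completed set depends only on the graph
and not on timing, whereas under option 1 it would depend on which resources
happened to be in progress when the breakpoint was hit.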

   To be compatible with convergence, I had suggested to Clint earlier to
add a mode where the convergence engine does not attempt to retry so the
user can debug, and I believe this was added to the blueprint.


From:	Steven Hardy <shardy at redhat.com>
To:	"OpenStack Development Mailing List (not for usage questions)"
            <openstack-dev at lists.openstack.org>
Date:	01/12/2015 02:40 PM
Subject:	Re: [openstack-dev] [Heat] Where to keep data about stack

On Mon, Jan 12, 2015 at 05:10:47PM -0500, Zane Bitter wrote:
> On 12/01/15 13:05, Steven Hardy wrote:
> >>>I also had a chat with Steve Hardy and he suggested adding a STOPPED
> >>>to the stack (this isn't in the spec). While not strictly necessary to
> >>>implement the spec, this would help people figure out that the stack
> >>>reached a breakpoint instead of just waiting on a resource that takes a long
> >>>time to finish (the heat-engine log and event-list still show that a
> >>>breakpoint was reached but I'd like to have it in stack-list and
> >>>resource-list, too).
> >>>
> >>>It makes more sense to me to call it PAUSED (we're not completely
> >>>the stack creation after all, just pausing it for a bit), I'll let
> >>>explain why that's not the right choice :-).
> >So, I've not got strong opinions on the name, it's more the workflow:
> >
> >1. User triggers a stack create/update
> >2. Heat walks the graph, hits a breakpoint and stops.
> >3. Heat explicitly triggers continuation of the create/update
> Did you mean the user rather than Heat for (3)?

Oops, yes I did.

> >My argument is that (3) is always a stack update, either a PUT or PATCH
> >update, e.g. we _are_ completely stopping stack creation, then a user can
> >choose to re-start it (either with the same or a different definition).
> Hmmm, ok that's interesting. I have not been thinking of it that way.
> always thought of it like this:

> (Incidentally, this suggests an implementation where the lifecycle hook
> actually a resource - with its own API, naturally.)
> So, if it's requested, before each operation we send out a notification
> (hopefully via Zaqar), and if a breakpoint is set that operation is not
> carried out until the user makes an API call acknowledging it.

I guess I was trying to keep it initially simpler than that, given that we
don't have any integration with a heat-user messaging system at present.

> >So, it _is_ really an end state, as a user might never choose to update
> >from the stopped state, in which case *_STOPPED makes more sense.
> That makes a bit more sense now.
> I think this is going to be really hard to implement though. Because
> one branch of the graph stops, other branches have to continue as far as
> they can. At what point do you change the state of the stack?

True, this is a disadvantage of specifying a single breakpoint when there
may be parallel paths through the graph.

However, I was thinking we could just reuse our existing error path
implementation, so it needn't be hard to implement at all, e.g.

1. Stack action started where a resource has a breakpoint set
2. Stack.stack_task.resource_action checks if the resource has a breakpoint
3. If a breakpoint is set, we raise an exception.ResourceFailure subclass
4. The normal error_wait_time is respected, e.g. currently in-progress
actions are given a chance to complete.

Basically, the only implementation would be raising a special new type of
exception, which would enable a suitable message (and event) to be shown to
the user "Stack create aborted due to breakpoint on resource foo".
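
The error-path idea above could be sketched roughly as follows. This is a
self-contained illustration, not Heat code: ResourceFailure here is a local
stand-in for exception.ResourceFailure, and BreakpointReached, resource_action
and run_stack are hypothetical names. It also ignores error_wait_time and
parallel execution.

```python
class ResourceFailure(Exception):
    """Local stand-in for Heat's exception.ResourceFailure."""


class BreakpointReached(ResourceFailure):
    """Hypothetical subclass raised when a resource has a breakpoint set."""

    def __init__(self, resource_name, action):
        self.resource_name = resource_name
        self.action = action
        super().__init__(
            "Stack %s aborted due to breakpoint on resource %s"
            % (action.lower(), resource_name))


def resource_action(resource_name, action, breakpoints, handler):
    """Run the handler for one resource unless a breakpoint is set on it."""
    if resource_name in breakpoints:
        raise BreakpointReached(resource_name, action)
    handler(resource_name)


def run_stack(resources, action, breakpoints, handler):
    """Process resources in order and reuse the normal failure path."""
    try:
        for name in resources:
            resource_action(name, action, breakpoints, handler)
    except ResourceFailure as exc:
        # A breakpoint ends up in the same *_FAILED state as a real error;
        # only the message (and exception type) tells them apart.
        return action + "_FAILED", str(exc)
    return action + "_COMPLETE", ""


created = []
status, reason = run_stack(
    ["network", "foo", "bar"], "CREATE", {"foo"}, created.append)
```

Because the new exception is a subclass, any existing handler written for
ResourceFailure catches it unchanged, which is the point of reusing the
error path.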

Pre/post breakpoint actions/messaging could be added later via a similar
method to the stack-level lifecycle plugin hooks.

If folks are happy with e.g. CREATE_FAILED as a post-breakpoint state, this
could simplify things a lot, as we'd not need any new state or much new
code at all?

> >Paused implies the same action as the PATCH update, only we trigger
> >continuation of the operation from the point we reached via some sort of
> >user signal.
> >
> >If we actually pause an in-progress action via the scheduler, we'd have
> >start worrying about stuff like token expiry, hitting timeouts,
> >to engine restarts, etc, etc.  So forcing an explicit update seems
> >to me.
> Yes, token expiry and stack timeouts are annoying things we'd have to
> with. (Resilience to engine restarts is not affected though.) However,
> not sure your model is simpler, and in particular it sounds much harder
> implement in the convergence architecture.

So you're advocating keeping the scheduler spinning, until a user sends a
signal to the resource to clear the breakpoint?

I don't see why we couldn't do both, have an "abort_on_breakpoint" flag or
something, but I'd be interested in further understanding how the
error-path approach outlined above would be incompatible with convergence.


