[openstack-dev] [Heat] Where to keep data about stack breakpoints?

Tomas Sedovic tsedovic at redhat.com
Tue Jan 13 17:18:25 UTC 2015


On 01/13/2015 01:15 AM, Ton Ngo wrote:
>      I was also thinking of using the environment to hold the breakpoint,
> similarly to parameters.  The CLI and API would process it just like
> parameters.
>
>     As for the state of a stack hitting the breakpoint, leveraging the
> FAILED state seems to be sufficient; we just need to add enough information
> to differentiate between a failed resource and a resource at a breakpoint.
> Something like emitting an event or a message should be enough to make that
> distinction.   Debuggers for native programs typically do the same thing,
> leveraging the exception handling in the OS by inserting an artificial
> error at the breakpoint to force a program to stop.  Then the debugger
> just remembers the addresses of these artificial errors to decode the
> state of the stopped program.
>
>     As for the workflow, instead of spinning in the scheduler waiting for a
> signal, I was thinking of moving the stack off the engine as a failed
> stack. So this would be an end-state for the stack as Steve suggested, but
> without adding a new stack state.   Again, this is similar to how a program
> being debugged is handled:  it is moved off the ready queue and its
> context is preserved for examination.  This seems to keep the
> implementation simple and we don't have to worry about timeout,
> performance, etc.  Continuing from the breakpoint then should be similar to
> stack-update on a failed stack.  We do need some additional handling, such
> as allowing in-progress resources to run to completion instead of aborting.
>
>      For the parallel paths in a template, I am thinking about these
> alternatives:
> 1. Stop after all the current in-progress resources complete, but do not
> start any new resources even if there is no dependency.  This should be
> easier to implement, but the state of the stack would be non-deterministic.
> 2. Stop only the paths with the breakpoint, continue all other parallel
> paths to completion.  This seems harder to implement, but the stack would
> be in a deterministic state and easier for the user to reason about.
>
>     To be compatible with convergence, I had suggested to Clint earlier to
> add a mode where the convergence engine does not attempt to retry so the
> user can debug, and I believe this was added to the blueprint.
>
> Ton,


Regarding the spinning scheduler, I get the concerns about token expiry
and so on, but it is *super simple* to implement.

Literally a while loop that yields. Two lines of code.

And we don't have to change anything in the scheduler or the way we
handle stacks. Heat already knows how to handle this situation.
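
To make the "while loop that yields" concrete, here is a minimal sketch. It
is not the actual patch: `wait_at_breakpoint`, `is_cleared` and the toy
`drive` function are hypothetical names that only stand in for a
generator-based task and whatever steps it, and the polling sequence
simulates a user clearing the breakpoint.

```python
def wait_at_breakpoint(is_cleared):
    """Yield control back to the caller until the breakpoint is cleared."""
    while not is_cleared():
        yield


def drive(task, max_steps=1000):
    """Toy stand-in for a scheduler: step the task until it finishes."""
    steps = 0
    for _ in task:
        steps += 1
        if steps >= max_steps:
            raise RuntimeError("breakpoint was never cleared")
    return steps


# Simulate a user clearing the breakpoint on the fourth poll.
polls = iter([False, False, False, True])
steps_waited = drive(wait_at_breakpoint(lambda: next(polls)))
```

The waiting task does no work of its own; it hands control back on every
iteration, which is why nothing else has to change around it. The
`max_steps` guard is only there so the toy driver cannot spin forever.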

Can we start with that implementation (because it's simple and correct) 
and then take it from there? Assuming we can stick to the same API/UI, 
we should be able to change it later when we've documented issues with 
the current approach.


As for parallel execution, I definitely prefer the deterministic 
approach: stop on the breakpoint and everything that depends on it, but 
resolve everything else that you can.

Again, this is trivially handled by Heat already (my patch has no 
special handling for this case). If you want to pause everything, you 
can always set up more breakpoints and advance them either manually or 
all at once with the (to be implemented) stepping functionality.
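
As an illustration of the deterministic behaviour described above, here is a
small sketch that classifies resources in an acyclic dependency graph. It is
not how Heat implements it: `classify`, the state names and the example
resource names are all assumptions made for the illustration.

```python
def classify(deps, breakpoints):
    """Return the final state of each resource.

    deps maps a resource name to the set of resources it requires
    (assumed acyclic). A resource with a breakpoint is PAUSED, anything
    that depends on a non-complete resource is WAITING, and everything
    else runs to COMPLETE.
    """
    state = {}

    def visit(name):
        if name in state:
            return state[name]
        required = [visit(req) for req in sorted(deps[name])]
        if any(s != "COMPLETE" for s in required):
            state[name] = "WAITING"
        elif name in breakpoints:
            state[name] = "PAUSED"
        else:
            state[name] = "COMPLETE"
        return state[name]

    for name in deps:
        visit(name)
    return state


deps = {
    "network": set(),
    "volume": set(),
    "server": {"network"},
    "dns": {"network"},
    "attachment": {"server", "volume"},
}
paused = classify(deps, breakpoints={"server"})
resumed = classify(deps, breakpoints=set())
```

With a breakpoint on `server`, only `server` and `attachment` are held
back; `network`, `volume` and `dns` still complete. Adding more names to
`breakpoints` is the equivalent of pausing more of the stack.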

>
>
>
>
> From:	Steven Hardy <shardy at redhat.com>
> To:	"OpenStack Development Mailing List (not for usage questions)"
>              <openstack-dev at lists.openstack.org>
> Date:	01/12/2015 02:40 PM
> Subject:	Re: [openstack-dev] [Heat] Where to keep data about stack
>              breakpoints?
>
>
>
> On Mon, Jan 12, 2015 at 05:10:47PM -0500, Zane Bitter wrote:
>> On 12/01/15 13:05, Steven Hardy wrote:
>>>>> I also had a chat with Steve Hardy and he suggested adding a STOPPED
>>>>> state to the stack (this isn't in the spec). While not strictly
>>>>> necessary to implement the spec, this would help people figure out
>>>>> that the stack has reached a breakpoint instead of just waiting on a
>>>>> resource that takes a long time to finish (the heat-engine log and
>>>>> event-list still show that a breakpoint was reached but I'd like to
>>>>> have it in stack-list and resource-list, too).
>>>>>
>>>>> It makes more sense to me to call it PAUSED (we're not completely
>>>>> stopping the stack creation after all, just pausing it for a bit),
>>>>> I'll let Steve explain why that's not the right choice :-).
>>> So, I've not got strong opinions on the name, it's more the workflow:
>>>
>>> 1. User triggers a stack create/update
>>> 2. Heat walks the graph, hits a breakpoint and stops.
>>> 3. Heat explicitly triggers continuation of the create/update
>>
>> Did you mean the user rather than Heat for (3)?
>
> Oops, yes I did.
>
>>> My argument is that (3) is always a stack update, either a PUT or PATCH
>>> update, e.g. we _are_ completely stopping stack creation, then a user can
>>> choose to re-start it (either with the same or a different definition).
>>
>> Hmmm, ok that's interesting. I have not been thinking of it that way.
>> I've always thought of it like this:
>>
>> http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/adding-lifecycle-hooks.html
>>
>> (Incidentally, this suggests an implementation where the lifecycle hook
>> is actually a resource - with its own API, naturally.)
>>
>> So, if it's requested, before each operation we send out a notification
>> (hopefully via Zaqar), and if a breakpoint is set that operation is not
>> carried out until the user makes an API call acknowledging it.
>
> I guess I was trying to keep it initially simpler than that, given that we
> don't have any integration with a heat-user messaging system at present.
>
>>> So, it _is_ really an end state, as a user might never choose to update
>>> from the stopped state, in which case *_STOPPED makes more sense.
>>
>> That makes a bit more sense now.
>>
>> I think this is going to be really hard to implement though. Because
>> while one branch of the graph stops, other branches have to continue as
>> far as they can. At what point do you change the state of the stack?
>
> True, this is a disadvantage of specifying a single breakpoint when there
> may be parallel paths through the graph.
>
> However, I was thinking we could just reuse our existing error path
> implementation, so it needn't be hard to implement at all, e.g.
>
> 1. Stack action started where a resource has a breakpoint set
> 2. Stack.stack_task.resource_action checks if resource is a breakpoint
> 3. If a breakpoint is set, we raise an exception.ResourceFailure subclass
> 4. The normal error_wait_time is respected, e.g. currently in-progress
> actions are given a chance to complete.
>
> Basically, the only implementation would be raising a special new type of
> exception, which would enable a suitable message (and event) to be shown to
> the user "Stack create aborted due to breakpoint on resource foo".
>
> Pre/post breakpoint actions/messaging could be added later via a similar
> method to the stack-level lifecycle plugin hooks.
>
> If folks are happy with e.g CREATE_FAILED as a post-breakpoint state, this
> could simplify things a lot, as we'd not need any new state or much new
> code at all?
>
>>> Paused implies the same action as the PATCH update, only we trigger
>>> continuation of the operation from the point we reached via some sort of
>>> user signal.
>>>
>>> If we actually pause an in-progress action via the scheduler, we'd have
>>> to start worrying about stuff like token expiry, hitting timeouts,
>>> resilience to engine restarts, etc, etc.  So forcing an explicit update
>>> seems simpler to me.
>>
>> Yes, token expiry and stack timeouts are annoying things we'd have to
>> deal with. (Resilience to engine restarts is not affected though.)
>> However, I'm not sure your model is simpler, and in particular it sounds
>> much harder to implement in the convergence architecture.
>
> So you're advocating keeping the scheduler spinning, until a user sends a
> signal to the resource to clear the breakpoint?
>
> I don't see why we couldn't do both and have an "abort_on_breakpoint" flag
> or something, but I'd be interested in further understanding how the
> error-path approach outlined above would be incompatible with convergence.
>
> Thanks,
>
> Steve
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



