[openstack-dev] [heat] convergence cancel messages

Zane Bitter zbitter at redhat.com
Tue Apr 19 16:00:36 UTC 2016


On 17/04/16 00:44, Anant Patil wrote:
>         I think it is a good idea, but I see that a resource can be marked
>         unhealthy only after it is done.
>
>
>     Currently, yes. The idea would be to change that so that if it finds
>     the resource IN_PROGRESS then it kills the thread and makes sure the
>     resource is in a FAILED state. I
>
>
> Move the resource to CHECK_FAILED?

I'd say that if killing the thread gets it to UPDATE_FAILED then Mission 
Accomplished, but obviously we'd have to check for races and make sure 
we move it to CHECK_FAILED if the update completes successfully.
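
A minimal sketch of that race handling, assuming resource state is an (action, status) pair. The function name is hypothetical and is not part of Heat:

```python
FAILED = "FAILED"


def state_after_mark_unhealthy(action, status):
    """Return the (action, status) the resource should end up in.

    `action` and `status` are the values observed once the attempt to
    kill the worker thread has finished. Hypothetical helper.
    """
    if (action, status) == ("UPDATE", FAILED):
        # Killing the thread already failed the update: nothing more to do.
        return ("UPDATE", FAILED)
    # The update won the race (e.g. UPDATE_COMPLETE) or nothing was
    # running: force CHECK_FAILED so the resource still ends up failed.
    return ("CHECK", FAILED)
```

Either way the resource ends in a FAILED status, which is the point of marking it unhealthy.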

>     The trick would be if the stack update is still running and the
>     resource is currently IN_PROGRESS to make sure that we fail the
>     whole stack update (rolling back if the user has enabled that).
>
>
> IMO, we can probably use the cancel command to do this, because when you
> are marking a resource as unhealthy, you are
> cancelling any action running on that resource. Would the following be ok?
> (1) stack-cancel-update <stack_id> will cancel the update, mark
> cancelled resources failed and rollback (existing stuff)
> (2) stack-cancel-update <stack_id> --no-rollback will just cancel the
> update and mark cancelled resources as failed
> (3) stack-cancel-update <stack_id> <resource_id> ... <resource_id> Just
> stop the action on given resources, mark as CHECK_FAILED, don't do
> anything else. The stack won't progress further. Other resources running
> while cancel-update will complete.

None of those solve the use case I actually care about, which is "don't 
start any more resource updates, but don't mark the ones currently 
in-progress as failed either, and don't roll back". That would be a huge 
help in TripleO. We need a way to be able to stop updates that 
guarantees not unnecessarily destroying any part of the existing stack, 
and we need that to be the default.

(We sort-of have the rollback version of this; it's equivalent to a 
stack update with the previous template/environment. But we need to make 
it easier and decouple it from the rollback IMHO.)

So one way to do this would be:

(1) stack-cancel-update <stack_id> will start another update using the 
previous template/environment. We'll start rolling back; in-progress 
resources will be allowed to complete normally.
(2) stack-cancel-update <stack_id> --no-rollback will set the 
traversal_id to None so no further resources will be updated; 
in-progress resources will be allowed to complete normally.
(3) stack-cancel-update <stack_id> --stop-in-progress will stop the 
traversal, kill any running threads (marking cancelled resources 
failed) and roll back.
(4) stack-cancel-update <stack_id> --stop-in-progress --no-rollback will 
just stop the traversal and kill any running threads (marking 
cancelled resources failed).
(5) stack-cancel-update <stack_id> --stop-in-progress <resource_id> ... 
<resource_id> Just stop the action on given resources, mark as 
UPDATE_FAILED, don't do anything else. The stack won't progress further. 
Other resources that are in progress when the cancel-update is issued 
will complete.

That would cover all the use cases. Some problems with it are:
- It's way complicated. Lots of options.
- Those options don't translate well to legacy (pre-convergence) stacks 
using the same client. e.g. there is now a non-default 
--stop-in-progress option, but on legacy stacks we always stop in-progress.
- Options don't commute. When you specify resources with the 
--stop-in-progress flag it never rolls back, even though you haven't set 
the --no-rollback flag.
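
To make the "options don't commute" problem concrete, here is a rough model of the five variants above. It is purely illustrative: plan_cancel and its return values are made up for this sketch and do not correspond to any real client or engine code.

```python
def plan_cancel(stop_in_progress=False, no_rollback=False, resources=()):
    """Return the effect of one stack-cancel-update invocation.

    Illustrative only: models options (1)-(5) above, not a real client.
    """
    if resources:
        if not stop_in_progress:
            # Assumption: the proposal does not define this combination.
            raise ValueError("resource ids require --stop-in-progress")
        # Option (5): stop just the named resources and never roll back,
        # whether or not --no-rollback was given.
        return {"kill": "listed", "resources": tuple(resources),
                "rollback": False}
    # Options (1)-(4): here the two flags act independently.
    return {
        "kill": "all" if stop_in_progress else "none",
        "resources": (),
        "rollback": not no_rollback,
    }
```

In this model plan_cancel(stop_in_progress=True) rolls back, but adding a resource id to the same call turns the rollback off without --no-rollback ever being passed.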

An alternative would be to just drop (3) and (4), and maybe rename (5). 
I'd be OK with that:

(1) stack-cancel-update <stack_id> will start another update using the 
previous template/environment. We'll start rolling back; in-progress 
resources will be allowed to complete normally.
(2) stack-cancel-update <stack_id> --no-rollback will set the 
traversal_id to None so no further resources will be updated; 
in-progress resources will be allowed to complete normally.
(3) resource-stop-update <stack_id> <resource_id> ... <resource_id> Just 
stop the action on given resources, mark as UPDATE_FAILED, don't do 
anything else. The stack won't progress further. Other resources that 
are in progress when the command is issued will complete.

That solves most of the issues, except that (3) has no real equivalent 
on legacy stacks (I guess we could just make it fail on the server side).
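
The --no-rollback behaviour in (2) can be pictured roughly as follows. This is a toy model, assuming each queued resource update carries the traversal id it was dispatched under and is dropped when that id no longer matches the stack's current one. The class and function names are invented for the sketch and are not Heat's.

```python
import uuid


class ToyStack:
    """Stand-in for a stack record; not Heat's Stack class."""

    def __init__(self):
        self.current_traversal = uuid.uuid4().hex


def should_start(stack, message_traversal_id):
    """A queued resource update runs only if its traversal is still live."""
    return (stack.current_traversal is not None
            and message_traversal_id == stack.current_traversal)


def cancel_no_rollback(stack):
    # Option (2): clear the traversal so queued work is dropped.
    # Threads that are already running are not touched.
    stack.current_traversal = None
```

Nothing here kills a thread, which is why in-progress resources are allowed to complete normally.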

What I'm suggesting is very close to that:

(1) stack-cancel-update <stack_id> will start another update using the 
previous template/environment. We'll start rolling back; in-progress 
resources will be allowed to complete normally.
(2) stack-cancel-update <stack_id> --no-rollback will set the 
traversal_id to None so no further resources will be updated; 
in-progress resources will be allowed to complete normally.
(3) resource-mark-unhealthy <stack_id> <resource_id> ... <resource_id> 
Kill any threads running a CREATE or UPDATE on the given resources, mark 
as CHECK_FAILED if they are not already in UPDATE_FAILED, don't do 
anything else. If the resource was in progress, the stack won't progress 
further, other resources currently in-progress will complete, and if 
rollback is enabled and no other traversal has started then it will roll 
back to the previous template/environment.
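
Spelled out as a decision function, the stack-level consequences of (3) would look something like this. The function and its return strings are hypothetical labels for the three outcomes described above, not engine code.

```python
def stack_reaction(was_in_progress, rollback_enabled, newer_traversal_started):
    """What happens to the stack after resource-mark-unhealthy.

    Hypothetical summary of proposal (3); the resource itself always
    ends up in a FAILED state regardless of the value returned here.
    """
    if not was_in_progress:
        # Only the resource state changes; no stack operation is affected.
        return "none"
    if rollback_enabled and not newer_traversal_started:
        # Same as any other resource failure during an update.
        return "rollback"
    # The traversal the resource belonged to makes no further progress;
    # other resources already in progress are allowed to finish.
    return "halt"
```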

Basically this rolls the functionality of resource-stop-update into 
resource-mark-unhealthy instead of making a separate command for it. The 
only real difference is that the resource _always_ ends up in a failed 
state even if it had actually completed before the command was 
processed. (In practice this is likely to be irrelevant, because you'd 
use resource-stop-update only when something was stuck.) I like this 
because in each case under convergence the command acts like a 
convergified (yes, I just said that) version of the legacy behaviour:

(1) In the legacy path we use stack-level locks, so to start a rollback 
we have to kill the current update. In convergence, we just start the 
rollback update.
(2) There's no current equivalent of this, but it would be trivial (and 
useful) to add - the RPC API already supports it, so we just need to 
implement it in the REST API and client. In both cases, it does exactly 
what it says on the tin: acts the same as (1) but without the rollback.
(3) In the legacy path you can't issue this command during a stack 
update due to the stack-level lock, but in convergence without this lock 
you can do it any time. If a resource is in-progress when you mark it 
unhealthy then we just stop it because it's going to a FAILED state 
regardless. The stack update behaves normally - if a resource fails for 
any reason, roll back iff rollback is enabled.

One caveat is that my brain thinks of convergence phase 1 exclusively in 
terms of replacing stack-level locks with resource-level locks. It's 
likely users don't think about it this way. However, I still think it's 
a coherent design, and it avoids adding an extra command to the CLI that 
does almost the same thing as an existing one.
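
For anyone who doesn't already think about it in terms of locks, here is a toy illustration of the difference in granularity. It uses plain threading.Lock objects and invented class names; the real locking in Heat is done differently and this is only meant to show why a command can be accepted mid-update under convergence.

```python
import threading


class LegacyStack:
    """One lock for the whole stack (illustrative only)."""

    def __init__(self, names):
        self.names = list(names)
        self.lock = threading.Lock()

    def can_touch(self, name):
        # Any running stack update holds the single lock, so every
        # resource is off limits until the update finishes.
        if not self.lock.acquire(blocking=False):
            return False
        self.lock.release()
        return True


class ConvergenceStack:
    """One lock per resource (illustrative only)."""

    def __init__(self, names):
        self.locks = {n: threading.Lock() for n in names}

    def can_touch(self, name):
        # Only work on this particular resource gets in the way.
        lock = self.locks[name]
        if not lock.acquire(blocking=False):
            return False
        lock.release()
        return True
```

In this model a busy resource is still refused under convergence; the proposal in (3) is precisely to stop whatever holds it rather than refuse.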

Note that this is actually probably the behaviour we want for 
resource-mark-unhealthy anyway, because that is likely to be called in 
many cases by some external monitoring tool, so it would be better if it 
took effect regardless of what is happening in the stack at the time. We 
can kill two birds with one stone.

cheers,
Zane.


