<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Tue, Apr 19, 2016 at 9:36 PM Zane Bitter <<a href="mailto:zbitter@redhat.com">zbitter@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 17/04/16 00:44, Anant Patil wrote:<br>

>         I think it is a good idea, but I see that a resource can be marked<br>

>         unhealthy only after it is done.<br>

><br>

><br>

>     Currently, yes. The idea would be to change that so that if it finds<br>

>     the resource IN_PROGRESS then it kills the thread and makes sure the<br>

>     resource is in a FAILED state. I<br>

><br>

><br>

> Move the resource to CHECK_FAILED?<br>

<br>

I'd say that if killing the thread gets it to UPDATE_FAILED then Mission<br>

Accomplished, but obviously we'd have to check for races and make sure<br>

we move it to CHECK_FAILED if the update completes successfully.<br>

<br>

>     The trick would be if the stack update is still running and the<br>

>     resource is currently IN_PROGRESS to make sure that we fail the<br>

>     whole stack update (rolling back if the user has enabled that).<br>

><br>

><br>

> IMO, we can probably use the cancel command to do this, because when you<br>

> are marking a resource as unhealthy, you are<br>

> cancelling any action running on that resource. Would the following be ok?<br>

> (1) stack-cancel-update <stack_id> will cancel the update, mark<br>

> cancelled resources failed and rollback (existing stuff)<br>

> (2) stack-cancel-update <stack_id> --no-rollback will just cancel the<br>

> update and mark cancelled resources as failed<br>

> (3) stack-cancel-update <stack_id> <resource_id> ... <resource_id> Just<br>

> stop the action on given resources, mark as CHECK_FAILED, don't do<br>

> anything else. The stack won't progress further. Other resources that are<br>

> running when cancel-update is issued will complete.<br>

<br>

None of those solve the use case I actually care about, which is "don't<br>

start any more resource updates, but don't mark the ones currently<br>

in-progress as failed either, and don't roll back". That would be a huge<br>

help in TripleO. We need a way to be able to stop updates that<br>

guarantees not unnecessarily destroying any part of the existing stack,<br>

and we need that to be the default.<br>

<br>

(We sort-of have the rollback version of this; it's equivalent to a<br>

stack update with the previous template/environment. But we need to make<br>

it easier and decouple it from the rollback IMHO.)<br>

<br>

So one way to do this would be:<br>

<br>

(1) stack-cancel-update <stack_id> will start another update using the<br>

previous template/environment. We'll start rolling back; in-progress<br>

resources will be allowed to complete normally.<br>

(2) stack-cancel-update <stack_id> --no-rollback will set the<br>

traversal_id to None so no further resources will be updated;<br>

in-progress resources will be allowed to complete normally.<br>

(3) stack-cancel-update <stack_id> --stop-in-progress will stop the<br>

traversal, kill any threads running an update (marking cancelled resources<br>

failed) and rollback<br>

(4) stack-cancel-update <stack_id> --stop-in-progress --no-rollback will<br>

just stop the traversal, kill any threads running an update (marking<br>

cancelled resources failed)<br>

(5) stack-cancel-update <stack_id> --stop-in-progress <resource_id> ...<br>

<resource_id> Just stop the action on given resources, mark as<br>

UPDATE_FAILED, don't do anything else. The stack won't progress further.<br>

Other resources that are running when cancel-update is issued will complete.<br>

<br>

That would cover all the use cases. Some problems with it are:<br>

- It's way complicated. Lots of options.<br>

- Those options don't translate well to legacy (pre-convergence) stacks<br>

using the same client. e.g. there is now a non-default<br>

--stop-in-progress option, but on legacy stacks we always stop in-progress.<br>

- Options don't commute. When you specify resources with the<br>

--stop-in-progress flag it never rolls back, even though you haven't set<br>

the --no-rollback flag.<br>

<br>

An alternative would be to just drop (3) and (4), and maybe rename (5).<br>

I'd be OK with that:<br>

<br>

(1) stack-cancel-update <stack_id> will start another update using the<br>

previous template/environment. We'll start rolling back; in-progress<br>

resources will be allowed to complete normally.<br>

(2) stack-cancel-update <stack_id> --no-rollback will set the<br>

traversal_id to None so no further resources will be updated;<br>

in-progress resources will be allowed to complete normally.<br>

(3) resource-stop-update <stack_id> <resource_id> ... <resource_id> Just<br>

stop the action on given resources, mark as UPDATE_FAILED, don't do<br>

anything else. The stack won't progress further. Other resources that are<br>

running when the command is issued will complete.<br>

<br>

That solves most of the issues, except that (3) has no real equivalent<br>

on legacy stacks (I guess we could just make it fail on the server side).<br>

<br>

What I'm suggesting is very close to that:<br>

<br>

(1) stack-cancel-update <stack_id> will start another update using the<br>

previous template/environment. We'll start rolling back; in-progress<br>

resources will be allowed to complete normally.<br>

(2) stack-cancel-update <stack_id> --no-rollback will set the<br>

traversal_id to None so no further resources will be updated;<br>

in-progress resources will be allowed to complete normally.<br>

(3) resource-mark-unhealthy <stack_id> <resource_id> ... <resource_id><br>

Kill any threads running a CREATE or UPDATE on the given resources, mark<br>

as CHECK_FAILED if they are not already in UPDATE_FAILED, don't do<br>

anything else. If the resource was in progress, the stack won't progress<br>

further, other resources currently in-progress will complete, and if<br>

rollback is enabled and no other traversal has started then it will roll<br>

back to the previous template/environment.<br>
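<br>
Items (1) and (2) above can be sketched with a toy model. This is only an illustration under stated assumptions: ToyStack, may_start and cancel_update are hypothetical names, not Heat's actual classes or RPC calls, and a random hex string stands in for the traversal_id. The point it tries to show is that cancelling only changes which traversal is current, so workers stop picking up new resources while anything already in progress is left to finish.<br>

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyStack:
    """Hypothetical stand-in for a convergence stack; not Heat's real class."""
    template: str
    previous_template: str
    # Assumption: a random hex string plays the role of the traversal_id.
    current_traversal: Optional[str] = field(
        default_factory=lambda: uuid.uuid4().hex)

    def may_start(self, traversal_id: str) -> bool:
        # A worker would call this before starting work on the next resource.
        # Work that has already started is never interrupted by this check.
        return (self.current_traversal is not None
                and self.current_traversal == traversal_id)

    def cancel_update(self, rollback: bool = True) -> Optional[str]:
        if rollback:
            # (1) Start another update using the previous template.
            self.template = self.previous_template
            self.current_traversal = uuid.uuid4().hex
        else:
            # (2) No traversal is current, so no further resources start.
            self.current_traversal = None
        return self.current_traversal
```

In this model a worker holding the old traversal id gets False from may_start() after either form of cancel, which is the "don't start any more resource updates" behaviour; neither branch touches resources that are already running.<br>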

<br></blockquote><div>I have started implementing the above three mechanisms. The first two are implemented in <a href="https://review.openstack.org/#/c/357618">https://review.openstack.org/#/c/357618</a></div><div>Note that (2) needs a change in the heat client (openstack client?) to add a --no-rollback option.</div><div>(3) is a bit of a long haul, and needs:</div><div><a href="https://review.openstack.org/343076">https://review.openstack.org/343076</a> : Adds a mechanism to interrupt convergence worker threads<br></div><div><a href="https://review.openstack.org/301483">https://review.openstack.org/301483</a> : Mechanism to send a cancel message and cancel the worker upon receiving it<br></div><div>Apart from the above two, I am implementing the actual patch, which will build on them to complete the resource-mark-unhealthy feature in convergence.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Basically this rolls the functionality of resource-stop-update into<br>

resource-mark-unhealthy instead of making a separate command for it. The<br>

only real difference is that the resource _always_ ends up in a failed<br>

state even if it had actually completed before the command was<br>

processed. (In practice this is likely to be irrelevant, because you'd<br>

use resource-stop-update only when something was stuck.) I like this<br>

because in each case under convergence the command acts like a<br>

convergified (yes, I just said that) version of the legacy behaviour:<br>

<br>

(1) In the legacy path we use stack-level locks, so to start a rollback<br>

we have to kill the current update. In convergence, we just start the<br>

rollback update.<br>

(2) There's no current equivalent of this, but it would be trivial (and<br>

useful) to add - the RPC API already supports it, so we just need to<br>

implement it in the ReST API and client. In both cases, it does exactly<br>

what it says on the tin: acts the same as (1) but without the rollback.<br>

(3) In the legacy path you can't issue this command during a stack<br>

update due to the stack-level lock, but in convergence without this lock<br>

you can do it any time. If a resource is in-progress when you mark it<br>

unhealthy then we just stop it because it's going to a FAILED state<br>

regardless. The stack update behaves normally - if a resource fails for<br>

any reason, roll back iff rollback is enabled.<br>
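<br>
The resource-mark-unhealthy rules described in (3) can be written down as a small decision function. This is a sketch, not Heat code: ToyResource, mark_unhealthy and the "rollback" / "stop" / "none" outcome strings are made up for illustration, and killing the worker thread is modelled simply as the in-progress action failing.<br>

```python
from dataclasses import dataclass
from typing import Tuple

IN_PROGRESS, COMPLETE, FAILED = "IN_PROGRESS", "COMPLETE", "FAILED"


@dataclass
class ToyResource:
    """Hypothetical resource record; Heat's real model differs."""
    action: str  # e.g. "CREATE", "UPDATE", "CHECK"
    status: str  # IN_PROGRESS, COMPLETE or FAILED


def mark_unhealthy(res: ToyResource, rollback_enabled: bool,
                   newer_traversal_started: bool = False
                   ) -> Tuple[ToyResource, str]:
    """Return the resource's final state and what the stack update does."""
    was_in_progress = (res.status == IN_PROGRESS
                       and res.action in ("CREATE", "UPDATE"))
    if was_in_progress:
        # Assumption: killing the thread leaves the action in FAILED.
        res = ToyResource(res.action, FAILED)

    # Mark CHECK_FAILED unless the resource is already UPDATE_FAILED.
    if (res.action, res.status) != ("UPDATE", FAILED):
        res = ToyResource("CHECK", FAILED)

    if not was_in_progress:
        return res, "none"      # nothing else happens to the stack
    if rollback_enabled and not newer_traversal_started:
        return res, "rollback"  # back to the previous template/environment
    return res, "stop"          # the stack just does not progress further
```

Either way the resource always ends in a failed state, even if it had already completed, which matches the one real difference from resource-stop-update noted above.<br>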

<br>

One caveat is that my brain thinks of convergence phase 1 exclusively in<br>

terms of replacing stack-level locks with resource-level locks. It's<br>

likely users don't think about it this way. However, I still think it's<br>

a coherent design, and it avoids adding an extra command to the CLI that<br>

does almost the same thing as an existing one.<br>

<br>

Note that this is actually probably the behaviour we want for<br>

resource-mark-unhealthy anyway, because that is likely to be called in<br>

many cases by some external monitoring tool, so it would be better if it<br>

took effect regardless of what is happening in the stack at the time. We<br>

can kill two birds with one stone.<br>

<br>

cheers,<br>

Zane.<br>

<br>

</blockquote><div><br></div><div>Thanks,</div><div>Anant </div></div></div>