<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Tue, Apr 19, 2016 at 9:36 PM Zane Bitter <<a href="mailto:zbitter@redhat.com">zbitter@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 17/04/16 00:44, Anant Patil wrote:<br>
> I think it is a good idea, but I see that a resource can be marked<br>
> unhealthy only after it is done.<br>
><br>
><br>
> Currently, yes. The idea would be to change that so that if it finds<br>
> the resource IN_PROGRESS then it kills the thread and makes sure the<br>
> resource is in a FAILED state. I<br>
><br>
><br>
> Move the resource to CHECK_FAILED?<br>
<br>
I'd say that if killing the thread gets it to UPDATE_FAILED then Mission<br>
Accomplished, but obviously we'd have to check for races and make sure<br>
we move it to CHECK_FAILED if the update completes successfully.<br>
<br>
> The trick would be if the stack update is still running and the<br>
> resource is currently IN_PROGRESS to make sure that we fail the<br>
> whole stack update (rolling back if the user has enabled that).<br>
><br>
><br>
> IMO, we can probably use the cancel command to do this, because when you<br>
> are marking a resource as unhealthy, you are<br>
> cancelling any action running on that resource. Would the following be ok?<br>
> (1) stack-cancel-update <stack_id> will cancel the update, mark<br>
> cancelled resources failed and roll back (existing stuff)<br>
> (2) stack-cancel-update <stack_id> --no-rollback will just cancel the<br>
> update and mark cancelled resources as failed<br>
> (3) stack-cancel-update <stack_id> <resource_id> ... <resource_id> Just<br>
> stop the action on given resources, mark as CHECK_FAILED, don't do<br>
> anything else. The stack won't progress further. Other resources running<br>
> during the cancel-update will complete.<br>
<br>
None of those solve the use case I actually care about, which is "don't<br>
start any more resource updates, but don't mark the ones currently<br>
in-progress as failed either, and don't roll back". That would be a huge<br>
help in TripleO. We need a way to be able to stop updates that<br>
guarantees not unnecessarily destroying any part of the existing stack,<br>
and we need that to be the default.<br>
<br>
(We sort-of have the rollback version of this; it's equivalent to a<br>
stack update with the previous template/environment. But we need to make<br>
it easier and decouple it from the rollback IMHO.)<br>
<br>
So one way to do this would be:<br>
<br>
(1) stack-cancel-update <stack_id> will start another update using the<br>
previous template/environment. We'll start rolling back; in-progress<br>
resources will be allowed to complete normally.<br>
(2) stack-cancel-update <stack_id> --no-rollback will set the<br>
traversal_id to None so no further resources will be updated;<br>
in-progress resources will be allowed to complete normally.<br>
(3) stack-cancel-update <stack_id> --stop-in-progress will stop the<br>
traversal, kill any threads running an update (marking cancelled resources<br>
failed) and roll back<br>
(4) stack-cancel-update <stack_id> --stop-in-progress --no-rollback will<br>
just stop the traversal, kill any threads running an update (marking<br>
cancelled resources failed)<br>
(5) stack-cancel-update <stack_id> --stop-in-progress <resource_id> ...<br>
<resource_id> Just stop the action on given resources, mark as<br>
UPDATE_FAILED, don't do anything else. The stack won't progress further.<br>
Other resources running during the cancel-update will complete.<br>
<br>
That would cover all the use cases. Some problems with it are:<br>
- It's way complicated. Lots of options.<br>
- Those options don't translate well to legacy (pre-convergence) stacks<br>
using the same client. e.g. there is now a non-default<br>
--stop-in-progress option, but on legacy stacks we always stop in-progress.<br>
- Options don't commute. When you specify resources with the<br>
--stop-in-progress flag it never rolls back, even though you haven't set<br>
the --no-rollback flag.<br>
<br>
An alternative would be to just drop (3) and (4), and maybe rename (5).<br>
I'd be OK with that:<br>
<br>
(1) stack-cancel-update <stack_id> will start another update using the<br>
previous template/environment. We'll start rolling back; in-progress<br>
resources will be allowed to complete normally.<br>
(2) stack-cancel-update <stack_id> --no-rollback will set the<br>
traversal_id to None so no further resources will be updated;<br>
in-progress resources will be allowed to complete normally.<br>
(3) resource-stop-update <stack_id> <resource_id> ... <resource_id> Just<br>
stop the action on given resources, mark as UPDATE_FAILED, don't do<br>
anything else. The stack won't progress further. Other resources running<br>
while cancel-update will complete.<br>
<br>
That solves most of the issues, except that (3) has no real equivalent<br>
on legacy stacks (I guess we could just make it fail on the server side).<br>
<br>
What I'm suggesting is very close to that:<br>
<br>
(1) stack-cancel-update <stack_id> will start another update using the<br>
previous template/environment. We'll start rolling back; in-progress<br>
resources will be allowed to complete normally.<br>
(2) stack-cancel-update <stack_id> --no-rollback will set the<br>
traversal_id to None so no further resources will be updated;<br>
in-progress resources will be allowed to complete normally.<br>
(3) resource-mark-unhealthy <stack_id> <resource_id> ... <resource_id><br>
Kill any threads running a CREATE or UPDATE on the given resources, mark<br>
as CHECK_FAILED if they are not already in UPDATE_FAILED, don't do<br>
anything else. If the resource was in progress, the stack won't progress<br>
further, other resources currently in-progress will complete, and if<br>
rollback is enabled and no other traversal has started then it will roll<br>
back to the previous template/environment.<br>
<br></blockquote><div>I have started implementing the above three mechanisms. The first two are implemented in <a href="https://review.openstack.org/#/c/357618">https://review.openstack.org/#/c/357618</a></div><div>Note that (2) needs a change in the heat client (openstack client?) to add a --no-rollback option.</div><div>(3) is a bit of a long haul, and needs:</div><div><a href="https://review.openstack.org/343076">https://review.openstack.org/343076</a> : Adds mechanism to interrupt convergence worker threads<br></div><div><a href="https://review.openstack.org/301483">https://review.openstack.org/301483</a> : Mechanism to send cancel message and cancel worker upon receiving messages<br></div><div>Apart from the above two, I am implementing the actual patch, which will build on them to complete the resource-mark-unhealthy feature in convergence.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Basically this rolls the functionality of resource-stop-update into<br>
resource-mark-unhealthy instead of making a separate command for it. The<br>
only real difference is that the resource _always_ ends up in a failed<br>
state even if it had actually completed before the command was<br>
processed. (In practice this is likely to be irrelevant, because you'd<br>
use resource-stop-update only when something was stuck.) I like this<br>
because in each case under convergence the command acts like a<br>
convergified (yes, I just said that) version of the legacy behaviour:<br>
<br>
(1) In the legacy path we use stack-level locks, so to start a rollback<br>
we have to kill the current update. In convergence, we just start the<br>
rollback update.<br>
(2) There's no current equivalent of this, but it would be trivial (and<br>
useful) to add - the RPC API already supports it, so we just need to<br>
implement it in the REST API and client. In both cases, it does exactly<br>
what it says on the tin: acts the same as (1) but without the rollback.<br>
(3) In the legacy path you can't issue this command during a stack<br>
update due to the stack-level lock, but in convergence without this lock<br>
you can do it any time. If a resource is in-progress when you mark it<br>
unhealthy then we just stop it because it's going to a FAILED state<br>
regardless. The stack update behaves normally - if a resource fails for<br>
any reason, roll back iff rollback is enabled.<br>
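<br>
As a rough illustration only, the behaviour described in (3) might be sketched in Python like this. Every name here (Resource, mark_unhealthy, kill_worker, should_roll_back) is hypothetical and none of it is Heat's actual code; kill_worker is assumed to stop the thread and normally leave the resource FAILED, unless the operation raced to COMPLETE first.<br>
<pre>
```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical stand-in for a Heat resource, not Heat's real class."""
    name: str
    action: str  # e.g. "CREATE", "UPDATE", "CHECK"
    status: str  # e.g. "IN_PROGRESS", "COMPLETE", "FAILED"


def mark_unhealthy(resource, kill_worker):
    """Force the resource into a failed state.

    Returns True if a CREATE or UPDATE was in progress, i.e. the running
    stack update (if any) should now be treated as failed.
    """
    was_in_progress = (
        resource.status == "IN_PROGRESS"
        and resource.action in ("CREATE", "UPDATE")
    )
    if was_in_progress:
        # Assumption: the callback stops the worker thread for this resource.
        kill_worker(resource)
    # The thread only exempts UPDATE_FAILED; anything else, including an
    # update that completed before it could be killed, becomes CHECK_FAILED.
    if (resource.action, resource.status) != ("UPDATE", "FAILED"):
        resource.action, resource.status = "CHECK", "FAILED"
    return was_in_progress


def should_roll_back(was_in_progress, rollback_enabled, other_traversal_started):
    """Roll back only if we interrupted work, rollback is enabled, and no
    newer traversal has already replaced the current one."""
    return was_in_progress and rollback_enabled and not other_traversal_started
```
</pre>
With this shape, a resource that is idle when it is marked unhealthy simply becomes CHECK_FAILED and the stack is left alone, while an in-progress one also fails the update it belongs to.<br>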
<br>
One caveat is that my brain thinks of convergence phase 1 exclusively in<br>
terms of replacing stack-level locks with resource-level locks. It's<br>
likely users don't think about it this way. However, I still think it's<br>
a coherent design, and it avoids adding an extra command to the CLI that<br>
does almost the same thing as an existing one.<br>
<br>
Note that this is actually probably the behaviour we want for<br>
resource-mark-unhealthy anyway, because that is likely to be called in<br>
many cases by some external monitoring tool, so it would be better if it<br>
took effect regardless of what is happening in the stack at the time. We<br>
can kill two birds with one stone.<br>
<br>
cheers,<br>
Zane.<br>
<br>
__________________________________________________________________________<br>
OpenStack Development Mailing List (not for usage questions)<br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" rel="noreferrer" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a></blockquote><div><br></div><div>Thanks,</div><div>Anant </div></div></div>