<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Apr 16, 2016 at 1:18 AM, Zane Bitter <span dir="ltr">&lt;<a href="mailto:zbitter@redhat.com" target="_blank">zbitter@redhat.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class="">On 15/04/16 10:58, Anant Patil wrote:<br>

</span><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class="">

On 14-Apr-16 23:09, Zane Bitter wrote:<br>

</span><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class="">

On 11/04/16 04:51, Anant Patil wrote:<br>

</span><span class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

After a lot of ping-pong in my head, I have taken a different approach to<br>

implement stack-update-cancel when convergence is on. Polling for<br>

traversal update in each heat engine worker is not an efficient method, and<br>

neither is the broadcasting method.<br>

<br>

In the new implementation, when a stack-cancel-update request is<br>

received, the heat engine worker will immediately cancel eventlets<br>

running locally for the stack. Then it sends cancel messages to only<br>

those heat engines that are working on the stack, one request per engine.<br>

</blockquote>

<br>

I'm concerned that this is forgetting the reason we didn't implement<br>

this in convergence in the first place. The purpose of<br>

stack-cancel-update is to roll the stack back to its pre-update state,<br>

not to unwedge blocked resources.<br>

<br>

</span></blockquote><span class="">

<br>

Yes, we thought this was never needed because we consciously decided<br>

that the concurrent update feature would meet the needs of users.<br>

Exactly the reason for me to implement this so late. But there were<br>

questions about API compatibility, and what if the user really wants to cancel<br>

the update, given that he/she knows the consequences of it?<br>

</span></blockquote>

<br>

Cool, we are on the same page then :)<span class=""><br>

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

The problem with just killing a thread is that the resource gets left in<br>

an unknown state. (It's slightly less dangerous if you do it only during<br>

sleeps, but still the state is indeterminate.) As a result, we mark all<br>

such resources UPDATE_FAILED, and anything (apart from nested stacks) in<br>

a FAILED state is liable to be replaced on the next update (straight<br>

away in the case of a rollback). That's why in convergence we just let<br>

resources run their course rather than cancelling them, and of course we<br>

are able to do so because they don't block other operations on the stack<br>

until they reach the point of needing to operate on that particular<br>

resource.<br>

<br>

</blockquote>

<br>

The eventlet returns after each "step", so it's not that bad, but I do<br>

</blockquote>

<br></span>

Yeah, I saw you implemented it that way, and this is a *big* improvement. That will help avoid bugs like <a href="http://lists.openstack.org/pipermail/openstack-dev/2016-January/084467.html" rel="noreferrer" target="_blank">http://lists.openstack.org/pipermail/openstack-dev/2016-January/084467.html</a><span class=""><br>

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

agree that the resource might not be in a state from where it can<br>

"resume", and hence the update-replace.<br>

</blockquote>

<br></span>

The issue is that Heat *always* moves the resource to FAILED and therefore it is *always* replaced in the future, even if it would have completed fine.<br>

<br>

So doing some trivial change that is guaranteed to happen in-place could result in your critical resource that must never be replaced (e.g. Cinder volume) being replaced if you happen to cancel the update at just the wrong moment.</blockquote><div><br></div><div>I agree with you on the need for a mechanism to just stop the update (or whatever Heat was doing to that resource :)) <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class=""> <br></span></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class="">

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

I acknowledge your concern here,<br>

but I see that the user really knows that the stack is stuck because of<br>

some unexpected failure which heat is not aware of, and wants to cancel<br>

it.<br>

</blockquote>

<br></span>

I think there are two different use cases here: (1) just stop the update and don't start updating any more resources (and maybe roll back what has already been done); and (2) kill the update on this resource that is stuck. Using the same command for both is likely to cause trouble for people who only wanted the first one.<br>

<br>

The other option would be to have stack-cancel-update just do (1) by default, but add a --cancel-me-harder option that also does (2).</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class=""><br>

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

That leaves the problem of what to do when you _know_ a resource is<br>

going to fail, you _want_ to replace it, and you don't want to wait for<br>

the stack timeout. (In theory this problem will go away when Phase 2 of<br>

convergence is fully implemented, but I agree we need a solution for<br>

Phase 1.) Now that we have the mark-unhealthy API,[1] that seems to me<br>

like a better candidate for the functionality to stop threads than<br>

stack-cancel-update is, since its entire purpose in life is to set a<br>

resource into a FAILED state so that it will get replaced on the next<br>

stack update.<br>

<br>

So from a user's perspective, they would issue stack-cancel-update to<br>

start the rollback, and iff that gets stuck waiting on a resource that<br>

is doomed to fail eventually and which they just want to replace, they<br>

can issue resource-mark-unhealthy to just stop that resource.<br>

<br>

</blockquote>

<br>

I was thinking of having the rollback optional while cancelling the<br>

update. The user may want to cancel the update and issue a new one, but<br>

not rollback.<br>

</blockquote>

<br></span>

+1, this is a good idea. I originally thought that you'd never want to leave the stack in an intermediate state, but experience with TripleO (which can't really do rollbacks) is that sometimes you really do just want to hit the panic button and stop the world :D</blockquote><div><br></div><div>Yeah, I have heard folks wanting to just cancel and nothing else.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class=""><br>

<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

What do you think?<br>

<br>

</blockquote>

<br>

I think it is a good idea, but I see that a resource can be marked<br>

unhealthy only after it is done.<br>

</blockquote>

<br></span>

Currently, yes. The idea would be to change that so that if it finds the resource IN_PROGRESS then it kills the thread and makes sure the resource is in a FAILED state. I </blockquote><div><br></div><div>Move the resource to CHECK_FAILED?</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">imagine/hope it wouldn't require big changes to your patch, mostly just changing where it's triggered from.</blockquote><div><br></div><div>I will be more comfortable submitting another patch to implement this feature :)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"> <br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

The trick would be if the stack update is still running and the resource is currently IN_PROGRESS to make sure that we fail the whole stack update (rolling back if the user has enabled that).<span class=""><br>

<br></span></blockquote><div><br></div><div>IMO, we can probably use the cancel command to do this, because when you mark a resource as unhealthy, you are</div><div>cancelling any action running on that resource. Would the following be ok?</div><div>(1) stack-cancel-update &lt;stack_id&gt; will cancel the update, mark cancelled resources as failed and roll back (existing stuff)</div><div>(2) stack-cancel-update &lt;stack_id&gt; --no-rollback will just cancel the update and mark cancelled resources as failed</div><div>(3) stack-cancel-update &lt;stack_id&gt; &lt;resource_id&gt; ... &lt;resource_id&gt; will just stop the action on the given resources and mark them as CHECK_FAILED, without doing anything else. The stack won't progress further. Other resources that are running when the cancel is issued will complete.</div><div><br></div><div>(3) is like mark-unhealthy when the stack is IN_PROGRESS.</div><div>Also, IMO it doesn't make any sense to run (3) with rollback, as the user just wants to stop some resources. Please correct me if I am wrong.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span class="">

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

The cancel update would take care of<br>

in-progress resources gone bad. I really thought the mark-unhealthy and<br>

stack-cancel-update were complementary features rather than contradictory ones.<br>

</blockquote>

<br></span>

I'm relaxed about whether this is implemented as part of the mark-unhealthy or as a non-default option to cancel-update. The main thing is not to put IN_PROGRESS resources into a FAILED state by default whenever the user cancels an update.<br>

<br>

Reusing mark-unhealthy as the trigger for this functionality seemed appealing because it already has basically the semantics that we are going to get (tell Heat to replace this resource on the next update), so there should be no surprises for users, and because it offers fine-grained control (at the resource level rather than the stack level).<br></blockquote><div><br></div><div>I agree. </div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

cheers,<br>

Zane.<div class=""><div class="h5"><br>

<br>

__________________________________________________________________________<br>

OpenStack Development Mailing List (not for usage questions)<br>


</div></div></blockquote></div><br></div></div>
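<div>The three stack-cancel-update variants proposed above could be sketched roughly as the toy model below. This is only an illustration of the intended semantics, not Heat's actual code: the Stack class, the cancel_update function, its parameters and the two flags are all hypothetical names, and the status strings are used as plain labels.</div>

```python
from dataclasses import dataclass, field
from typing import Dict

# Status labels used by this toy model (plain strings, assumed for illustration).
IN_PROGRESS = "UPDATE_IN_PROGRESS"
COMPLETE = "UPDATE_COMPLETE"
UPDATE_FAILED = "UPDATE_FAILED"
CHECK_FAILED = "CHECK_FAILED"


@dataclass
class Stack:
    # Hypothetical stand-in for a stack: resource name mapped to a status label.
    resources: Dict[str, str] = field(default_factory=dict)
    update_stopped: bool = False  # no further resources will be started
    rolling_back: bool = False    # a rollback was requested


def cancel_update(stack, rollback=True, resource_names=None):
    """Toy model of the three proposed variants (hypothetical, not Heat's API)."""
    if resource_names:
        # Variant (3): stop only the named in-progress resources.
        # Other running resources are left alone and no rollback is started.
        for name in resource_names:
            if stack.resources.get(name) == IN_PROGRESS:
                stack.resources[name] = CHECK_FAILED
        stack.update_stopped = True
        return stack

    # Variants (1) and (2): cancel every resource that is still in progress.
    for name in list(stack.resources):
        if stack.resources[name] == IN_PROGRESS:
            stack.resources[name] = UPDATE_FAILED
    stack.update_stopped = True
    # Variant (1) rolls back; variant (2), i.e. --no-rollback, does not.
    stack.rolling_back = rollback
    return stack
```

<div>In this sketch, cancel_update(stack) corresponds to (1), cancel_update(stack, rollback=False) to (2), and cancel_update(stack, resource_names=[...]) to (3). Variant (3) deliberately ignores the rollback argument, matching the suggestion that rollback makes no sense there.</div>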