[openstack-dev] [heat] convergence cancel messages

Zane Bitter zbitter at redhat.com
Fri Apr 15 19:48:19 UTC 2016


On 15/04/16 10:58, Anant Patil wrote:
> On 14-Apr-16 23:09, Zane Bitter wrote:
>> On 11/04/16 04:51, Anant Patil wrote:
>>> After a lot of ping-pong in my head, I have taken a different approach
>>> to implementing stack-update-cancel when convergence is on. Polling for
>>> traversal updates in each heat engine worker is not an efficient method,
>>> and neither is broadcasting.
>>>
>>> In the new implementation, when a stack-cancel-update request is
>>> received, the heat engine worker will immediately cancel eventlets
>>> running locally for the stack. Then it sends cancel messages to only
>>> those heat engines that are working on the stack, one request per engine.
>>
>> I'm concerned that this is forgetting the reason we didn't implement
>> this in convergence in the first place. The purpose of
>> stack-cancel-update is to roll the stack back to its pre-update state,
>> not to unwedge blocked resources.
>>
>
> Yes, we thought this was never needed because we consciously decided
> that the concurrent update feature would suffice for users' needs.
> That is exactly why I am implementing this so late. But there were
> questions about API compatibility, and what if the user really wants to
> cancel the update, knowing full well the consequences of it?

Cool, we are on the same page then :)
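
For reference, a rough sketch of the fan-out you describe - every name
below is invented for illustration, not Heat's actual RPC interface:

    def cancel_stack_update(ctxt, rpc_client, thread_group, stack):
        # 1. Kill the greenthreads this engine itself is running for
        #    the stack.
        for gt in thread_group.threads_for(stack.id):
            gt.kill()

        # 2. Send one targeted cancel message to each other engine that
        #    has in-progress work on this traversal (an oslo.messaging-
        #    style cast), rather than polling or broadcasting.
        for engine_id in stack.engines_for_traversal():
            rpc_client.prepare(server=engine_id).cast(
                ctxt, 'cancel_stack_update', stack_id=stack.id)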

>> The problem with just killing a thread is that the resource gets left in
>> an unknown state. (It's slightly less dangerous if you do it only during
>> sleeps, but still the state is indeterminate.) As a result, we mark all
>> such resources UPDATE_FAILED, and anything (apart from nested stacks) in
>> a FAILED state is liable to be replaced on the next update (straight
>> away in the case of a rollback). That's why in convergence we just let
>> resources run their course rather than cancelling them, and of course we
>> are able to do so because they don't block other operations on the stack
>> until they reach the point of needing to operate on that particular
>> resource.
>>
>
> The eventlet returns after each "step", so it's not that bad, but I do

Yeah, I saw you implemented it that way, and this is a *big* 
improvement. That will help avoid bugs like 
http://lists.openstack.org/pipermail/openstack-dev/2016-January/084467.html
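
To spell out what the per-"step" yielding buys us - a sketch only, with
invented names (resource.steps(), CancelledError) - a cancel can only
land at a step boundary, never mid-way through an in-flight API call:

    import eventlet

    class CancelledError(Exception):
        pass

    def run_resource_task(resource, cancel_event):
        # cancel_event is an eventlet.event.Event set by the engine
        # when a cancel message arrives.
        for step in resource.steps():      # hypothetical step iterator
            step()
            if cancel_event.ready():
                raise CancelledError()     # unwind at a safe boundary
            eventlet.sleep(0)              # yield to the hub between steps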

> agree that the resource might not be in a state from which it can
> "resume", and hence the update-replace.

The issue is that Heat *always* moves the resource to FAILED and 
therefore it is *always* replaced in the future, even if it would have 
completed fine.

So even a trivial change that is guaranteed to happen in place could
result in a critical resource that must never be replaced (e.g. a Cinder
volume) being replaced, just because you happened to cancel the update
at the wrong moment.
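
(In pseudo-Python, the behaviour being described - not Heat's actual
code:)

    def next_update_action(resource):
        # Anything in a FAILED state is replaced on the next update,
        # even if an in-place update would have succeeded.
        if resource.status == 'FAILED':
            return 'REPLACE'
        return 'UPDATE_IN_PLACE'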

> I acknowledge your concern here,
> but I see this as the case where the user really knows that the stack
> is stuck because of some unexpected failure that Heat is not aware of,
> and wants to cancel it.

I think there are two different use cases here: (1) just stop the update 
and don't start updating any more resources (and maybe roll back what 
has already been done); and (2) kill the update on this resource that is 
stuck. Using the same command for both is likely to cause trouble for 
people who were only wanting the first one.

The other option would be to have stack-cancel-update just do (1) by 
default, but add a --cancel-me-harder option that also does (2).
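
In code terms, something like this (a sketch only - every name below is
invented, not the real API):

    def cancel_update(stack, rollback=True, force=False):
        # Use case (1): stop scheduling work on any further resources.
        stack.stop_scheduling_new_resources()
        if force:
            # Use case (2), the "--cancel-me-harder" path: also kill
            # in-flight resource tasks, leaving those resources FAILED
            # (and therefore due for replacement on the next update).
            for res in stack.in_progress_resources():
                res.kill_task()
        if rollback:
            stack.rollback()   # optional - see Anant's suggestion below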

>> That leaves the problem of what to do when you _know_ a resource is
>> going to fail, you _want_ to replace it, and you don't want to wait for
>> the stack timeout. (In theory this problem will go away when Phase 2 of
>> convergence is fully implemented, but I agree we need a solution for
>> Phase 1.) Now that we have the mark-unhealthy API,[1] that seems to me
>> like a better candidate for the functionality to stop threads than
>> stack-cancel-update is, since its entire purpose in life is to set a
>> resource into a FAILED state so that it will get replaced on the next
>> stack update.
>>
>> So from a user's perspective, they would issue stack-cancel-update to
>> start the rollback, and iff that gets stuck waiting on a resource that
>> is doomed to fail eventually and which they just want to replace, they
>> can issue resource-mark-unhealthy to just stop that resource.
>>
>
> I was thinking of having the rollback optional while cancelling the
> update. The user may want to cancel the update and issue a new one, but
> not roll back.

+1, this is a good idea. I originally thought that you'd never want to 
leave the stack in an intermediate state, but experience with TripleO 
(which can't really do rollbacks) is that sometimes you really do just 
want to hit the panic button and stop the world :D

>> What do you think?
>>
>
> I think it is a good idea, but I see that a resource can be marked
> unhealthy only after it is done.

Currently, yes. The idea would be to change that so that if it finds the 
resource IN_PROGRESS then it kills the thread and makes sure the 
resource is in a FAILED state. I imagine/hope it wouldn't require big 
changes to your patch, mostly just changing where it's triggered from.

The trick, if the stack update is still running and the resource is
currently IN_PROGRESS, would be to make sure that we fail the whole
stack update (rolling back if the user has enabled that).
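
Roughly (a sketch only - the method names here are invented):

    def mark_unhealthy(stack, resource, reason):
        if resource.status == 'IN_PROGRESS':
            resource.kill_task()          # stop the running thread
            resource.set_failed(reason)   # ensure a FAILED state
            if stack.update_in_progress():
                # Don't leave the wider update running against a
                # resource we just killed.
                stack.fail_update(rollback=stack.rollback_enabled())
        else:
            resource.set_failed(reason)   # the existing behaviour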

> The cancel update would take care of
> in-progress resources gone bad. I really thought mark-unhealthy and
> stack-cancel-update were complementary features rather than
> contradictory ones.

I'm relaxed about whether this is implemented as part of the 
mark-unhealthy or as a non-default option to cancel-update. The main 
thing is not to put IN_PROGRESS resources into a FAILED state by default 
whenever the user cancels an update.

Reusing mark-unhealthy as the trigger for this functionality seemed 
appealing because it already has basically the semantics that users are 
going to get (tell Heat to replace this resource on the next update), so there 
should be no surprises for users, and because it offers fine-grained 
control (at the resource level rather than the stack level).

cheers,
Zane.


