[openstack-dev] [heat] convergence cancel messages

Anant Patil anant.techie at gmail.com
Sun Apr 17 04:44:13 UTC 2016

On Sat, Apr 16, 2016 at 1:18 AM, Zane Bitter <zbitter at redhat.com> wrote:

> On 15/04/16 10:58, Anant Patil wrote:
>> On 14-Apr-16 23:09, Zane Bitter wrote:
>>> On 11/04/16 04:51, Anant Patil wrote:
>>>> After lot of ping-pong in my head, I have taken a different approach to
>>>> implement stack-update-cancel when convergence is on. Polling for
>>>> traversal update in each heat engine worker is not efficient method and
>>>> so is the broadcasting method.
>>>> In the new implementation, when a stack-cancel-update request is
>>>> received, the heat engine worker will immediately cancel eventlets
>>>> running locally for the stack. Then it sends cancel messages to only
>>>> those heat engines who are working on the stack, one request per engine.
>>> I'm concerned that this is forgetting the reason we didn't implement
>>> this in convergence in the first place. The purpose of
>>> stack-cancel-update is to roll the stack back to its pre-update state,
>>> not to unwedge blocked resources.
>> Yes, we thought this was never needed because we consciously decided
>> that the concurrent update feature would suffice the needs of user.
>> Exactly the reason for me to implement this so late. But there were
>> questions for API compatibility, and what if user really wants to cancel
>> the update, given that he/she knows the consequence of it?
> Cool, we are on the same page then :)
> The problem with just killing a thread is that the resource gets left in
>>> an unknown state. (It's slightly less dangerous if you do it only during
>>> sleeps, but still the state is indeterminate.) As a result, we mark all
>>> such resources UPDATE_FAILED, and anything (apart from nested stacks) in
>>> a FAILED state is liable to be replaced on the next update (straight
>>> away in the case of a rollback). That's why in convergence we just let
>>> resources run their course rather than cancelling them, and of course we
>>> are able to do so because they don't block other operations on the stack
>>> until they reach the point of needing to operate on that particular
>>> resource.
>> The eventlet returns after each "step", so it's not that bad, but I do
> Yeah, I saw you implemented it that way, and this is a *big* improvement.
> That will help avoid bugs like
> http://lists.openstack.org/pipermail/openstack-dev/2016-January/084467.html
> agree that the resource might not be in a state from where it can
>> "resume", and hence the update-replace.
> The issue is that Heat *always* moves the resource to FAILED and therefore
> it is *always* replaced in the future, even if it would have completed fine.
> So doing some trivial change that is guaranteed to happen in-place could
> result in your critical resource that must never be replaced (e.g. Cinder
> volume) being replaced if you happen to cancel the update at just the wrong
> moment.

I agree with you for the need to have a mechanism to just stop doing the
update (or whatever heat was doing to that resource :))

I acknowledge your concern here,
>> but I see that the user really knows that the stack is stuck because of
>> some unexpected failure which heat is not aware of, and wants to cancel
>> it.
> I think there's two different use cases here: (1) just stop the update and
> don't start updating any more resources (and maybe roll back what has
> already been done); and (2) kill the update on this resource that is stuck.
> Using the same command for both is likely to cause trouble for people who
> were only wanting the first one.
> The other option would be to have stack-cancel-update just do (1) by
> default, but add a --cancel-me-harder option that also does (2).

> That leaves the problem of what to do when you _know_ a resource is
>>> going to fail, you _want_ to replace it, and you don't want to wait for
>>> the stack timeout. (In theory this problem will go away when Phase 2 of
>>> convergence is fully implemented, but I agree we need a solution for
>>> Phase 1.) Now that we have the mark-unhealthy API,[1] that seems to me
>>> like a better candidate for the functionality to stop threads than
>>> stack-cancel-update is, since its entire purpose in life is to set a
>>> resource into a FAILED state so that it will get replaced on the next
>>> stack update.
>>> So from a user's perspective, they would issue stack-cancel-update to
>>> start the rollback, and iff that gets stuck waiting on a resource that
>>> is doomed to fail eventually and which they just want to replace, they
>>> can issue resource-mark-unhealthy to just stop that resource.
>> I was thinking of having the rollback optional while cancelling the
>> update. The user may want to cancel the update and issue a new one, but
>> not rollback.
> +1, this is a good idea. I originally thought that you'd never want to
> leave the stack in an intermediate state, but experience with TripleO
> (which can't really do rollbacks) is that sometimes you really do just want
> to hit the panic button and stop the world :D

Yeah, I have heard folks wanting to just cancel and nothing else.

> What do you think?
>> I think it is a good idea, but I see that a resource can be marked
>> unhealthy only after it is done.
> Currently, yes. The idea would be to change that so that if it finds the
> resource IN_PROGRESS then it kills the thread and makes sure the resource
> is in a FAILED state. I

Move the resource to CHECK_FAILED?

> imagine/hope it wouldn't require big changes to your patch, mostly just
> changing where it's triggered from.

I will be more comfortable submitting another patch to implement this
feature :)

The trick would be if the stack update is still running and the resource is
> currently IN_PROGRESS to make sure that we fail the whole stack update
> (rolling back if the user has enabled that).
IMO, we can probably use the cancel  command do this, because when you are
marking a resource as unhealthy, you are
cancelling any action running on that resource. Would the following be ok?
(1) stack-cancel-update <stack_id> will cancel the update, mark cancelled
resources failed and rollback (existing stuff)
(2) stack-cancel-update <stack_id> --no-rollback will just cancel the
update and mark cancelled resources as failed
(3) stack-cancel-update <stack_id> <resource_id> ... <resource_id> Just
stop the action on given resources, mark as CHECK_FAILED, don't do anything
else. The stack won't progress further. Other resources running while
cancel-update will complete.

(3) is like mark unhealthy when stack is IN_PROGRESS.
Also, IMO it doesn't make any sense to run (3) with rollback, as the user
just wants to stop some resources. Please correct me if I am wrong.

> The cancel update would take care of
>> in-progress resources gone bad. I really thought the mark-unhealthy and
>> stack-cancel-update were complementing features than contradictory.
> I'm relaxed about whether this is implemented as part of the
> mark-unhealthy or as a non-default option to cancel-update. The main thing
> is not to put IN_PROGRESS resources into a FAILED state by default whenever
> the user cancels an update.
> Reusing mark-unhealthy as the trigger for this functionality seemed
> appealing because it already has basically the semantics that are going to
> get (tell Heat to replace this resource on the next update) so there should
> be no surprises for users, and because it offers fine-grained control (at
> the resource level rather than the stack level).

I agree.

> cheers,
> Zane.
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstack.org/pipermail/openstack-dev/attachments/20160417/90b694fe/attachment.html>

More information about the OpenStack-dev mailing list