Open Stack

Wed Feb 24 17:17:13 UTC 2016

Excerpts from Anant Patil's message of 2016-02-24 00:56:34 -0800:
> On 24-Feb-16 13:12, Clint Byrum wrote:
> > Excerpts from Anant Patil's message of 2016-02-23 23:08:31 -0800:
> >> Hi,
> >>
> >> I would like to discuss various approaches towards fixing bug
> >> https://launchpad.net/bugs/1533176
> >>
> >> When convergence is on and the stack is stuck, there is no way to
> >> cancel the existing request. This feature was not implemented in
> >> convergence, as the user can issue another update on an in-progress
> >> stack. But if a resource worker is stuck, the new update will wait
> >> forever on it and will never take effect.
> >>
> >> The solution is to implement a cancel request. Since the work for a stack
> >> is distributed among heat engines, the cancel request will not work the
> >> way it does in the legacy path. Many or all of the heat engines might be
> >> running worker threads to provision a stack.
> >>
> >> I could think of two options which I would like to discuss:
> >>
> >> (a) When a user-triggered cancel request is received, set the stack's
> >> current traversal to None, or to any value other than the current
> >> traversal. With this, no new check-resource workers will be
> >> triggered. This is okay as long as no worker is stuck: the
> >> existing workers will finish running, no new check-resource
> >> workers will be triggered, and it will be a graceful cancel. But
> >> workers that are stuck will stay stuck until the stack times out. To
> >> take care of such cases, we will have to implement logic for "polling"
> >> the DB at regular intervals (maybe at each step() of the scheduler task)
> >> and bailing out if the current traversal has been updated. Basically, each
> >> worker will "poll" the DB to see if the current traversal is still valid
> >> and, if not, stop itself. The drawback of this approach is that all the
> >> workers will be hitting the DB, incurring a significant overhead. Besides,
> >> all the stack workers, irrespective of whether they will be cancelled or
> >> not, will keep on hitting the DB. The advantage is that it is probably
> >> easier to implement. Also, if the worker is stuck in a particular "step",
> >> this approach will not work.
> >>
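
To make option (a) concrete, here is a minimal sketch of the per-step polling check. It is not Heat's code: FakeStackDB, run_worker, and TraversalCancelled are hypothetical names, and the in-memory dict stands in for the real stack table.

```python
class TraversalCancelled(Exception):
    """Raised when the stack's current traversal no longer matches ours."""


class FakeStackDB:
    """In-memory stand-in for the stack table (hypothetical, not Heat's DB API)."""

    def __init__(self):
        self._current = {}
        self.reads = 0  # counts polls, to make the DB overhead visible

    def set_current_traversal(self, stack_id, traversal_id):
        self._current[stack_id] = traversal_id

    def get_current_traversal(self, stack_id):
        self.reads += 1
        return self._current.get(stack_id)


def run_worker(db, stack_id, my_traversal, steps):
    """Run the step callables in order, polling the DB before each one."""
    results = []
    for step in steps:
        # Bail out if a cancel (or a newer update) replaced our traversal.
        if db.get_current_traversal(stack_id) != my_traversal:
            raise TraversalCancelled(stack_id)
        results.append(step())
    return results
```

Every worker pays one DB read per step whether or not a cancel ever arrives, which is the overhead described above. A worker blocked inside a single step never reaches the next check, so this alone does not unstick it.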
> > 
> > I think this is the simplest option. And if the polling gets to be too
> > much, you can implement an observer pattern where one worker is just
> > assigned to poll the traversal and if it changes, RPC to the known
> > active workers that they should cancel any jobs using a now-cancelled
> > stack version.
> 
> Hi Clint,
> 
> I see that the observer pattern is simple, but IMO it is not efficient either.
> To implement it, we will have to record in the DB the worker to engine-id
> relationship for all the workers, and then go through all of them and
> send targeted cancel messages. This will also require a thread
> group manager in each engine so that it can stop the thread group
> running workers for the stack.
> 

You have to have that thread group manager anyway, or you can't ever
cancel anything in progress. That same thread group manager could also
be managing timeouts.

Apologies for my lack of understanding of where the implementation
has gone; I thought you would already have that mapping in the DB. If
that's a problem, though, for this case you can have a notification
channel for cancellations and have the management thread listen to
it, with its own local awareness of what is being worked on.
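
A rough sketch of that notification-channel idea, under stated assumptions: each engine's inbox is a plain in-process queue standing in for a real fan-out messaging channel, and EngineCancelManager and broadcast_cancel are hypothetical names, not anything in Heat or oslo.messaging.

```python
import queue
import threading


class EngineCancelManager:
    """Per-engine management thread with local knowledge of running workers."""

    def __init__(self):
        self.inbox = queue.Queue()  # stand-in for this engine's channel subscription
        self._lock = threading.Lock()
        self._flags = {}            # stack_id -> list of threading.Event
        self._thread = threading.Thread(target=self._listen, daemon=True)

    def start(self):
        self._thread.start()

    def register(self, stack_id):
        """Called when a worker starts; returns the flag the worker should watch."""
        flag = threading.Event()
        with self._lock:
            self._flags.setdefault(stack_id, []).append(flag)
        return flag

    def _listen(self):
        while True:
            stack_id = self.inbox.get()
            if stack_id is None:    # shutdown sentinel
                return
            with self._lock:
                flags = self._flags.pop(stack_id, [])
            for flag in flags:      # only workers local to this engine
                flag.set()

    def stop(self):
        self.inbox.put(None)
        self._thread.join()


def broadcast_cancel(engines, stack_id):
    """Fan the cancel out to every engine; no worker-to-engine mapping in the DB."""
    for engine in engines:
        engine.inbox.put(stack_id)
```

Workers would check flag.is_set() between steps, or block on flag.wait(timeout), instead of reading the DB. Engines with no workers for the cancelled stack simply ignore the message.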

[openstack-dev] [heat] convergence cancel messages