Open Stack

Wed Feb 24 09:33:55 UTC 2016

On 24-Feb-16 14:26, Anant Patil wrote:
> On 24-Feb-16 13:12, Clint Byrum wrote:
>> Excerpts from Anant Patil's message of 2016-02-23 23:08:31 -0800:
>>> Hi,
>>>
>>> I would like to discuss various approaches towards fixing bug
>>> https://launchpad.net/bugs/1533176
>>>
>>> When convergence is on and the stack is stuck, there is no way to
>>> cancel the existing request. This feature was not implemented in
>>> convergence, as the user can issue another update on an in-progress
>>> stack. But if a resource worker is stuck, the new update will wait
>>> forever on it and will never take effect.
>>>
>>> The solution is to implement a cancel request. Since the work for a stack
>>> is distributed among heat engines, the cancel request will not work as
>>> it does in the legacy path. Many or all of the heat engines might be
>>> running worker threads to provision a stack.
>>>
>>> I could think of two options which I would like to discuss:
>>>
>>> (a) When a user-triggered cancel request is received, set the stack's
>>> current traversal to None or to some value other than the current
>>> traversal. With this, no new check-resource calls (workers) will be
>>> triggered. This is okay as long as no worker is stuck. The
>>> existing workers will finish running, no new workers will be
>>> triggered, and it will be a graceful cancel. But the workers that
>>> are stuck will stay stuck until the stack times out. To
>>> take care of such cases, we will have to implement logic for "polling"
>>> the DB at regular intervals (maybe at each step() of the scheduler task)
>>> and bail out if the current traversal has been updated. Basically, each
>>> worker will "poll" the DB to see if the current traversal is still valid
>>> and, if not, stop itself. The drawback of this approach is that all the
>>> workers will be hitting the DB and incurring a significant overhead.
>>> Besides, all the stack workers, irrespective of whether they will be
>>> cancelled or not, will keep on hitting the DB. The advantage is that it
>>> is probably easier to implement. Also, if the worker is stuck within a
>>> particular "step", then this approach will not work.
>>>
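The polling idea in option (a) can be sketched roughly as below. This is only an illustration under stated assumptions, not Heat's actual code: `FakeStackDB`, `run_worker`, `cancel`, and `TraversalCancelled` are hypothetical names, and an in-memory dict stands in for the real stack table. It shows both the bail-out check before each step and the cost mentioned above (one DB read per step per worker).

```python
class TraversalCancelled(Exception):
    """Raised when a worker notices its traversal is no longer current."""


class FakeStackDB:
    """Hypothetical in-memory stand-in for the stack table."""

    def __init__(self):
        self._current_traversal = {}
        self.reads = 0  # counts DB hits to make the polling overhead visible

    def get_traversal(self, stack_id):
        self.reads += 1
        return self._current_traversal.get(stack_id)

    def set_traversal(self, stack_id, traversal_id):
        self._current_traversal[stack_id] = traversal_id


def cancel(db, stack_id):
    """User-triggered cancel: invalidate the current traversal."""
    db.set_traversal(stack_id, None)


def run_worker(db, stack_id, my_traversal, steps):
    """Run steps in order, polling the DB before each one.

    A step that blocks internally is never interrupted; the check only
    happens between steps, which is the limitation noted in option (a).
    """
    for step in steps:
        if db.get_traversal(stack_id) != my_traversal:
            raise TraversalCancelled(stack_id)
        step()
```

A worker started with traversal "t1" keeps running until some step boundary after `cancel()` has been called, and then raises `TraversalCancelled` instead of starting the next step.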
>>
>> I think this is the simplest option. And if the polling gets to be too
>> much, you can implement an observer pattern where one worker is just
>> assigned to poll the traversal and if it changes, RPC to the known
>> active workers that they should cancel any jobs using a now-cancelled
>> stack version.
>>
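For comparison, the observer variant suggested here might look something like the following sketch. All names (`Engine`, `TraversalObserver`, `poll_once`, `cancel_traversal`) are hypothetical, a direct method call stands in for the RPC cast, and the observer is assumed to know which engines run workers for a traversal. Only the observer reads the traversal; workers just check a local `threading.Event`.

```python
import threading
from collections import defaultdict


class Engine:
    """Hypothetical engine holding one cancel flag per traversal."""

    def __init__(self, engine_id):
        self.engine_id = engine_id
        self._cancel_events = {}

    def start_worker(self, traversal_id):
        # Workers of the same traversal share one event and check it
        # locally between steps instead of reading the DB.
        return self._cancel_events.setdefault(traversal_id, threading.Event())

    def cancel_traversal(self, traversal_id):
        # Stand-in for the handler of a targeted cancel message.
        event = self._cancel_events.get(traversal_id)
        if event is not None:
            event.set()


class TraversalObserver:
    """Single poller that notifies engines when a traversal changes."""

    def __init__(self, read_traversal):
        self._read_traversal = read_traversal  # callable: stack_id -> traversal
        self._engines = defaultdict(set)       # traversal_id -> engines

    def register(self, traversal_id, engine):
        self._engines[traversal_id].add(engine)

    def poll_once(self, stack_id, watched_traversal):
        """Return the number of engines notified (0 if still current)."""
        if self._read_traversal(stack_id) == watched_traversal:
            return 0
        engines = self._engines.pop(watched_traversal, set())
        for engine in engines:
            engine.cancel_traversal(watched_traversal)
        return len(engines)
```

The trade-off is visible in `register()`: something has to record which engines hold workers for each traversal before targeted cancel messages can be sent.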
>> __________________________________________________________________________
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
> 
> Hi Clint,
> 
> I see that the observer pattern is simple, but IMO it too is not efficient.
> To implement it, we will have to record in the DB the worker-to-engine-id
> relationship for all the workers, and then go through all of them and
> send targeted cancel messages. This will also require a thread
> group manager in each engine so that it can stop the thread group
> running workers for the stack.
> 
> Please help me understand if there is any particular disadvantage in
> option (b) that I am not missing.

Sorry, I meant I am missing :)

> 
> -- Anant
> 

[openstack-dev] [heat] convergence cancel messages