[openstack-dev] [heat] convergence cancel messages
Clint Byrum
clint at fewbar.com
Wed Feb 24 17:18:50 UTC 2016
Excerpts from Anant Patil's message of 2016-02-23 23:08:31 -0800:
> Hi,
>
> I would like to discuss various approaches towards fixing bug
> https://launchpad.net/bugs/1533176
>
> When convergence is on and the stack is stuck, there is no way to
> cancel the existing request. This feature was not implemented in
> convergence, because the user can issue another update on an
> in-progress stack. But if a resource worker is stuck, the new update
> will wait forever on it and will never take effect.
>
> The solution is to implement a cancel request. Since the work for a
> stack is distributed among heat engines, the cancel request cannot work
> the way it does in the legacy path. Many or all of the heat engines
> might be running worker threads to provision a stack.
>
> I could think of two options which I would like to discuss:
>
> (a) When a user-triggered cancel request is received, set the stack's
> current traversal to None, or to any value other than the current
> traversal. With this, no new check-resource workers will be triggered.
> This is okay as long as no worker is stuck: the existing workers will
> finish running, nothing new will be started, and it will be a graceful
> cancel. But workers that are stuck will stay stuck until the stack
> times out. To take care of such cases, we would have to implement logic
> for "polling" the DB at regular intervals (maybe at each step() of the
> scheduler task) and bailing out if the current traversal has been
> updated. Basically, each worker will "poll" the DB to see if the
> current traversal is still valid and, if not, stop itself. The drawback
> of this approach is that all the workers will be hitting the DB and
> incurring a significant overhead. Besides, all the stack workers,
> irrespective of whether they will be cancelled or not, will keep
> hitting the DB. The advantage is that it is probably easier to
> implement. Also, if the worker is stuck within a particular "step",
> this approach will not work.
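As a rough sketch of what (a) amounts to, assuming the DB read is hidden
behind a callable: run_steps, TraversalCancelled, get_current_traversal and
poll_every are all made-up names for illustration, not anything in Heat's
scheduler.

```python
class TraversalCancelled(Exception):
    """Raised when the stack's current traversal no longer matches ours."""


def run_steps(steps, my_traversal, get_current_traversal, poll_every=1):
    """Run step callables, re-checking the traversal between steps.

    get_current_traversal is a hypothetical stand-in for the DB read.
    poll_every is the number of steps to run between two reads.
    Returns the number of steps completed.
    """
    done = 0
    for index, step in enumerate(steps):
        # One DB read per poll_every steps; bail out if we were superseded.
        if index % poll_every == 0 and get_current_traversal() != my_traversal:
            raise TraversalCancelled(done)
        step()  # a step that never returns is never interrupted
        done += 1
    return done
```

The check only runs between steps, so this shows both drawbacks named above:
every worker pays for the reads whether or not it is ever cancelled, and a
worker blocked inside a single step never reaches the next check.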
>
> (b) Another approach is to send a cancel message to all the heat
> engines when one receives a stack cancel request. The idea is to use
> the thread group manager in each engine to keep track of the threads
> running for a stack, and stop the thread group when a cancel message is
> received. The advantage is that the messages to cancel stack workers
> are sent only when required and there is no other overhead. The
> drawback is that the cancel message is broadcast to all heat engines,
> even if they are not running any workers for the given stack, though in
> such cases it will just be a no-op for the heat engine (the message
> will be gracefully discarded).
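To make the per-engine side of (b) concrete, here is a minimal sketch. It
assumes a small registry keyed by stack ID; StackWorkerRegistry and its
methods are hypothetical and are not Heat's thread group manager. It also
signals workers cooperatively with threading.Event instead of stopping
threads, which is a simplification.

```python
import threading
from collections import defaultdict


class StackWorkerRegistry:
    """Hypothetical per-engine registry of cancel events, keyed by stack ID."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = defaultdict(set)

    def register(self, stack_id):
        """Called when a worker starts; returns the event it should watch."""
        event = threading.Event()
        with self._lock:
            self._events[stack_id].add(event)
        return event

    def unregister(self, stack_id, event):
        """Called when a worker finishes."""
        with self._lock:
            events = self._events.get(stack_id)
            if events is None:
                return
            events.discard(event)
            if not events:
                del self._events[stack_id]

    def handle_cancel(self, stack_id):
        """Signal every local worker for stack_id; return how many there were.

        An engine with no workers for the stack returns 0, i.e. a no-op.
        """
        with self._lock:
            events = list(self._events.get(stack_id, ()))
        for event in events:
            event.set()
        return len(events)
```

An engine that receives the broadcast but runs nothing for that stack just
gets 0 back from handle_cancel, which is the "gracefully discarded" case.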
Oh hah, I just sent (b) as an option to avoid (a) without really
thinking about (b) again.
I don't think the cancel broadcasts are all that much of a drawback. I
do think you need to rate limit cancels though, or you give users the
chance to DDoS the system.
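For the rate limit, something as simple as a minimum interval per stack might
be enough. A sketch, with CancelRateLimiter and min_interval as made-up names
and the interval value left as an assumption:

```python
import time


class CancelRateLimiter:
    """Allow at most one cancel broadcast per stack every min_interval seconds."""

    def __init__(self, min_interval, clock=time.monotonic):
        self._min_interval = min_interval
        self._clock = clock  # injectable so the limiter can be tested
        self._last_allowed = {}

    def allow(self, stack_id):
        """Return True if a cancel for stack_id may be broadcast now."""
        now = self._clock()
        last = self._last_allowed.get(stack_id)
        if last is not None and now - last < self._min_interval:
            return False
        self._last_allowed[stack_id] = now
        return True
```

This only limits cancels seen by one process. With several engines accepting
requests, the last-allowed timestamp would have to live somewhere shared.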