Open Stack

Wed Sep 30 07:10:52 UTC 2015

Hi,

One of remaining items in convergence is detecting and handling engine
(the engine worker) failures, and here are my thoughts.

Background: Since the work is distributed among heat engines, by some
means heat needs to detect the failure and pick up the tasks from failed
engine and re-distribute or run the task again.

One of the simple way is to poll the DB to detect the liveliness by
checking the table populated by heat-manage. Each engine records its
presence periodically by updating current timestamp. All the engines
will have a periodic task for checking the DB for liveliness of other
engines. Each engine will check for timestamp updated by other engines
and if it finds one which is older than the periodicity of timestamp
updates, then it detects a failure. When this happens, the remaining
engines, as and when they detect the failures, will try to acquire the
lock for in-progress resources that were handled by the engine which
died. They will then run the tasks to completion.

Another option is to use a coordination library like the community owned
tooz (http://docs.openstack.org/developer/tooz/) which supports
distributed locking and leader election. We use it to elect a leader
among heat engines and that will be responsible for running periodic
tasks for checking state of each engine and distributing the tasks to
other engines when one fails. The advantage, IMHO, will be simplified
heat code. Also, we can move the timeout task to the leader which will
run time out for all the stacks and sends signal for aborting operation
when timeout happens. The downside: an external resource like
Zookeper/memcached etc are needed for leader election.

In the long run, IMO, using a library like tooz will be useful for heat.
A lot of boiler plate needed for locking and running centralized tasks
(such as timeout) will not be needed in heat. Given that we are moving
towards distribution of tasks and horizontal scaling is preferred, it
will be advantageous to use them.

Please share your thoughts.

- Anant

Open Stack

[openstack-dev] [heat] Convergence: Detecting and handling worker failures

OpenStack

Community

Documentation

Branding & Legal