[openstack-dev] [heat] Convergence: Detecting and handling worker failures

Anant Patil anant.patil at hpe.com
Wed Sep 30 14:03:46 UTC 2015


On 30-Sep-15 18:13, Ryan Brown wrote:
> On 09/30/2015 03:10 AM, Anant Patil wrote:
>> Hi,
>>
>> One of the remaining items in convergence is detecting and handling engine
>> (the engine worker) failures, and here are my thoughts.
>>
>> Background: Since the work is distributed among heat engines, heat needs
>> some means to detect an engine failure, pick up the tasks from the failed
>> engine, and re-distribute or re-run them.
>>
>> One simple way is to poll the DB to detect liveness by checking the
>> table populated by heat-manage. Each engine records its presence
>> periodically by updating its current timestamp. All the engines will have
>> a periodic task that checks the DB for the liveness of the other engines.
>> Each engine checks the timestamps updated by the other engines, and if it
>> finds one that is older than the timestamp update period, it detects a
>> failure. When this happens, the remaining engines, as and when they
>> detect the failure, will try to acquire the locks for the in-progress
>> resources that were handled by the engine that died. They will then run
>> the tasks to completion.
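
To make the DB-polling idea above concrete, here is a rough sketch in
Python. It is only an illustration under stated assumptions, not heat
code: the in-memory dicts stand in for the DB tables, and the names
find_dead_engines, claim_resources, HEARTBEAT_INTERVAL and GRACE_FACTOR
are hypothetical. The grace factor (tolerating one missed update) is also
my assumption; the description above only says "older than the update
period".

```python
from datetime import datetime, timedelta

# Hypothetical values; the thread does not specify them.
HEARTBEAT_INTERVAL = timedelta(seconds=30)
GRACE_FACTOR = 2  # assumption: tolerate one missed update


def find_dead_engines(heartbeats, now, self_id):
    """Return the sorted ids of engines whose last heartbeat is too old.

    heartbeats maps engine id -> datetime of the last recorded update
    (a stand-in for the table each engine refreshes periodically).
    """
    cutoff = now - GRACE_FACTOR * HEARTBEAT_INTERVAL
    return sorted(
        engine_id
        for engine_id, last_seen in heartbeats.items()
        if engine_id != self_id and last_seen < cutoff
    )


def claim_resources(locks, dead_engine, new_owner):
    """Take over in-progress resources that were locked by dead_engine.

    locks maps resource id -> owning engine id. With a real DB this
    would have to be an atomic conditional update, so that only one
    surviving engine wins each lock; a plain dict cannot show that.
    """
    claimed = []
    for resource_id, owner in list(locks.items()):
        if owner == dead_engine:
            locks[resource_id] = new_owner
            claimed.append(resource_id)
    return claimed
```

Each engine would call find_dead_engines from its periodic task and then
claim_resources for every engine reported dead, before re-running the
claimed tasks. The race between several survivors claiming the same
resource is exactly the part that is hard to get right by hand.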
> 
> Implementing our own locking system, even a "simple" one, sounds like a 
> recipe for major bugs to me. I agree with your assessment that tooz is a 
> better long-run decision.
> 
>> Another option is to use a coordination library like the community-owned
>> tooz (http://docs.openstack.org/developer/tooz/), which supports
>> distributed locking and leader election. We would use it to elect a leader
>> among the heat engines, and the leader would be responsible for running
>> the periodic tasks that check the state of each engine and for
>> distributing tasks to other engines when one fails. The advantage, IMHO,
>> will be simplified heat code. Also, we can move the timeout task to the
>> leader, which will run the timeouts for all the stacks and send a signal
>> to abort the operation when a timeout happens. The downside: an external
>> resource like ZooKeeper/memcached etc. is needed for leader election.
> 
> That's not necessarily true. For single-node installations (devstack, 
> TripleO underclouds, etc) tooz offers file and IPC backends that don't 
> need an extra service. Tooz's MySQL/PostgreSQL backends only provide 
> distributed locking functionality, so we may need to depend on the 
> memcached/redis/zookeeper backends for multi-node installs.
> 

Definitely, for single-node installations, one can rely on IPC as the
backend. As a convention, making IPC the default provider for single-node
setups would be helpful for running heat in devstack or a development
environment. I was referring to an external resource from a holistic
perspective, as most deployments are multi-node with active-active HA.

> Even if tooz doesn't provide everything we need, I'm sure patches
> would be welcome.
>
I am sure when we dive in, we will find use cases for tooz as well.

>> In the long run, IMO, using a library like tooz will be useful for heat.
>> A lot of the boilerplate needed for locking and running centralized tasks
>> (such as timeout) will not be needed in heat. Given that we are moving
>> towards distribution of tasks and horizontal scaling is preferred, it
>> will be advantageous to use such a library.
>>
>> Please share your thoughts.
>>
>> - Anant



