[openstack-dev] [heat] Convergence: Detecting and handling worker failures

Joshua Harlow harlowja at outlook.com
Wed Sep 30 15:16:34 UTC 2015


Clint Byrum wrote:
> Excerpts from Anant Patil's message of 2015-09-30 00:10:52 -0700:
>> Hi,
>>
>> One of the remaining items in convergence is detecting and handling
>> engine (the engine worker) failures; here are my thoughts.
>>
>> Background: Since the work is distributed among heat engines, heat
>> needs some means to detect a failure, pick up the tasks from the
>> failed engine, and re-distribute or re-run them.
>>
>> One simple way is to poll the DB to detect liveness by checking the
>> table populated by heat-manage. Each engine records its presence
>> periodically by updating the current timestamp. All the engines will
>> have a periodic task that checks the DB for the liveness of the other
>> engines. Each engine will check the timestamps updated by the other
>> engines, and if it finds one that is older than the update period, it
>> detects a failure. When this happens, the remaining engines, as and
>> when they detect the failure, will try to acquire the locks for the
>> in-progress resources that were handled by the engine that died. They
>> will then run those tasks to completion.
>>
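
A rough sketch of the polling check being described, just to make the idea
concrete (the row layout and the heartbeat period here are made up, not
what heat-manage actually writes):

    # Sketch only: assumes rows of (engine_id, updated_at), where each
    # engine refreshes its own updated_at on every heartbeat.
    import datetime

    HEARTBEAT_PERIOD = datetime.timedelta(seconds=60)  # assumed periodicity

    def find_dead_engines(rows, now=None):
        """Return engine ids whose last heartbeat is older than the period."""
        now = now or datetime.datetime.utcnow()
        return [engine_id for engine_id, updated_at in rows
                if now - updated_at > HEARTBEAT_PERIOD]

    # Each surviving engine would then try to acquire the locks on the
    # in-progress resources held by the dead engines and finish those tasks.
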
>> Another option is to use a coordination library like the
>> community-owned tooz (http://docs.openstack.org/developer/tooz/),
>> which supports distributed locking and leader election. We would use
>> it to elect a leader among the heat engines, and that leader will be
>> responsible for running the periodic tasks that check the state of
>> each engine and for distributing the tasks to other engines when one
>> fails. The advantage, IMHO, will be simplified heat code. Also, we can
>> move the timeout task to the leader, which will run the timeouts for
>> all the stacks and send a signal to abort the operation when a timeout
>> happens. The downside: an external resource like ZooKeeper or
>> memcached is needed for leader election.
>>
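
For reference, basic leader election with tooz looks roughly like this; the
backend URL, member id and group name below are placeholders, and you still
need something like ZooKeeper running behind it:

    # Rough sketch of tooz leader election; names/URLs are placeholders and
    # error handling is mostly omitted.
    import time

    from tooz import coordination

    coordinator = coordination.get_coordinator(
        'zookeeper://127.0.0.1:2181', b'heat-engine-1')
    coordinator.start()
    try:
        coordinator.create_group(b'heat-engines').get()
    except coordination.GroupAlreadyExist:
        pass
    coordinator.join_group(b'heat-engines').get()

    def when_elected_leader(event):
        # The leader would run the liveness checks, hand out work from
        # dead engines and drive the per-stack timeout tasks.
        pass

    coordinator.watch_elected_as_leader(b'heat-engines', when_elected_leader)

    while True:
        coordinator.heartbeat()
        coordinator.run_watchers()
        time.sleep(1)
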
>
> It's becoming increasingly clear that OpenStack services in general need
> to look at distributed locking primitives. There's a whole spec for that
> right now:
>
> https://review.openstack.org/#/c/209661/

As the author of said spec (Chronicles of a DLM) I fully agree that we 
shouldn't be reinventing this (again, and again). Also as the author of 
that spec, I'd like to encourage others to get involved in adding their 
use-cases/stories to it. I have done some initial analysis of projects 
and documented some of the re-creation of DLM-like things in it, and I'm 
very much open to including others' stories as well. In the end I hope we 
can pick a DLM (ideally a single one) that has a wide community, is 
structurally sound, is easily usable & operable, is open, and will help 
achieve and grow (what I think are) the larger long-term goals (and 
health) of many OpenStack projects.

Nicely formatted RST (for the latest uploaded spec) also viewable at:

http://docs-draft.openstack.org/61/209661/22/check/gate-openstack-specs-docs/ced42e7//doc/build/html/specs/chronicles-of-a-dlm.html#chronicles-of-a-distributed-lock-manager

>
> I suggest joining that conversation, and embracing a DLM as the way to
> do this.
>
> Also, the leader election should be per-stack, and the leader selection
> should be heavily weighted based on a consistent hash algorithm so that
> you get an even distribution of stacks to workers. You can look at how
> Ironic breaks up all of the nodes that way. They're using a similar lock
> to the one Heat uses now, so the two projects can collaborate nicely on
> a real solution.
>
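
A toy illustration of the consistent-hash idea for spreading stacks across
engines (this is not Ironic's implementation, just the general shape: when
an engine disappears, only its stacks get remapped):

    # Toy consistent-hash ring mapping stack ids to engines.
    import bisect
    import hashlib

    def _hash(key):
        return int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)

    class HashRing(object):
        def __init__(self, engines, replicas=32):
            # Each engine is placed on the ring at several hashed points so
            # stacks spread evenly across engines.
            self._ring = sorted(
                (_hash('%s-%d' % (engine, i)), engine)
                for engine in engines for i in range(replicas))
            self._keys = [h for h, _ in self._ring]

        def engine_for(self, stack_id):
            # The stack maps to the first ring point at or after its hash.
            idx = bisect.bisect(self._keys, _hash(stack_id)) % len(self._ring)
            return self._ring[idx][1]

    ring = HashRing(['engine-1', 'engine-2', 'engine-3'])
    print(ring.engine_for('stack-0a1b2c'))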