[openstack-dev] [heat] Convergence: Detecting and handling worker failures

Dulko, Michal michal.dulko at intel.com
Wed Sep 30 14:25:53 UTC 2015


On Wed, 2015-09-30 at 02:29 -0700, Clint Byrum wrote:
> Excerpts from Anant Patil's message of 2015-09-30 00:10:52 -0700:
> > Hi,
> > 
> > One of the remaining items in convergence is detecting and handling
> > engine (the engine worker) failures, and here are my thoughts.
> > 
> > Background: Since the work is distributed among heat engines, heat
> > needs some means to detect a failure, pick up the tasks from the
> > failed engine and re-distribute or re-run them.
> > 
> > One simple way is to poll the DB to detect liveness by checking the
> > table populated by heat-manage. Each engine records its presence
> > periodically by updating the current timestamp. All the engines will
> > have a periodic task that checks the DB for the liveness of the other
> > engines. Each engine will check the timestamps updated by the other
> > engines, and if it finds one older than the timestamp update period it
> > detects a failure. When this happens, the remaining engines, as and
> > when they detect the failure, will try to acquire the lock for
> > in-progress resources that were handled by the engine which died. They
> > will then run the tasks to completion.
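
To make that concrete, a heartbeat check along these lines could look
roughly like the sketch below (the engine_heartbeat table, column names
and interval are made up for illustration; this is not the actual
heat-manage schema):

    # Illustrative only: a self-contained sqlite3 sketch of the
    # heartbeat scheme; names are hypothetical, not Heat's schema.
    import datetime
    import sqlite3

    HEARTBEAT_INTERVAL = 30  # seconds between timestamp updates

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE engine_heartbeat "
                 "(engine_id TEXT PRIMARY KEY, updated_at TIMESTAMP)")

    def report_presence(engine_id):
        # Each engine periodically refreshes its row with the current time.
        conn.execute("INSERT OR REPLACE INTO engine_heartbeat VALUES (?, ?)",
                     (engine_id, datetime.datetime.utcnow()))
        conn.commit()

    def find_dead_engines():
        # Anything not refreshed within two heartbeat periods is treated
        # as dead; the survivors then race for the locks of its
        # in-progress resources.
        cutoff = datetime.datetime.utcnow() - datetime.timedelta(
            seconds=2 * HEARTBEAT_INTERVAL)
        rows = conn.execute("SELECT engine_id FROM engine_heartbeat "
                            "WHERE updated_at < ?", (cutoff,))
        return [row[0] for row in rows]

    report_presence("engine-1")
    print(find_dead_engines())  # [] until engine-1 stops reporting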
> > 
> > Another option is to use a coordination library like the
> > community-owned tooz (http://docs.openstack.org/developer/tooz/),
> > which supports distributed locking and leader election. We would use
> > it to elect a leader among the heat engines, and that leader would be
> > responsible for running the periodic tasks that check the state of
> > each engine and for distributing the tasks to other engines when one
> > fails. The advantage, IMHO, will be simplified heat code. Also, we can
> > move the timeout task to the leader, which will run the timeout for
> > all the stacks and send a signal to abort the operation when a timeout
> > happens. The downside: an external resource like ZooKeeper/memcached
> > etc. is needed for leader election.
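
For comparison, a rough sketch of what the tooz-based variant could look
like (the group name, member id and ZooKeeper address are placeholders,
and the callback is just where the leader-only duties would go):

    import time

    from tooz import coordination

    GROUP = b"heat-engines"

    def on_elected(event):
        # When elected, this engine would take over the periodic liveness
        # checks, redistribute the tasks of failed engines and drive the
        # stack timeouts.
        print("elected leader of %s" % event.group_id)

    coord = coordination.get_coordinator("zookeeper://127.0.0.1:2181",
                                         b"engine-1")
    coord.start()

    try:
        coord.create_group(GROUP).get()
    except coordination.GroupAlreadyExist:
        pass
    coord.join_group(GROUP).get()

    coord.watch_elected_as_leader(GROUP, on_elected)

    while True:
        coord.heartbeat()      # keep our membership alive
        coord.run_watchers()   # fires on_elected if we win the election
        time.sleep(1)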
> > 
> 
> It's becoming increasingly clear that OpenStack services in general need
> to look at distributed locking primitives. There's a whole spec for that
> right now:
> 
> https://review.openstack.org/#/c/209661/
> 
> I suggest joining that conversation, and embracing a DLM as the way to
> do this.
> 
> Also, the leader election should be per-stack, and the leader selection
> should be heavily weighted based on a consistent hash algorithm so that
> you get an even distribution of stacks to workers. You can look at how
> Ironic breaks up all of its nodes that way. They're using a lock
> similar to the one Heat uses now, so the two projects can collaborate
> nicely on a real solution.
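
As a toy illustration of that idea (not Ironic's actual implementation;
the replica count and names are arbitrary), a consistent hash ring that
maps stacks onto live engines can be as small as this:

    import bisect
    import hashlib

    class HashRing(object):
        def __init__(self, engines, replicas=64):
            # Each engine gets several points on the ring so the load
            # stays even when engines join or leave.
            self._ring = {}
            for engine in engines:
                for r in range(replicas):
                    self._ring[self._hash("%s-%d" % (engine, r))] = engine
            self._keys = sorted(self._ring)

        @staticmethod
        def _hash(value):
            return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

        def get_engine(self, stack_id):
            # Walk clockwise from the stack's hash to the next engine point.
            idx = bisect.bisect(self._keys, self._hash(stack_id))
            return self._ring[self._keys[idx % len(self._keys)]]

    ring = HashRing(["engine-1", "engine-2", "engine-3"])
    print(ring.get_engine("stack-uuid-1234"))  # stable until membership
                                               # changes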

It is worth mentioning that there's also the idea of using both Tooz and
a hash ring approach together [1].

There was an enormous discussion on this list when Cinder faced a
similar problem [2]. It eventually turned into a discussion on whether
we need a common DLM solution in OpenStack [3]. In the end, Cinder is
currently trying to achieve A/A (active-active) capabilities by using
compare-and-swap (CAS) DB operations. Detection of failed services is
still being discussed, but the most mature solution to this problem was
described in [4]; it is based on database checks.
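
For reference, the CAS idea boils down to conditional UPDATEs whose
affected row count tells a service whether it actually won the state
transition; a toy version (the table and states are made up, not
Cinder's real schema) looks like:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE resource "
                 "(id TEXT PRIMARY KEY, status TEXT, owner TEXT)")
    conn.execute("INSERT INTO resource VALUES ('res-1', 'available', NULL)")

    def cas_claim(resource_id, owner):
        # The WHERE clause encodes the expected current state; rowcount
        # tells us whether our update (and not a competitor's) applied.
        cur = conn.execute(
            "UPDATE resource SET status = 'in-progress', owner = ? "
            "WHERE id = ? AND status = 'available'", (owner, resource_id))
        conn.commit()
        return cur.rowcount == 1

    print(cas_claim("res-1", "service-A"))  # True  -> claimed
    print(cas_claim("res-1", "service-B"))  # False -> already taken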

Given that many projects are facing similar problems (well, it's no
surprise that a distributed system faces the general problems of
distributed systems…), we should certainly discuss how to approach that
class of issues. That's why a cross-project Design Summit session on the
topic was proposed [5] (this one is by harlowja, but I know that Mike
Perez also wanted to propose such a session).

[1] https://review.openstack.org/#/c/195366/
[2] http://lists.openstack.org/pipermail/openstack-dev/2015-July/070683.html
[3] http://lists.openstack.org/pipermail/openstack-dev/2015-August/071262.html
[4] http://gorka.eguileor.com/simpler-road-to-cinder-active-active/
[5] http://odsreg.openstack.org/cfp/details/8
