[openstack-dev] [heat] health maintenance in autoscaling groups
Clint Byrum
clint at fewbar.com
Wed Jul 2 17:54:49 UTC 2014
Excerpts from Qiming Teng's message of 2014-07-02 00:02:14 -0700:
> Just some random thoughts below ...
>
> On Tue, Jul 01, 2014 at 03:47:03PM -0400, Mike Spreitzer wrote:
> > In AWS, an autoscaling group includes health maintenance functionality ---
> > both an ability to detect basic forms of failures and an ability to react
> > properly to failures detected by itself or by a load balancer. What is
> > the thinking about how to get this functionality in OpenStack? Since
>
> We are prototyping a solution to this problem at the IBM Research - China
> lab. The idea is to leverage oslo.messaging and Ceilometer events for
> failure detection and handling of instances (and possibly other
> resources such as ports, security groups, ...).
>
Hm.. perhaps you should be contributing some reviews here as you may
have some real insight:
https://review.openstack.org/#/c/100012/
This sounds a lot like what we're working on for continuous convergence.
> > OpenStack's OS::Heat::AutoScalingGroup has a more general member type,
> > what is the thinking about what failure detection means (and how it would
> > be accomplished, communicated)?
>
> Now that most OpenStack services are making use of oslo.notify, in
> theory a service should be able to send/receive events related to
> resource status. In our current prototype, at least host failures
> (detected in Nova and reported with a patch), VM failures (detected by
> Nova), and some lifecycle events of other resources can be detected and
> then collected by Ceilometer. There is certainly the possibility of
> listening to the message queue directly from Heat, but we have only
> implemented the Ceilometer-centric approach.
>
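Listening to the bus directly really isn't much code, for what it's
worth. Something along these lines would do it (just a sketch: the
transport URL, the topic, and the two handle_member_* helpers are
placeholders, and oslo.messaging names have shifted a bit between
releases):

import oslo_messaging as messaging
from oslo_config import cfg


def handle_member_gone(instance_id):
    # Placeholder: this is where the group would be told a member
    # needs replacing.
    print('member gone: %s' % instance_id)


def handle_member_alive(instance_id):
    # Placeholder: e.g. cancel a pending replacement.
    print('member alive: %s' % instance_id)


class InstanceEventEndpoint(object):
    """React to a couple of interesting compute lifecycle events."""

    def info(self, ctxt, publisher_id, event_type, payload, metadata):
        instance_id = payload.get('instance_id')
        if event_type == 'compute.instance.delete.end':
            handle_member_gone(instance_id)
        elif event_type == 'compute.instance.create.end':
            handle_member_alive(instance_id)


def main():
    # Transport URL and topic are illustrative only.
    transport = messaging.get_notification_transport(
        cfg.CONF, url='rabbit://guest:guest@localhost:5672/')
    targets = [messaging.Target(topic='notifications')]
    listener = messaging.get_notification_listener(
        transport, targets, [InstanceEventEndpoint()], executor='threading')
    listener.start()
    listener.wait()


if __name__ == '__main__':
    main()

The plumbing is the easy part; the open question is who owns such a
listener and how its conclusions get fed back into the stack.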
> >
> > I have not found design discussion of this; have I missed something?
> >
> > I suppose the natural answer for OpenStack would be centered around
> > webhooks. An OpenStack scaling group (OS SG = OS::Heat::AutoScalingGroup
> > or AWS::AutoScaling::AutoScalingGroup or OS::Heat::ResourceGroup or
> > OS::Heat::InstanceGroup) could generate a webhook per member, with the
> > meaning of the webhook being that the member has been detected as dead and
> > should be deleted and removed from the group --- and a replacement member
> > created if needed to respect the group's minimum size.
>
> Well, I would suggest we generalize this into an event messaging or
> signaling solution, instead of just 'webhooks'. The reason is that
> webhooks as implemented today do not carry a payload of useful
> information -- I'm referring to the alarms in Ceilometer.
>
> There are other cases as well. A member failure could be caused by a
> temporary communication problem, which means the member may come back
> quickly while a replacement is already being created. That may mean we
> have to respond to an 'online' event in addition to an 'offline' event?
>
The ideas behind convergence help a lot here. Skew happens in distributed
systems, so we expect it constantly. In the extra-capacity situation
above, we would just deal with it by scaling back down. There are also
situations where we might accidentally create two physical resources
because we got a 500 from the API even though the resource was actually
created. This is the same problem, and it has the same answer: pick
one and scale down (and if this is a critical server like a database,
we'll need lifecycle callbacks that will prevent suddenly killing a node
in a way that would cost you uptime or recovery time).
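To make that concrete, the reconciliation decision itself is
conceptually tiny; something like this toy helper (nothing here is real
Heat code, it only shows the shape of comparing observed members against
the desired size while respecting a lifecycle hold):

def members_to_remove(observed, desired_size, lifecycle_held):
    """Return the surplus members that are safe to delete.

    observed       -- list of (member_id, created_at) tuples
    desired_size   -- the group's current desired capacity
    lifecycle_held -- set of member_ids a callback asked us to keep
    """
    surplus = len(observed) - desired_size
    if surplus <= 0:
        return []
    # Prefer to drop the youngest duplicates; older members are more
    # likely to be the ones clients already know about.
    candidates = sorted(observed, key=lambda m: m[1], reverse=True)
    victims = []
    for member_id, _created_at in candidates:
        if len(victims) == surplus:
            break
        if member_id in lifecycle_held:
            continue  # a callback is delaying/cancelling this deletion
        victims.append(member_id)
    return victims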
> > When the member is
> > a Compute instance and Ceilometer exists, the OS SG could define a
> > Ceilometer alarm for each member (by including these alarms in the
> > template generated for the nested stack that is the SG), programmed to hit
> > the member's deletion webhook when death is detected (I imagine there are
> > a few ways to write a Ceilometer condition that detects instance death).
>
> Yes. Compute instance failure can be detected with a Ceilometer plugin.
> In our prototype, we developed a Dispatcher plugin that can handle
> events like 'compute.instance.delete.end' and 'compute.instance.create.end'
> after they have been processed based on an event_definitions.yaml file.
> There could be other ways, I think.
>
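One of those other ways, just to sketch it: the SG already generates a
nested stack for its members, so it could drop a per-member
OS::Ceilometer::Alarm in there pointed at that member's deletion
webhook. Roughly like this (the property names follow
OS::Ceilometer::Alarm from memory, so treat them as approximate, and the
helper and webhook URL are invented for illustration):

def member_health_alarm(member_id, delete_webhook_url):
    """Return a template snippet for a per-member liveness alarm."""
    return {
        'type': 'OS::Ceilometer::Alarm',
        'properties': {
            # Fewer than one 'instance' sample per minute for three
            # consecutive periods is treated as "member is dead".
            'meter_name': 'instance',
            'statistic': 'count',
            'period': 60,
            'evaluation_periods': 3,
            'threshold': 1,
            'comparison_operator': 'lt',
            'matching_metadata': {'resource_id': member_id},
            'alarm_actions': [delete_webhook_url],
            # A dead instance may simply stop emitting samples, which
            # lands the alarm in the insufficient-data state rather
            # than 'alarm', so wire that state to the webhook too.
            'insufficient_data_actions': [delete_webhook_url],
        },
    }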
> The problem here today is the recovery of an SG member. If it is a
> compute instance, we can 'reboot', 'rebuild', 'evacuate', or 'migrate'
> it, just to name a few options. The most brutal way to do this is what
> HARestarter does today -- a delete followed by a create.
>
Right, so lifecycle callbacks are useful here, as we can expose an
interface for delaying and even cancelling a lifecycle event.
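To be explicit about what I mean, the shape would be something like the
following (every name here is made up -- none of this exists in Heat
today -- it only shows where a callback would get the chance to delay or
cancel a recovery, including the HARestarter-style delete+create Qiming
mentioned):

RECOVERY_ACTIONS = ('reboot', 'rebuild', 'evacuate', 'replace')


def recover_member(member, action, callbacks):
    """Try to recover a failed group member, giving callbacks a veto."""
    assert action in RECOVERY_ACTIONS
    for cb in callbacks:
        verdict = cb.pre_recover(member, action)  # invented hook name
        if verdict == 'cancel':
            return 'recovery cancelled by %s' % cb
        if verdict == 'delay':
            return 'recovery delayed, will retry later'

    if action == 'replace':
        # The brutal HARestarter-style path: delete the member and let
        # the group's desired size pull in a replacement.
        member.delete()                           # invented
        return 'replacement will be created by the group'

    # The softer options get handed to Nova on the member's behalf.
    member.nova_action(action)                    # invented
    return 'recovery action %s issued' % action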