[openstack-dev] [heat] health maintenance in autoscaling groups

Qiming Teng tengqim at linux.vnet.ibm.com
Thu Jul 3 03:52:25 UTC 2014


On Wed, Jul 02, 2014 at 11:02:36AM +0100, Steven Hardy wrote:
> On Wed, Jul 02, 2014 at 03:02:14PM +0800, Qiming Teng wrote:
> > Just some random thoughts below ...
> > 
> > On Tue, Jul 01, 2014 at 03:47:03PM -0400, Mike Spreitzer wrote:
> > > In AWS, an autoscaling group includes health maintenance functionality --- 
> > > both an ability to detect basic forms of failures and an ability to react 
> > > properly to failures detected by itself or by a load balancer.  What is 
> > > the thinking about how to get this functionality in OpenStack?  Since 
> > 
> > We are prototyping a solution to this problem at IBM Research - China
> > lab.  The idea is to leverage oslo.messaging and ceilometer events for
> > instance (possibly other resources such as port, securitygroup ...)
> > failure detection and handling.
> 
> This sounds interesting, are you planning to propose a spec for heat
> describing this work and submit your patches to heat?

Steve, this work is still a prototype; the loop has yet to be closed.
The basic idea is:

1. Ensure nova server redundancy by providing a VMCluster resource type
in the form of a Heat plugin.  It could be contributed back to the
community if it proves useful.  I have two concerns: 1) it is not a
generic solution yet, because it lacks support for template resources;
2) instead of a new resource type, a better approach might be to add an
optional group of properties to the Nova server resource specifying its
HA requirements.  (A skeleton of such a plugin is sketched after this
list.)

2. Detection of Host/VM failures.  Currently we rely on Nova's
detection of VM lifecycle events; I'm not sure it is applicable to
hypervisors other than KVM.  We have some patches to the ServiceGroup
service in Nova so that Host failures can be detected and reported too.
This could become a patch to Nova.  (A listener sketch follows this
list.)

3. Recovery from Host/VM failures.  We can either use the Events
collected by Ceilometer directly, or have Ceilometer convert Events
into Samples so that we can reuse the Alarm service (evaluator +
notifier).  Neither path is working yet.  On the Event path, we are
blocked by the authentication problem; on the Alarm path, we don't know
how to carry a payload via the AlarmUrl.  (The last sketch below shows
what we would like to send.)
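
To make point 1 concrete, here is roughly what the plugin skeleton
looks like.  The type name, class and properties are our prototype's
own, not anything upstream:

  from heat.engine import properties
  from heat.engine import resource


  class VMCluster(resource.Resource):
      """Prototype resource keeping N redundant nova servers alive."""

      properties_schema = {
          'size': properties.Schema(
              properties.Schema.INTEGER,
              'Number of redundant servers to maintain.',
              default=1),
          'flavor': properties.Schema(
              properties.Schema.STRING,
              'Flavor used for each cluster member.',
              required=True),
          'image': properties.Schema(
              properties.Schema.STRING,
              'Image used for each cluster member.',
              required=True),
      }

      def handle_create(self):
          # Create 'size' servers through the nova client and stash
          # their ids in resource data for the failure handler.
          pass


  def resource_mapping():
      # Hook Heat uses to register plugin resource types.
      return {'IBM::Research::VMCluster': VMCluster}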
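
For point 2, consuming the lifecycle notifications with oslo.messaging
looks roughly like this (the exchange/topic names are what nova uses by
default, but please treat them as assumptions of mine):

  from oslo.config import cfg
  from oslo import messaging


  class LifecycleEndpoint(object):
      """Handles the nova lifecycle notifications we care about."""

      def info(self, ctxt, publisher_id, event_type, payload, metadata):
          if event_type in ('compute.instance.create.end',
                            'compute.instance.delete.end'):
              instance_id = payload.get('instance_id')
              # hand the instance id off to the recovery logic here


  transport = messaging.get_transport(cfg.CONF)
  targets = [messaging.Target(exchange='nova', topic='notifications')]
  listener = messaging.get_notification_listener(
      transport, targets, [LifecycleEndpoint()])
  listener.start()
  listener.wait()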
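
And for point 3, what we would like to do on the Alarm path is
essentially the following; whether a JSON body survives the v2-signed
pre-signed URL is exactly the open question:

  import json
  import requests

  # alarm_url is the pre-signed URL generated by Heat for the policy.
  payload = {'offline': ['<server-uuid>'], 'online': []}
  requests.post(alarm_url,
                data=json.dumps(payload),
                headers={'Content-Type': 'application/json'})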

Some help and guidance would be highly appreciated.

> > 
> > > OpenStack's OS::Heat::AutoScalingGroup has a more general member type, 
> > > what is the thinking about what failure detection means (and how it would 
> > > be accomplished, communicated)?
> > 
> > When most OpenStack services are making use of oslo.notify, in theory, a
> > service should be able to send/receive events related to resource
> > status.  In our current prototype, at least host failure (detected in
> > Nova and reported with a patch), VM failure (detected by nova), and some
> > lifecycle events of other resources can be detected and then collected
> > by Ceilometer.  There is certainly a possibility to listen to the
> > message queue directly from Heat, but we only implemented the Ceilometer
> > centric approach.
> 
> It has been pointed out a few times that in large deployments, different
> services may not share the same message bus.  So while *an* option could be
> heat listening to the message bus, I'd prefer that we maintain the alarm
> notifications via the ReST API as the primary signalling mechanism.

Agreed.  IIRC, somewhere in the Ceilometer documentation it is
suggested to use different queues for different purposes.  No objection
to keeping alarms as the primary mechanism until we have a compelling
reason to change.

> > > 
> > > I have not found design discussion of this; have I missed something?
> > > 
> > > I suppose the natural answer for OpenStack would be centered around 
> > > webhooks.  An OpenStack scaling group (OS SG = OS::Heat::AutoScalingGroup 
> > > or AWS::AutoScaling::AutoScalingGroup or OS::Heat::ResourceGroup or 
> > > OS::Heat::InstanceGroup) could generate a webhook per member, with the 
> > > meaning of the webhook being that the member has been detected as dead and 
> > > should be deleted and removed from the group --- and a replacement member 
> > > created if needed to respect the group's minimum size.  
> > 
> > Well, I would suggest we generalize this into an event messaging or
> > signaling solution, instead of just 'webhooks'.  The reason is that
> > webhooks as implemented today do not carry a payload of useful
> > information -- I'm referring to the alarms in Ceilometer.
> 
> The resource signal interface used by ceilometer can carry whatever data
> you like, so the existing solution works fine, we don't need a new one IMO.
> 
> For example look at this patch which converts WaitConditions to use the
> resource_signal interface:
> 
> https://review.openstack.org/#/c/101351/2/heat/engine/resources/wait_condition.py
> 
> We pass the data to the WaitCondition via a resource signal, the exact same
> transport that is used for alarm notifications from ceilometer.

I can understand the Heat side processing of the signal payload.  On
the triggering side, however, I haven't seen an example showing how to
encode an additional payload into the 'alarmUrl' string.  The resource
signal ReST call uses a different URI, right?
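
For my own notes, the native call I believe you are referring to looks
roughly like this (uppercase names are placeholders):

  import json
  import requests

  # Native resource-signal call, as opposed to the pre-signed alarmUrl;
  # note the different URI and the need for a real token.
  url = (HEAT_ENDPOINT + '/stacks/' + STACK_NAME + '/' + STACK_ID +
         '/resources/' + RESOURCE_NAME + '/signal')
  requests.post(url,
                data=json.dumps({'offline': ['<server-uuid>']}),
                headers={'X-Auth-Token': TOKEN,
                         'Content-Type': 'application/json'})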

> 
> Note the "webhook" thing really just means a pre-signed request, which
> using the v2 AWS style signed requests (currently the only option for heat
> pre-signed requests) does not sign the request body.
> 
> This is a security disadvantage (addressed by the v3 AWS signing scheme),
> but it does mean you can pass data via the pre-signed URL.

The Ceilometer side implementation has 'reason' and 'reason_data'
hardcoded inside the alarm subsystem.  I don't know whether it is
appropriate to change that.
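
For reference, the ReST notifier builds its body from a fixed set of
fields, roughly as below (based on my reading of
ceilometer/alarm/notifier/rest.py; the exact field set may differ
between releases):

  # The notifier posts this body to the alarm action URL.
  body = {'alarm_id': alarm_id,
          'previous': previous,        # e.g. 'ok'
          'current': current,          # e.g. 'alarm'
          'reason': reason,            # human-readable string
          'reason_data': reason_data}  # dict describing the evaluation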

> An alternative to pre-signed URLs is simply to make an authenticated call
> to the native ReST API, but then whatever is signalling requires either
> credentials, a token, or a trust to impersonate the stack owner. Again, you
> can pass whatever data you want via this interface.

I tried this alternative as well, but stumbled over the authentication
problem.  If I am not using an Alarm, I don't yet know how to make an
authenticated call.
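
What I tried looks roughly like the following.  Getting a token with my
own credentials works in a test, but a detector running as a service
would presumably need a trust or service credentials, which is exactly
the part I am missing (uppercase names are placeholders):

  from keystoneclient.v2_0 import client as ksclient

  ks = ksclient.Client(username=USER, password=PASSWORD,
                       tenant_name=TENANT, auth_url=AUTH_URL)
  token = ks.auth_token  # then sent as X-Auth-Token on the signal call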
 
> > There are other cases as well.  A member failure could be caused by a 
> > temporary communication problem, which means it may show up quickly when
> > a replacement member is already being created.  It may mean that we have
> > to respond to an 'online' event in addition to an 'offline' event?
> > 
> > > When the member is 
> > > a Compute instance and Ceilometer exists, the OS SG could define a 
> > > Ceilometer alarm for each member (by including these alarms in the 
> > > template generated for the nested stack that is the SG), programmed to hit 
> > > the member's deletion webhook when death is detected (I imagine there are 
> > > a few ways to write a Ceilometer condition that detects instance death). 
> > 
> > Yes.  Compute instance failure can be detected with a Ceilometer plugin.
> > In our prototype, we developed a Dispatcher plugin that can handle
> > events like 'compute.instance.delete.end', 'compute.instance.create.end'
> > after they have been processed based on a event_definitions.yaml file.
> > There could be other ways, I think.
> 
> Are you aware of the "Existence of instance" meter in ceilometer?
> 
> http://docs.openstack.org/developer/ceilometer/measurements.html
> 
> I noticed that recently and wondered if it provides an initial metric we
> could use to set an alarm so we're notified if an instance in an
> autoscaling group is deleted out of band and no longer exists?
> 

Yes.  It looks like a periodic check over instances, generating
'compute.instance.exists' events.  We chose the events
'compute.instance.create.end' and 'compute.instance.delete.end' because
we don't want to be bothered by so many "I'm alive" messages.  The rule
was: "be quiet when everything is okay; speak up when something is
wrong, so we can hear you."
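
The event filtering in our prototype dispatcher is as simple as the
following (class and helper names are ours, and the base class API is
from our reading of the Ceilometer tree, so treat this as a sketch):

  from ceilometer import dispatcher

  INTERESTING = ('compute.instance.create.end',
                 'compute.instance.delete.end')


  class HAEventDispatcher(dispatcher.Base):
      """Prototype dispatcher forwarding only failure-related events."""

      def record_events(self, events):
          for ev in events:
              if ev.event_type in INTERESTING:
                  self._signal_heat(ev)  # prototype helper, not upstream

      def record_metering_data(self, data):
          pass  # samples are not interesting to us here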

> > The problem here today is about the recovery of SG member.  If it is a
> > compute instance, we can 'reboot', 'rebuild', 'evacuate', 'migrate' it,
> > just to name a few options.  The most brutal way to do this is like what
> > HARestarter is doing today -- delete followed by a create.
> 
> Well it's also the same as you would do in a scaling group - if a metric
> showed absence or lack of health for an instance, you could just delete it
> and build a replacement.
> 
> This is why I think HARestarter should be deprecated in favour of just
> using AutoScalingGroups combined with appropriate alarms.

Last week on IRC I consulted Zane about adding a 'restart()' method to
Resource which defaults to 'delete + create'.  We could then remove the
'restart_resource()' method from Stack and leave subclasses of Resource
the option of overriding 'restart()' with a proper restart operation.
Zane believes this is doable, though it will eventually be handled by
convergence.
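
The shape of what I proposed is roughly this (purely a sketch of the
idea, not existing Heat code):

  class Resource(object):
      # ... existing heat.engine.resource.Resource ...

      def restart(self):
          """Default recovery: brutal delete + create, like HARestarter.

          A subclass (e.g. a nova server) could override this with a
          gentler operation such as reboot, rebuild or evacuate.
          """
          self.delete()
          self.create()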

> > > When the member is a nested stack and Ceilometer exists, it could be the 
> > > member stack's responsibility to include a Ceilometer alarm that detects 
> > > the member stack's death and hit the member stack's deletion webhook. 
> > 
> > This is difficult.  A '(nested) stack' is a Heat specific abstraction --
> > recall that we have to annotate a nova server resource in its metadata
> > to which stack this server belongs.  Besides the 'visible' resources
> > specified in a template, Heat may create internal data structures and/or
> > resources (e.g. users) for a stack.  I am not quite sure a stack's death
> > can be easily detected from outside Heat.  It would be at least
> > cumbersome to have Heat notify Ceilometer that a stack is dead, and then
> > have Ceilometer send back a signal.
> > 
> > > There is a small matter of how the author of the template used to create 
> > > the member stack writes some template snippet that creates a Ceilometer 
> > > alarm that is specific to a member stack that does not exist yet.  
> > 
> > How about just one signal responder per ScalingGroup?  A SG is supposed
> > to be in a better position to make the judgement: do I have to recreate
> > a failed member? am I recreating it right now or wait a few seconds?
> > maybe I should recreate the member on some specific AZs?
> 
> This is what we have already - you have one ScalingPolicy (which is a
> SignalResponder), and the ScalingPolicy is the place where you make the
> decision about what to do with the data provided from the alarm.
> 
> What we're currently missing is a way to pass data in when doing the scale
> up/down of the group so the ScalingPolicy could trigger replacement of a
> failed instance instead of just building a new one (we'd pass the id of the
> failed instance in as a hint, then we'd build a new one and remove the
> failed one).

Can we generalize this into an 'HAPolicy' instead of a 'ScalingPolicy'?
Any of the group types Mike mentioned may want to maintain its current
number of members, except when scaling out/in is requested.

> > If there is only one signal responder per SG, then the 'webhook' (or
> > resource signal, my preference) need to carry a payload indicating when
> > and which member is down. 
> 
> "webhooks" and resource signals are the same thing, it's just the auth
> method which differs (and which API you hit), inside the engine/resource
> implementation they are exactly the same.

Understood.  My question was based on my limited understanding of
Ceilometer alarms, where I could not find a way to inject anything into
the data sent via the alarm.  That doesn't mean it cannot be changed.
So I will revise my definition of 'webhooks', which to me meant a fixed
URL carrying a fixed data payload or none at all.

> resource signals can already carry a payload, so it's just a case of
> getting ceilometer to provide the appropriate data when sending the alarm
> signal, and adjusting the ScalingPolicy to use it appropriately.

Agreed.

> > > I suppose we could stipulate that if the member template includes a 
> > > parameter with name "member_name" and type "string" then the OS SG takes 
> > > care of supplying the correct value of that parameter; as illustrated in 
> > > the asg_of_stacks.yaml of https://review.openstack.org/#/c/97366/ , a 
> > > member template can use a template parameter to tag Ceilometer data for 
> > > querying.  The URL of the member stack's deletion webhook could be passed 
> > > to the member template via the same sort of convention.  
> > 
> > I am not in favor of the per-member webhook design.  But I vote for an
> > additional *implicit* parameter to a nested stack of any groups.  It
> > could be an index or a name.
> 
> I agree, we just need appropriate metadata in ceilometer, which can then be
> passed back to heat via the resource signal when the alarm happens.
> 
> > > When Ceilometer 
> > > does not exist, it is less obvious to me what could usefully be done.  Are 
> > > there any useful SG member types besides Compute instances and nested 
> > > stacks?  Note that a nested stack could also pass its member deletion 
> > > webhook to a load balancer (that is willing to accept such a thing, of 
> > > course), so we get a lot of unity of mechanism between the case of 
> > > detection by infrastructure vs. application level detection.
> > > 
> > 
> > I'm a little bit concerned about passing the member deletion webhook to
> > LB.  Maybe we need to rethink about this: do we really want to bring
> > application level design considerations down to the infrastructure level?
> > 
> > Some of the detection work might be covered by the observer engine spec
> > that is under review.  My doubt about it is how to make it "listen
> > only to what it needs to know while ignoring everything else".
> > 
> > > I am not entirely happy with the idea of a webhook per member.  If I 
> > > understand correctly, generating webhooks is a somewhat expensive and 
> > > problematic process.  What would be the alternative?
> > 
> > My understanding is that the webhooks' problem is not about cost, it is
> > more about authentication and flexibility.  Steve Hardy and Thomas Herve
> > are already looking into the authentication problem.
> 
> Well every SignalResponder resource creates a user in keystone, so not
> "expensive" as such, but it makes sense IMO to stick to the current model,
> where filtering of things happens in ceilometer, then we get an alarm
> containing data sent to the scaling policy resource.  Having every group
> member be a signal responder definitely does not make sense to me.
> 
> The first step is identifying what data ceilometer needs to send us, and
> the second step is getting the (native) scaling policy resource to use it.
> The current transport and signalling topology should be sufficient AFAICS.

In our current prototype (not working yet), we want to send the
following JSON back to a Resource Group in Heat as the payload for a 
signal/webhook:

 {
   "event": "ha_event",
   "sender": "ceilometer",
   "reason": "VM failure",   # could also be "Host failure"
   "timestamp": "2014-07-03 11:22:33.5678",
   "offline": ["1234-4567-6789-7890abcdef-cdef"],
   "online": []
 }
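
On the Heat side, the group's signal handler would then consume the
payload roughly like this (a sketch against the handle_signal()
interface as we understand it; '_replace_member' is a hypothetical
helper of our prototype):

  def handle_signal(self, details=None):
      # 'details' would be the JSON payload shown above.
      if details and details.get('event') == 'ha_event':
          for server_id in details.get('offline', []):
              # Replace the failed member and let the group grow
              # back to its desired size.
              self._replace_member(server_id)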

> Steve
> 



