[openstack-dev] [Heat][Ceilometer] A proposal to enhance ceilometer alarm

Eoghan Glynn eglynn at redhat.com
Mon Jul 7 06:13:57 UTC 2014



> In the current Alarm implementation, Ceilometer sends Heat an
> 'alarm' using the pre-signed URL (or another channel under development).

By the other channel, do you mean the trusts-based interaction?

We discussed this at the mid-cycle in Paris last week, and it turns out
there appear to be a few restrictions on trusts that limit the usability
of this keystone feature, specifically:

 * no support for cross-domain delegation of privilege (important as
   the frontend stack user and the ceilometer service user are often
   in different domains) 

 * no support for creating a trust based on username+domain as opposed
   to user UUID (the former may be predictable at the time of config
   file generation, whereas the latter is less likely to be so)

 * no support for cascading delegation (i.e. no creation of trusts from
   trusts)

If these shortcomings are confirmed by the domain experts on the keystone
team, we're not likely to invest further time in trusts until some of these
issues are addressed on the keystone side.

> The alarm carries a payload that looks like:
> 
>  {
>    alarm_id: ID
>    previous: ok
>    current: alarm
>    reason: transition to alarm due to n samples outside threshold,
>            most recent: ....
>    reason_data: {
>      type: threshold
>      disposition: inside
>      count: x
>      most_recent: value
>    }
>  }
> 
> While this data structure is useful for some simple use cases, it can be
> enhanced to carry more useful data.  Some usage scenarios are:
> 
>  - When a member of an AutoScalingGroup is dead (e.g. accidentally
>    deleted), Ceilometer can detect this from an event with
>    count='instance', event_type='compute.instance.delete.end'.  If an
>    alarm is created out of this event, the AutoScalingGroup may have a
>    chance to recover the member when appropriate.  The requirement is
>    for this Alarm to tell Heat which instance is dead.

Alarms in ceilometer may currently only be based on a statistics trend
crossing a threshold, and not on the occurrence of an event such as
compute.instance.delete.end.

Near the end of the Icehouse cycle, there was an attempt to implement
this style of notification-based alarming but the feature did not land.

Another option would be for Heat itself to consume notifications and/or
periodically check the integrity of the autoscaling group via nova-api,
to ensure no members have been inadvertently deleted.
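
As a rough sketch of that periodic integrity check (not Heat code; the
function names and the injected list_active_ids callable are hypothetical
stand-ins for whatever would actually query nova-api), the core of it is
just a set difference between expected and observed members:

```python
from typing import Callable, Iterable, Set


def find_missing_members(expected_ids: Iterable[str],
                         list_active_ids: Callable[[], Iterable[str]]) -> Set[str]:
    """Return the expected group members that are no longer active.

    list_active_ids is a hypothetical callable standing in for a query
    against the compute API; it is injected so the check stays testable.
    """
    return set(expected_ids) - set(list_active_ids())


def fake_list_active_ids():
    # Illustrative data only: pretend "vm-2" was deleted out-of-band.
    return ["vm-1", "vm-3"]


missing = find_missing_members(["vm-1", "vm-2", "vm-3"], fake_list_active_ids)
print(sorted(missing))
```

Anything reported as missing would then be a candidate for replacement by
the group, without the alarm needing to name the dead instance.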

This actually smells a little like some of the requirements driving the
notion of "convergence" in Heat:

  https://review.openstack.org/#/c/95907/6/specs/convergence.rst

TL;DR: make reality the source of truth in Heat, as opposed to the
       approximation of reality expressed in the template

>  - When a VM connected to multiple subnets is experiencing a bandwidth
>    problem, an alarm can be generated telling Heat which subnet is to be
>    checked.

Would such a bandwidth issue be suitable for auto-remediation by the
*auto*scaling logic?

Or would it require manual intervention?
 
> We believe there will be many other use cases expecting an alarm to
> carry some 'useful' information beyond just a state transition. Below is
> a proposal to solve this.  Any comments are welcomed.
> 
> 1. extend the alarm with an optional parameter, say, 'output', which is
>    a map or an equivalent representation.  A user can specify some
>    key=value pairs using this parameter, where 'key' is a convenience
>    for user and value is used to specify a field from a Sample whose
>    value will be filled  in here.
> 
>    e.g. --output instance=metadata.instance_id;timestamp=timestamp

While such additional context may be useful, I'm not sure your examples
would apply in general because:

 * there wouldn't be a *single* distinguished instance ID that caused
   the alarm statistic to go over-threshold (as the cpu_util or whatever
   metric is aggregated across the entire autoscaling group in the alarm
   evaluation)

 * there wouldn't be a discrete timestamp when the statistic crossed the
   alarm threshold due to periodization and sampling effects
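
To illustrate the first point, here is a simplified sketch of a
group-wide threshold evaluation. It is not the ceilometer evaluator; the
function name, the data layout and the numbers are made up for
illustration. The alarm state depends only on the aggregate, and more
than one instance can be individually over the threshold:

```python
from statistics import mean


def evaluate_group_alarm(samples, threshold):
    """Evaluate one period of cpu_util readings for a whole group.

    samples maps instance ID -> list of readings in the period
    (an assumed layout, chosen for illustration).
    """
    all_values = [v for values in samples.values() for v in values]
    group_avg = mean(all_values)
    state = "alarm" if group_avg > threshold else "ok"
    # Instances whose own average exceeds the threshold; there may be
    # zero, one or several of these regardless of the group state.
    over = sorted(iid for iid, values in samples.items()
                  if mean(values) > threshold)
    return state, group_avg, over


period = {
    "vm-1": [90.0, 85.0],
    "vm-2": [80.0, 75.0],
    "vm-3": [40.0, 50.0],
}
state, group_avg, over = evaluate_group_alarm(period, threshold=65.0)
print(state, group_avg, over)
```

Here the group average is 70.0, so the alarm fires, yet both vm-1 and
vm-2 are over the threshold and neither is "the" instance responsible.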

> 2. extend the Ceilometer alarm-evaluator service, so that when an alarm
>    is seen requiring output values, it will try matching the 'value'
>    specified above to the fields in a sample, and replace the output
>    entry with 'key=<real_value>'.
> 
>    e.g. "output": {
>           "instance": "bd56bb53-d07f-49a6-8f60-6f8ef1336060",
>           "timestamp": "2014-07-01T02:21:13.002155"
>         }
> 
>    The above data is passed back to the alarm_url as part of its
>    existing payload.
> 
>    If alarm-evaluator cannot find a matching field, it can fill in an
>    empty string, or just "None".
> 
> 3. extend the OS::Ceilometer::Alarm resource type in Heat so that an
>    optional property (say, 'output') of type map can be used to specify
>    what is expected from the Alarm.

And would the logic to consume such additional context be baked into
the heat.engine.resources.autoscaling module?

Or would that be pluggable somehow?
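
For reference, my reading of the substitution proposed in steps 1 and 2
could be sketched roughly as follows. This is only an illustration under
assumptions: the helper names are hypothetical, the sample is modelled as
a plain nested dict, and the 'key=path;key=path' syntax is taken from the
--output example quoted earlier rather than from any existing ceilometer
option.

```python
def parse_output_spec(spec):
    """Parse 'key=path;key=path' into a {key: path} dict."""
    pairs = (item.split("=", 1) for item in spec.split(";") if item)
    return {key.strip(): path.strip() for key, path in pairs}


def resolve_field(sample, path):
    """Follow a dotted path such as 'metadata.instance_id' through
    nested dicts, returning "" when no matching field exists (one of
    the two fallbacks suggested in the proposal)."""
    value = sample
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return ""
        value = value[part]
    return value


def build_output(spec, sample):
    """Replace each requested path with the value found in the sample."""
    return {key: resolve_field(sample, path)
            for key, path in parse_output_spec(spec).items()}


# Illustrative sample; the field layout is an assumption.
sample = {
    "timestamp": "2014-07-01T02:21:13.002155",
    "metadata": {"instance_id": "bd56bb53-d07f-49a6-8f60-6f8ef1336060"},
}
output = build_output("instance=metadata.instance_id;timestamp=timestamp",
                      sample)
print(output)
```

Note that this presumes a single sample to resolve against, which is
exactly the part that is unclear for alarms on aggregated statistics.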

> Since it is an additional field in the 'details' argument, the impact on
> existing Heat templates/users will be negligible.  However, the
> expressive power of carrying back additional fields would be a great
> help in some scenarios we have yet to discover.
> 
> Because this is a cross-project proposal, comments from both communities
> are valuable and thus appreciated.  If it is a viable approach, should
> we raise a spec in each project?

I'm unconvinced as yet as to the viability, on the basis of my comments
above, though I'll keep an open mind with regard to your responses.

Thanks,
Eoghan


