[openstack-dev] [Heat][Ceilometer] A proposal to enhance ceilometer alarm

Eoghan Glynn eglynn at redhat.com
Mon Jul 7 07:46:19 UTC 2014



> > Alarms in ceilometer may currently only be based on a statistics trend
> > crossing a threshold, and not on the occurrence of an event such as
> > compute.instance.delete.end.
> 
> Right.  I realized this after spending some more time understanding the
> alarm-evaluator code.  Having the 'Statistics' model record (even the
> last sample of) a field would be cumbersome.

Yep.
 
> > Near the end of the Icehouse cycle, there was an attempt to implement
> > this style of notification-based alarming but the feature did not land.
> 
> After realizing 'Statistics' is not the ideal place for extension, I
> took a step back and asked myself: "what am I really trying to get from
> Ceilometer?" The answer seems to be an Alarm or Event, with some
> informational fields telling me some context of such an Alarm or Event.
> So I am now thinking of an EventAlarm in addition to ThresholdAlarm and
> CombinationAlarm.  The existing alarms are all based on meter samples.
> Such an event-based alarm would be very helpful for implementing features
> like keeping members of an AutoScalingGroup (or other Resource Group)
> alive.

So as I mentioned, we did have an attempt to provide notification-based
alarming at the end of Icehouse:

  https://review.openstack.org/69473

but that did not land.

It might be feasible to resurrect this, based on the fact that the events
API will shortly be available right across the range of ceilometer v2
storage drivers (i.e. not just for sqlalchemy).

However this is not currently a priority item on our roadmap (though
as always, patches are welcome).

Note though that the Heat-side logic to consume the event-alarm triggered
by a compute.instance.delete event wouldn't be trivial, as Heat would have
to start remembering which instances it had *itself* deleted as part of
the normal growth and shrinkage pattern of an autoscaling group (so that
it can distinguish an intended instance deletion from an accidental
deletion).

I'm open to correction, but AFAIK Heat does not currently record such
state.
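
To make that bookkeeping concrete, here is a minimal sketch of the state
Heat would need to keep. It assumes a hypothetical in-memory tracker; the
class and method names are purely illustrative and are not existing Heat
code:

```python
class GroupDeletionTracker:
    """Hypothetical bookkeeping for one autoscaling group (not Heat code)."""

    def __init__(self, member_ids):
        self.members = set(member_ids)
        # Instances the group itself chose to delete, whose delete
        # notification has not been seen yet.
        self.pending_deletes = set()

    def scale_down(self, instance_id):
        """Record an intended deletion before asking nova to delete."""
        self.members.discard(instance_id)
        self.pending_deletes.add(instance_id)

    def on_delete_event(self, instance_id):
        """Classify a delete notification relative to this group."""
        if instance_id in self.pending_deletes:
            self.pending_deletes.remove(instance_id)
            return "intended"
        if instance_id in self.members:
            self.members.remove(instance_id)
            return "accidental"
        return "unrelated"


tracker = GroupDeletionTracker(["vm-1", "vm-2", "vm-3"])
tracker.scale_down("vm-3")
results = [tracker.on_delete_event(i) for i in ("vm-3", "vm-1", "vm-9")]
```

Only the "accidental" case would call for a replacement member; a real
implementation would also have to persist this state across engine
restarts, which the sketch ignores.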
 
> > Another option would be for Heat itself to consume notifications and/or
> > periodically check the integrity of the autoscaling group via nova-api,
> > to ensure no members have been inadvertently deleted.
> 
> Yes. That has been considered by the Heat team as well.  The only
> concern is that subscribing directly to notifications and then
> filtering them sounds like duplicating work already done in Ceilometer.
> From the use case of convergence, you can guess that this is actually
> not limited to the auto-scaling scenario.

Sure, but does convergence sound like it's *relevant* to the autoscaling
case?
 
> > This actually smells a little like some of the requirements driving the
> > notion of "convergence" in Heat:
> > 
> >   https://review.openstack.org/#/c/95907/6/specs/convergence.rst
> > 
> > TL;DR: make reality the source of truth in Heat, as opposed to the
> >        approximation of reality expressed in the template
> > 
> > >  - When a VM connected to multiple subnets is experiencing bandwidth
> > >    problems, an alarm can be generated telling Heat which subnet is to be
> > >    checked.
> > 
> > Would such a bandwidth issue be suitable for auto-remediation by the
> > *auto*scaling logic?
> > 
> > Or would it require manual intervention?
> 
> As I have noted above, getting notified of physical resource state
> changes and then reacting properly is THE requirement.  It is beyond
> what auto-scaling does today.  There are cases where manual intervention
> is needed, while there are other cases that Heat can handle given
> sufficient information.

Can you provide some examples of those latter cases?

(so as to ground this discussion solidly in the here-and-now)
 
> > > We believe there will be many other use cases expecting an alarm to
> > > carry some 'useful' information beyond just a state transition. Below is
> > > a proposal to solve this.  Any comments are welcomed.
> > > 
> > > 1. extend the alarm with an optional parameter, say, 'output', which is
> > >    a map or an equivalent representation.  A user can specify some
> > >    key=value pairs using this parameter, where 'key' is a convenience
> > >    for the user and 'value' specifies a field from a Sample whose
> > >    value will be filled in here.
> > > 
> > >    e.g. --output instance=metadata.instance_id;timestamp=timestamp
> > 
> > While such additional context may be useful, I'm not sure your examples
> > would apply in general because:
> > 
> >  * there wouldn't be a *single* distinguished instance ID that caused
> >    the alarm statistic to go over-threshold (as the cpu_util or whatever
> >    metric is aggregated across the entire autoscaling group in the alarm
> >    evaluation)
> > 
> >  * there wouldn't be a discrete timestamp when the statistic crossed the
> >    alarm threshold due to periodization and sampling effects
> 
> Okay.  I admit that if the alarm is evaluated based on Statistics, these
> are all true concerns.  I didn't quite realize that before.  What would
> you think of Ceilometer providing an EventAlarm, then?  If an alarm is
> generated from an Event, then the above context could be extracted, at
> least by tweaking event_definitions.yaml?

Possibly, yes.

I'd imagine that such a feature would include the ability to request
that certain event fields ("traits") are included in the alarm reason.
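
As a rough illustration of what I have in mind (a sketch only, not an
existing Ceilometer API), an event alarm could name the event type it
fires on plus the traits it wants echoed back. Here the alarm and event
are plain dicts, traits are simplified to a name-to-value mapping, and
the "reason_traits" key and the function name are hypothetical:

```python
def build_alarm_reason(alarm, event):
    """Return a reason dict if the event fires the alarm, else None."""
    if event["event_type"] != alarm["event_type"]:
        return None
    traits = event.get("traits", {})
    # Copy only the requested traits; a trait the event lacks becomes None.
    context = {name: traits.get(name) for name in alarm["reason_traits"]}
    return {
        "alarm": alarm["name"],
        "event_type": event["event_type"],
        "traits": context,
    }


alarm = {
    "name": "member-deleted",
    "event_type": "compute.instance.delete.end",
    "reason_traits": ["instance_id", "deleted_at"],
}
event = {
    "event_type": "compute.instance.delete.end",
    "traits": {
        "instance_id": "bd56bb53-d07f-49a6-8f60-6f8ef1336060",
        "tenant_id": "demo",
    },
}
reason = build_alarm_reason(alarm, event)
```

Because each event refers to one resource at one point in time, the
instance ID is unambiguous here in a way that it is not for an
aggregated statistic.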

Cheers,
Eoghan
 
> > > 2. extend the Ceilometer alarm-evaluator service, so that when an alarm
> > >    is seen requiring output values, it will try matching the 'value'
> > >    specified above to the fields in a sample, and replace the output
> > >    entry with 'key=<real_value>'.
> > > 
> > >    e.g. "output": {
> > >           "instance": "bd56bb53-d07f-49a6-8f60-6f8ef1336060",
> > >           "timestamp": "2014-07-01T02:21:13.002155"
> > >         }
> > > 
> > >    The above data is passed back to the alarm_url as part of its
> > >    existing payload.
> > > 
> > >    If alarm-evaluator cannot find a matching field, it can fill in an
> > >    empty string, or just "None".
> > > 
> > > 3. extend the OS::Ceilometer::Alarm resource type in Heat so that an
> > >    optional property (say, 'output') of type map can be used to specify
> > >    what is expected from the Alarm.
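
For reference, the substitution described in steps 1 and 2 above could be
sketched roughly as follows. This assumes the sample is available as a
nested dict and that dotted names such as metadata.instance_id walk into
it; both helper names are hypothetical and nothing like this exists in
the alarm-evaluator today:

```python
def parse_output_option(text):
    """Parse 'key1=path1;key2=path2' into a dict of key -> field path."""
    return dict(item.split("=", 1) for item in text.split(";") if item)


def resolve_output(output_spec, sample):
    """Replace each field path with the value found in the sample."""
    resolved = {}
    for key, path in output_spec.items():
        value = sample
        for part in path.split("."):
            if isinstance(value, dict) and part in value:
                value = value[part]
            else:
                # No matching field: fall back to None, as proposed.
                value = None
                break
        resolved[key] = value
    return resolved


spec = parse_output_option(
    "instance=metadata.instance_id;timestamp=timestamp;zone=metadata.zone")
sample = {
    "timestamp": "2014-07-01T02:21:13.002155",
    "metadata": {"instance_id": "bd56bb53-d07f-49a6-8f60-6f8ef1336060"},
}
output = resolve_output(spec, sample)
```

The "zone" entry is included only to show the missing-field fallback.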
> > 
> > And would the logic to consume such additional context be baked into
> > the heat.engine.resources.autoscaling module?
> > 
> > Or would that be pluggable somehow?
> 
> We have an interest in improving the AutoScalingGroup resource so that
> member failures can be detected and handled properly.  This may warrant
> a spec in the Heat project.
> 
> > > Since it is an additional field in the 'details' argument, the impact on
> > > existing Heat templates/users will be negligible.  However, the
> > > expressive power of carrying back additional fields would be a great
> > > help in scenarios we have yet to identify.
> > > 
> > > Because this is a cross-project proposal, comments from both communities
> > > are valuable and thus appreciated.  If it is a viable approach, should
> > > we raise a spec in each project?
> > 
> > I'm unconvinced as yet as to the viability, on the basis of my comments
> > above. Though I'll keep an open mind with regard to your responses.
> > 
> > Thanks,
> > Eoghan
> > 
> > _______________________________________________
> > OpenStack-dev mailing list
> > OpenStack-dev at lists.openstack.org
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> > 