[openstack-dev] [Heat][Ceilometer] A proposal to enhance ceilometer alarm

Steven Hardy shardy at redhat.com
Mon Jul 7 13:16:27 UTC 2014


On Mon, Jul 07, 2014 at 03:46:19AM -0400, Eoghan Glynn wrote:
> 
> 
> > > Alarms in ceilometer may currently only be based on a statistics trend
> > > crossing a threshold, and not on the occurrence of an event such as
> > > compute.instance.delete.end.
> > 
> > Right.  I realized this after spending some more time understanding the
> > alarm-evaluator code.  Having the 'Statistics' model record (even just
> > the last sample of) a field would be cumbersome.
> 
> Yep.
>  
> > > Near the end of the Icehouse cycle, there was an attempt to implement
> > > this style of notification-based alarming but the feature did not land.
> > 
> > After realizing 'Statistics' is not the ideal place for extension, I
> > took a step back and asked myself: "what am I really trying to get from
> > Ceilometer?" The answer seems to be an Alarm or Event, with some
> > informational fields telling me some context of such an Alarm or Event.
> > So I am now thinking of an EventAlarm in addition to ThresholdAlarm and
> > CombinationAlarm.  The existing alarms are all based on meter samples.
> > Such an event-based alarm would be very helpful for implementing features
> > like keeping members of an AutoScalingGroup (or other Resource Group)
> > alive.
> 
> So as I mentioned, we did have an attempt to provide notification-based
> alarming at the end of Icehouse:
> 
>   https://review.openstack.org/69473
> 
> but that did not land.
> 
> It might be feasible to resurrect this, based on the fact that the events
> API will shortly be available right across the range of ceilometer v2
> storage drivers (i.e. not just for sqlalchemy).
> 
> However this is not currently a priority item on our roadmap (though
> as always, patches are welcome).
> 
> Note though that the Heat-side logic to consume the event-alarm triggered
> by a compute.instance.delete event wouldn't be trivial, as Heat would have
> to start remembering which instances it had *itself* deleted as part of
> the normal growth and shrinkage pattern of an autoscaling group

I'm not sure I understand this.  Heat maintains a nested template (with
associated resource information persisted in the DB) for autoscaling
groups, so if an instance is recorded in that template, it should exist
in reality.

If we get an alarm, or observe via convergence polling, that the instance
no longer exists, we can detect that there is a mismatch between the stored
state (the template) and the real state (thing got deleted out of band).
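
A minimal sketch of that comparison, assuming we can get both ID lists as
plain collections (the function name and arguments are hypothetical, not
existing Heat code):

```python
def find_missing_members(stored_ids, live_ids):
    """Return instance IDs recorded in the stored state (the template)
    that are no longer present in the real state."""
    # Anything we believe we own but cannot observe was deleted out of band.
    return sorted(set(stored_ids) - set(live_ids))
```

Instances that exist in reality but not in the template are deliberately
left out of the result, since they are not members of the group.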

If you're saying we don't want to fight ourselves when an autoscaling
adjustment is in progress, then that's true. We probably just need to
ensure that this type of alarm is ignored for the duration of any
autoscaling adjustment.

Even if we were to queue the alarm signals (some folks want "stacked"
updates for autoscaling groups), when we process the signal (after the
deletion for scale-down has happened), we'd just ignore the alarm, as it's
for an instance ID we no longer have any knowledge of in the DB.
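
Roughly, the signal handling described above could look like the sketch
below.  This is only an illustration: the class, its attributes and the
return strings are all hypothetical, and none of it is Heat's actual
implementation.

```python
class ScalingGroup:
    """Hypothetical stand-in for the group state persisted in the DB."""

    def __init__(self, member_ids):
        self.member_ids = set(member_ids)
        self.adjusting = False  # True while a scaling adjustment is running

    def scale_down(self, instance_id):
        # Intended deletion: drop the member from the stored state.
        self.member_ids.discard(instance_id)

    def handle_delete_alarm(self, instance_id):
        """Decide what to do with an 'instance deleted' alarm signal."""
        if self.adjusting:
            # Mask signals for the duration of any adjustment.
            return "ignored: adjustment in progress"
        if instance_id not in self.member_ids:
            # We deleted it ourselves (or never owned it), nothing to repair.
            return "ignored: unknown instance"
        # Stored state says it should exist, so it was deleted out of band.
        return "replace"
```

With this shape, a queued signal for an instance removed by scale_down()
falls into the "unknown instance" branch once it is finally processed,
without needing a separate record of intended deletions.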

> (so that it can distinguish an intended instance deletion from an
> accidental deletion)
> 
> I'm open to correction, but AFAIK Heat does not currently record such
> state.

I may be misunderstanding, but as above, I *think* we have sufficient data
in the DB to do the right thing here, provided we mask the signals during
scaling group update/adjustment.

> > > Another option would be for Heat itself to consume notifications and/or
> > > periodically check the integrity of the autoscaling group via nova-api,
> > > to ensure no members have been inadvertently deleted.
> > 
> > Yes. That has been considered by the Heat team as well.  The only
> > concern is that directly subscribing to notifications and then
> > filtering them sounds like duplicating work already done in Ceilometer.
> > From the convergence use case, you can guess that this is actually not
> > limited to the auto-scaling scenario.
> 
> Sure, but does convergence sound like it's *relevant* to the autoscaling
> case?

In future, probably yes, but right now, I think there are opportunities to
make the current autoscaling model (driven by ceilometer) a bit smarter and
more flexible.

Personally I'd rather stick to a callback/notification model where
possible, rather than relying on moving to a poll-all-the-things model for
convergence, although obviously that may be one possible mode of operation.

Steve


