[openstack-dev] [Heat][Ceilometer] A proposal to enhance ceilometer alarm
Qiming Teng
tengqim at linux.vnet.ibm.com
Mon Jul 7 10:03:57 UTC 2014
On Mon, Jul 07, 2014 at 03:46:19AM -0400, Eoghan Glynn wrote:
> > > Near the end of the Icehouse cycle, there was an attempt to implement
> > > this style of notification-based alarming but the feature did not land.
> >
> > After realizing 'Statistics' is not the ideal place for extension, I
> > took a step back and asked myself: "what am I really trying to get from
> > Ceilometer?" The answer seems to be an Alarm or Event, with some
> > informational fields telling me some context of such an Alarm or Event.
> > So I am now thinking of an EventAlarm in addition to ThresholdAlarm and
> > CombinationAlarm. The existing alarms are all based on meter samples.
> > Such an event based alarm would be very helpful to implement features
> > like keeping members of an AutoScalingGroup (or other Resource Group)
> > alive.
>
> So as I mentioned, we did have an attempt to provide notification-based
> alarming at the end of Icehouse:
>
> https://review.openstack.org/69473
>
> but that did not land.
>
> It might be feasible to resurrect this, based on the fact that the events
> API will shortly be available right across the range of ceilometer v2
> storage drivers (i.e. not just for sqlalchemy).
Resurrecting this would be great. It is also good news that other DB
backends will be supported.
>
> However this is not currently a priority item on our roadmap (though
> as always, patches are welcome).
>
> Note though that the Heat-side logic to consume the event-alarm triggered
> by a compute.instance.delete event wouldn't be trivial, as Heat would have
> to start remembering which instances it had *itself* deleted as part of
> the normal growth and shrinkage pattern of an autoscaling group
>
> (so that it can distinguish an intended instance deletion from an accidental
> deletion)
>
> I'm open to correction, but AFAIK Heat does not currently record such
> state.
That is true. In the autoscaling case, some additional logic would need
to be added if health maintenance is desired. See this thread:
http://lists.openstack.org/pipermail/openstack-dev/2014-July/039110.html
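To make the bookkeeping concrete, here is a minimal sketch of the state Heat would need to remember. The class and method names are hypothetical and not part of Heat; it only illustrates telling an intended deletion from an accidental one.

```python
class DeletionTracker:
    """Hypothetical helper: remembers instances the group itself deleted."""

    def __init__(self):
        # IDs of instances whose deletion was requested by the group
        self._intended = set()

    def mark_intended(self, instance_id):
        """Record that the group is about to delete this instance."""
        self._intended.add(instance_id)

    def classify(self, instance_id):
        """Classify a delete event as 'intended' or 'accidental'."""
        if instance_id in self._intended:
            # Forget the ID once its delete event has been seen
            self._intended.discard(instance_id)
            return "intended"
        return "accidental"


tracker = DeletionTracker()
tracker.mark_intended("vm-1")      # scale-down initiated by the group
print(tracker.classify("vm-1"))    # deletion the group asked for
print(tracker.classify("vm-2"))    # deletion the group knows nothing about
```

Only the "accidental" case would need a replacement member to be created.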
> > > Another option would be for Heat itself to consume notifications and/or
> > > periodically check the integrity of the autoscaling group via nova-api,
> > > to ensure no members have been inadvertently deleted.
> >
> > Yes. That has been considered by the Heat team as well. The only
> > concern is that directly subscribing to notifications and then
> > filtering them sounds like duplicating work already done in Ceilometer.
> > From the use case of convergence, you can guess that this is actually
> > not limited to the auto-scaling scenario.
>
> Sure, but does convergence sound like it's *relevant* to the autoscaling
> case?
My understanding is that convergence has a much broader scope than just
autoscaling. The whole convergence proposal is a mixture of:
- Parallelizing stack operations so that it can scale;
- Making Heat aware of the states of physical resources;
- Enabling Heat to evolve a stack from its current state to its desired state;
- Making Heat aware of event notifications so that it can take appropriate actions.
At the macro level, convergence will make sure that the desired outcome
happens, while an autoscaling group is a micro-level thing where a lot of
details are not supposed to be escalated to the Convergence engine. By
details, I mean the specific metrics, thresholds, adjustments, placement,
and deletion policies.
*NOTE* that the above is only my personal understanding.
> > > Or would it require manual intervention?
> >
> > As I have noted above, getting notified of physical resource state
> > changes and then reacting properly is THE requirement. It is beyond
> > what auto-scaling does today. There are cases where manual intervention
> > is needed, while there are other cases where Heat can handle given
> > sufficient information.
>
> Can you provide some examples of those latter cases?
> (so as to ground this discussion solidly in the here-and-now)
One example use case is VM HA. We got requirements from our customers
to support VM failure detection and recovery, but they don't want us to
touch their VM images. We need a solution that can detect Nova Server
failures and recover the servers with configurable actions. Heat-side
support for this is nothing more than a ResourceGroup that can handle
some customizable policies. The tricky part for us was the failure
events.
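As a rough sketch of what "configurable actions" could mean here: a policy table maps a failure event type to a recovery action. Everything below is hypothetical. The action names and the "example.instance.error" event type are made up for illustration, and only "compute.instance.delete" comes from this thread.

```python
# Hypothetical policy: failure event type -> name of recovery action.
POLICY = {
    "compute.instance.delete": "recreate",
    "example.instance.error": "rebuild",   # made-up event type
}


def choose_action(event_type, policy, default="notify_operator"):
    """Pick the recovery action for an event, falling back to a default."""
    return policy.get(event_type, default)


print(choose_action("compute.instance.delete", POLICY))
print(choose_action("some.unknown.event", POLICY))
```

The fallback covers the cases where manual intervention is still needed.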
> > Okay. I admit that if the alarm is evaluated based on Statistics, these
> > are all true concerns. I didn't quite realize that before. What do you
> > think if Ceilometer provides an EventAlarm then? If Alarm is generated
> > from an Event, then the above context can be extracted, at least by
> > tweaking event_definitions.yaml?
>
> Possibly, yes.
>
> I'd imagine that such a feature would include the ability to request
> that certain event fields ("traits") are included in the alarm reason.
Yes. However, I am assuming that defining traits is a deployment task
rather than something that requires patching Ceilometer, right?
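To illustrate the idea of carrying selected traits into the alarm reason, here is a small sketch. It assumes an event is a dict whose "traits" entry is a list of name/value dicts; that layout, the function name and the sample values are assumptions for illustration, not a confirmed Ceilometer interface.

```python
def build_alarm_reason(event, wanted_traits):
    """Build a reason string from the requested traits of an event."""
    # Index the trait list by name for easy lookup
    by_name = {t["name"]: t["value"] for t in event.get("traits", [])}
    # Keep only the traits that were requested and are actually present
    parts = ["%s=%s" % (name, by_name[name])
             for name in wanted_traits if name in by_name]
    return "%s: %s" % (event["event_type"], ", ".join(parts))


sample_event = {
    "event_type": "compute.instance.delete",
    "traits": [
        {"name": "instance_id", "value": "vm-1"},
        {"name": "tenant_id", "value": "demo"},
    ],
}
print(build_alarm_reason(sample_event, ["instance_id"]))
```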
> Cheers,
> Eoghan
>