[openstack-dev] [Horizon] [UX] Design for Alarming and Alarm Management

Eoghan Glynn eglynn at redhat.com
Mon Jun 16 18:49:43 UTC 2014



> On Jun 16, 2014, at 10:56 AM, Eoghan Glynn <eglynn at redhat.com> wrote:
> 
> > 
> > Apologies for the top-posting, but just wanted to call out some
> > potential confusion that arose on the #os-ceilometer channel earlier
> > today.
> > 
> > TL;DR: the UI shouldn't assume a 1:1 mapping between alarms and
> >       resources, since this mapping does not exist in general
> 
> Thanks for the clarification on this, Eoghan. After reading the IRC chat and
> e-mail thread I’m now understanding that there are alarms that can be
> created for things like “Alarm me when a new instance is created” that have
> nothing to do with monitoring instances. Am I correct?

More something like:

 "Alarm me when the average CPU util throughout all instances in an
  autoscaling group suggests that the group is under-scaled"

In that case, the alarm may map onto zero resources initially, then
N actually-existing resources at any given point in time (where N
lies between some high and low water marks, but is not constant in
time).

That's an example of a 1:N mapping between alarm and resource names,
but where the set of N resource names is potentially constantly varying
(or apparently static, if the load on the autoscaling group is relatively
constant).
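To make the 1:N point concrete, here is a minimal Python sketch. It assumes a hypothetical in-memory model (`Resource`, `Alarm`, `matching_resources` and the `group` metadata key are illustrative names, not Ceilometer or Horizon APIs). The alarm stores a metadata selector rather than a list of resource IDs, so the set of matched resources is only resolved at evaluation time and can grow or shrink between evaluations.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    resource_id: str
    metadata: dict


@dataclass
class Alarm:
    # Hypothetical model: the alarm holds a selector, never a fixed resource ID.
    name: str
    selector: dict = field(default_factory=dict)


def matching_resources(alarm, resources):
    """Return the IDs of resources whose metadata matches every selector key."""
    return sorted(
        r.resource_id
        for r in resources
        if all(r.metadata.get(k) == v for k, v in alarm.selector.items())
    )


alarm = Alarm("group-under-scaled", {"group": "web-asg"})

# Before the group has booted anything, the alarm maps onto zero resources.
print(matching_resources(alarm, []))  # []

# Later the group has scaled out to two instances (plus one unrelated VM).
fleet = [
    Resource("i-1", {"group": "web-asg"}),
    Resource("i-2", {"group": "web-asg"}),
    Resource("i-3", {"group": "db"}),
]
print(matching_resources(alarm, fleet))  # ['i-1', 'i-2']
```

The UI implication is that "Resource" is a derived, time-varying view of an alarm, not a required field of it.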

> Are there other cases we should consider here? 

Another example would be:

 "Alarm me when the number of instances owned by a particular tenant
  exceeds some threshold"

(... actually, that would require an update to the alarm API to
 accommodate the new selectable cardinality aggregate, but would
 be easy to do) 
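As a rough sketch of what such a cardinality aggregate would compute: the samples below are hypothetical `(tenant_id, resource_id)` pairs, and the function names are illustrative only. This is not an existing Ceilometer statistic, which is why the alarm API would need extending.

```python
from collections import defaultdict


def instance_counts(samples):
    """Count distinct instance IDs per tenant (a 'cardinality' aggregate)."""
    seen = defaultdict(set)
    for tenant_id, resource_id in samples:
        seen[tenant_id].add(resource_id)
    return {tenant: len(ids) for tenant, ids in seen.items()}


def tenants_over_threshold(samples, threshold):
    """Return tenants whose distinct instance count exceeds the threshold."""
    counts = instance_counts(samples)
    return sorted(t for t, n in counts.items() if n > threshold)


# Repeated samples for the same instance must only be counted once.
samples = [
    ("tenant-a", "vm-1"), ("tenant-a", "vm-2"), ("tenant-a", "vm-2"),
    ("tenant-b", "vm-3"),
]
print(instance_counts(samples))            # {'tenant-a': 2, 'tenant-b': 1}
print(tenants_over_threshold(samples, 1))  # ['tenant-a']
```

Note that the alarm is scoped to a tenant, yet it still names no particular resource.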

Well, I'd recommend removing the concept of alarms and resources being
*directly* tied to each other.

Cheers,
Eoghan

> I’ve updated the latest version of wireframes to
> reflect an example of an alarm like this (See Alarm 4 in tables). Also, I
> got rid of the required mark on Resource in the Add Alarm modal. I will be
> sending a link to these updated wireframes along with replies to Christian’s
> latest comments in the next few minutes...
> 
> Best,
> Liz
> 
> > 
> > Background: See ML post[1]
> > 
> > Discussion: See IRC log [2]
> >            Ctrl+F: "Let's see what the UI guys think about it"
> > 
> > Cheers,
> > Eoghan
> > 
> > [1]
> > http://lists.openstack.org/pipermail/openstack-dev/2014-June/037788.html
> > [2]
> > http://eavesdrop.openstack.org/irclogs/%23openstack-ceilometer/%23openstack-ceilometer.2014-06-16.log
> > 
> > 
> > ----- Original Message -----
> >> Hi all,
> >> 
> >> Thanks again for the great comments on the initial cut of wireframes. I’ve
> >> updated them a fair amount based on feedback in this e-mail thread along
> >> with the feedback written up here:
> >> https://etherpad.openstack.org/p/alarm-management-page-design-discussion
> >> 
> >> Here is a link to the new version:
> >> http://people.redhat.com/~lsurette/OpenStack/Alarm%20Management%20-%202014-06-05.pdf
> >> 
> >> And a quick explanation of the updates that I made from the last version:
> >> 
> >> 1) Removed severity.
> >> 
> >> 2) Added a Status column. I also added details showing that users can
> >> enable/disable alarms.
> >> 
> >> 3) Updated the Alarm creation workflow to include choosing the project and
> >> user (optionally, for filtering the resource list), choosing the resource,
> >> and choosing the amount of time to monitor for alarming.
> >>     -Perhaps we could be even more sophisticated in how we let users
> >>     filter down to find the resources they want to monitor for alarms?
> >> 
> >> 4) As for notifying users… I’ve updated the “Alarms” section to be “Alarms
> >> History”. The point here is to show any Alarms that have occurred to
> >> notify the user. Other notification ideas could be to allow users to get
> >> notified of alerts via e-mail (perhaps a user setting?). I’ve added a
> >> wireframe for this update in User Settings. Then the Alarms Management
> >> section would just be where the user creates, deletes, enables, and
> >> disables alarms. Do you still think we don’t need the “alarms” tab?
> >> Perhaps this just becomes iteration 2 and is left out for now, as you
> >> mention in your etherpad.
> >> 
> >> 5) Question about combined alarms… currently I’ve designed it so that a
> >> user could create multiple levels in the “Alarm When…” section. They could
> >> combine these with AND/ORs. Is this going far enough? Or do we actually
> >> need to allow users to combine Alarms that might watch different
> >> resources?
> >> 
> >> 6) I updated the Actions column to have the “More” drop down which is
> >> consistent with other tables in Horizon.
> >> 
> >> 7) Added a section in the “Add Alarm” workflow for “Actions after Alarm”.
> >> I’m thinking we could have some sort of “If State is X, do Y” type
> >> selections, but I’m looking to understand more details about how the
> >> backend works for this feature. Eoghan gave examples of logging and
> >> potentially scaling out via Heat. Would simple drop-downs support these
> >> events?
> >> 
> >> 8) I can definitely add a “scheduling” feature with respect to Alarms. I
> >> haven’t added it yet, but I could see this being very useful in future
> >> revisions of this feature.
> >> 
> >> 9) Another thought is that we could add some padding for outlier data, as
> >> Eoghan mentioned. Perhaps a setting for “This has happened 3 times over
> >> the last minute, so now send an alarm.”?
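On items 5 and 7, a minimal Python sketch of two ideas: combining the *states* of several alarms (possibly watching different resources) with `and`/`or`, and picking the actions configured for whichever state the alarm has just entered. The dictionary keys mirror the Ceilometer alarm fields `combination_rule`, `alarm_actions`, `ok_actions` and `insufficient_data_actions`, but the evaluation here is simplified (in particular, treating any sub-alarm without data as "insufficient data" is an assumption), and the alarm IDs are hypothetical.

```python
OK, ALARM, INSUFFICIENT = "ok", "alarm", "insufficient data"


def combine(states, operator):
    """Combine sub-alarm states with 'and' / 'or' (simplified sketch)."""
    if INSUFFICIENT in states:
        # Assumption: any sub-alarm lacking data makes the result undecided.
        return INSUFFICIENT
    firing = [s == ALARM for s in states]
    if operator == "and":
        fired = all(firing)
    elif operator == "or":
        fired = any(firing)
    else:
        raise ValueError("unknown operator: %r" % operator)
    return ALARM if fired else OK


def actions_for(alarm_def, new_state):
    """Return the action URLs configured for the state just entered."""
    key = {
        ALARM: "alarm_actions",
        OK: "ok_actions",
        INSUFFICIENT: "insufficient_data_actions",
    }[new_state]
    return alarm_def.get(key, [])


# Hypothetical combination alarm over two alarms on different resources.
combo = {
    "type": "combination",
    "combination_rule": {"alarm_ids": ["cpu-alarm-id", "disk-alarm-id"],
                         "operator": "and"},
    "alarm_actions": ["log://"],
    "ok_actions": [],
}

op = combo["combination_rule"]["operator"]
state = combine([ALARM, OK], op)
print(state, actions_for(combo, state))  # ok []
state = combine([ALARM, ALARM], op)
print(state, actions_for(combo, state))  # alarm ['log://']
```

A per-state list of action URLs maps fairly naturally onto an "If State is X, do Y" drop-down for each state.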
> >> 
> >> A new round of feedback is of course welcome :)
> >> 
> >> Best,
> >> Liz
> >> 
> >> On Jun 4, 2014, at 1:27 PM, Liz Blanchard <lsurette at redhat.com> wrote:
> >> 
> >>> Thanks for the excellent feedback on these, guys! I’ll be working on
> >>> making updates over the next week and will send a fresh link out when
> >>> done. Anyone else with feedback, please feel free to fire away.
> >>> 
> >>> Best,
> >>> Liz
> >>> On Jun 4, 2014, at 12:33 PM, Eoghan Glynn <eglynn at redhat.com> wrote:
> >>> 
> >>>> 
> >>>> Hi Liz,
> >>>> 
> >>>> Two further thoughts occurred to me after hitting send on
> >>>> my previous mail.
> >>>> 
> >>>> First, there is the concept of alarm dimensioning; see my RDO Ceilometer
> >>>> getting started guide[1] for an explanation of that notion.
> >>>> 
> >>>> "A key associated concept is the notion of dimensioning, which defines
> >>>> the set of matching meters that feed into an alarm evaluation. Recall
> >>>> that meters are per-resource-instance, so in the simplest case an alarm
> >>>> might be defined over a particular meter applied to all resources
> >>>> visible to a particular user. More useful, however, would be the option
> >>>> to explicitly select which specific resources we're interested in
> >>>> alarming on. At one extreme we would have narrowly dimensioned alarms,
> >>>> where this selection would have only a single target (identified by
> >>>> resource ID). At the other extreme, we'd have widely dimensioned alarms,
> >>>> where this selection identifies many resources over which the statistic
> >>>> is aggregated, for example all instances booted from a particular image
> >>>> or all instances with matching user metadata (the latter is how Heat
> >>>> identifies autoscaling groups)."
> >>>> 
> >>>> We'd have to think about how that concept is captured in the
> >>>> UX for alarm creation/update.
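One way a UI could capture dimensioning is to build the alarm's query filter from the user's selection. The sketch below uses the `field`/`op`/`value` shape of Ceilometer v2 query filters; the specific field names (`resource_id`, `metadata.image_ref`, `metadata.user_metadata.groupname`) and the helper functions are illustrative assumptions, not a confirmed mapping.

```python
def narrow_query(resource_id):
    """Narrowly dimensioned: a single target identified by resource ID."""
    return [{"field": "resource_id", "op": "eq", "value": resource_id}]


def wide_query(metadata):
    """Widely dimensioned: every resource whose metadata matches all pairs."""
    return [
        {"field": "metadata." + key, "op": "eq", "value": value}
        for key, value in sorted(metadata.items())
    ]


def describe_dimension(query):
    """Short label a UI could show alongside the alarm (hypothetical)."""
    fields = [q["field"] for q in query]
    if fields == ["resource_id"]:
        return "narrow: " + query[0]["value"]
    if not query:
        return "wide: all visible resources"
    return "wide: " + ", ".join("%s=%s" % (q["field"], q["value"]) for q in query)


print(describe_dimension(narrow_query("inst-42")))
# narrow: inst-42
print(describe_dimension(wide_query({"image_ref": "img-123"})))
# wide: metadata.image_ref=img-123
print(describe_dimension([]))
# wide: all visible resources
```

An empty query corresponds to the "simplest case" above: the meter applied to everything the user can see.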
> >>>> 
> >>>> Second, there are a couple of more advanced alarming features
> >>>> that were added in Icehouse:
> >>>> 
> >>>> 1. The ability to constrain alarms to time ranges, such that they
> >>>> would only fire, say, from 9-to-5 on weekdays. This would allow,
> >>>> for example, different autoscaling policies to be applied
> >>>> out-of-hours, when resource usage is likely to be cheaper and
> >>>> manual remediation less straightforward.
> >>>> 
> >>>> 2. The ability to exclude low-quality datapoints with anomalously
> >>>> low sample counts. This prevents the leading edge of the trend for
> >>>> widely dimensioned alarms from being skewed by eagerly reporting
> >>>> outliers.
> >>>> 
> >>>> Perhaps not in a first iteration, but at some point it may make sense
> >>>> to expose these more advanced features in the UI.
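Both features can be illustrated with a small Python sketch. The weekday 9-to-5 check is a simplified stand-in (the Icehouse API has its own representation for time constraints), and the rule of dropping datapoints whose sample count lies more than `k = 2` standard deviations below the mean is an assumption for illustration, not a description of Ceilometer's evaluator. The function names are hypothetical.

```python
import statistics
from datetime import datetime


def within_business_hours(ts, start_hour=9, end_hour=17):
    """True Monday-Friday, from start_hour (inclusive) to end_hour (exclusive)."""
    return ts.weekday() < 5 and start_hour <= ts.hour < end_hour


def exclude_low_counts(datapoints, k=2.0):
    """Drop datapoints whose sample count is more than k std devs below the mean."""
    counts = [d["count"] for d in datapoints]
    if len(counts) < 2:
        return list(datapoints)
    floor = statistics.mean(counts) - k * statistics.pstdev(counts)
    return [d for d in datapoints if d["count"] >= floor]


print(within_business_hours(datetime(2014, 6, 16, 10, 0)))  # True (a Monday)

# Nine well-populated periods plus one period backed by a single early sample.
datapoints = [{"avg": 40.0, "count": 10}] * 9 + [{"avg": 95.0, "count": 1}]
kept = exclude_low_counts(datapoints)
print(len(kept), max(d["avg"] for d in kept))  # 9 40.0
```

Without the exclusion, the lone eagerly reported 95.0 datapoint would dominate the leading edge of the trend.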
> >>>> 
> >>>> Cheers,
> >>>> Eoghan
> >>>> 
> >>>> [1] http://openstack.redhat.com/CeilometerQuickStart
> >>>> 
> >>>> 
> >>>> 
> >>>> ----- Original Message -----
> >>>>> 
> >>>>> Hi Liz,
> >>>>> 
> >>>>> Looks great!
> >>>>> 
> >>>>> Some thoughts on the wireframe doc:
> >>>>> 
> >>>>> * The description of form:
> >>>>> 
> >>>>>  "If CPU Utilization exceeds 80%, send alarm."
> >>>>> 
> >>>>> misses the time-window aspect of the alarm definition.
> >>>>> 
> >>>>> Whereas the boilerplate default descriptions generated by
> >>>>> ceilometer itself:
> >>>>> 
> >>>>>  "cpu_util > 70.0 during 3 x 600s"
> >>>>> 
> >>>>> captures this important info.
> >>>>> 
> >>>>> * The metric names, e.g. "CPU Utilization", are not an exact
> >>>>> match for the meter names used by ceilometer, e.g. "cpu_util".
> >>>>> 
> >>>>> * Non-admin users can create alarms in ceilometer:
> >>>>> 
> >>>>> "This is where admins can come in and
> >>>>> define and edit any alarms they want
> >>>>> the environment to use."
> >>>>> 
> >>>>> (though these alarms will only have visibility onto the stats
> >>>>> that would be accessible to the user on behalf of whom the
> >>>>> alarm is being evaluated)
> >>>>> 
> >>>>> * There's no concept currently of alarm severity.
> >>>>> 
> >>>>> * "Should users be able to enable/dis-able alarms."
> >>>>> 
> >>>>> Yes, the API allows for disabled (i.e. non-evaluated) alarms.
> >>>>> 
> >>>>> * "Should users be able to own/assign alarms?"
> >>>>> 
> >>>>> Only admin users can create an alarm on behalf of another
> >>>>> user/tenant.
> >>>>> 
> >>>>> * "Should users be able to acknowledge, close alarms?"
> >>>>> 
> >>>>> No, we have no concept of ACKing an alarm.
> >>>>> 
> >>>>> * "Admins can also see a full list of all Alarms that have
> >>>>> taken place in the past."
> >>>>> 
> >>>>> In ceilometer terminology, we refer to this as alarm history
> >>>>> or alarm change events.
> >>>>> 
> >>>>> * "CPU Utilization exceeded 80%."
> >>>>> 
> >>>>> Again, it would be good to capture the duration in that description of
> >>>>> the event.
> >>>>> 
> >>>>> * "Within the Overview section, there should be a new tab that allows
> >>>>> the user to click and view all Alarms that have occurred in their
> >>>>> environment."
> >>>>> 
> >>>>> Not really sure what "environment" means here. Non-admin tenants only
> >>>>> have visibility to their own alarms, whereas admins have visibility to
> >>>>> all alarms.
> >>>>> 
> >>>>> * "This list would keep the latest alarms."
> >>>>> 
> >>>>> Presumably this would be based on querying the alarm-history API,
> >>>>> as opposed to an assumption that Horizon is consuming the actual
> >>>>> alarm notifications?
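To show how the time window travels with the alarm definition, here is a minimal Python sketch that builds a description in the "cpu_util > 70.0 during 3 x 600s" style and evaluates the last `evaluation_periods` statistics. The rule keys follow Ceilometer's v2 threshold rule (`meter_name`, `comparison_operator`, `threshold`, `statistic`, `period`, `evaluation_periods`), but the description format simply mimics the example above and the evaluation is a simplified sketch, not Ceilometer's evaluator. It also covers the "3 times over the last minute" idea: the alarm fires only if every one of the last N periods breaches.

```python
import operator

# Maps comparison_operator to a Python comparison and a display symbol.
COMPARATORS = {
    "gt": (operator.gt, ">"),
    "ge": (operator.ge, ">="),
    "lt": (operator.lt, "<"),
    "le": (operator.le, "<="),
    "eq": (operator.eq, "=="),
    "ne": (operator.ne, "!="),
}


def describe(rule):
    """Build a description that keeps the time window, e.g. '... during 3 x 600s'."""
    symbol = COMPARATORS[rule["comparison_operator"]][1]
    return "%s %s %s during %d x %ds" % (
        rule["meter_name"], symbol, rule["threshold"],
        rule["evaluation_periods"], rule["period"])


def evaluate(rule, period_stats):
    """Evaluate per-period statistics (already aggregated with rule['statistic'])."""
    n = rule["evaluation_periods"]
    recent = period_stats[-n:]
    if len(recent) < n:
        return "insufficient data"
    compare = COMPARATORS[rule["comparison_operator"]][0]
    breaches = [compare(value, rule["threshold"]) for value in recent]
    if all(breaches):
        return "alarm"
    if not any(breaches):
        return "ok"
    return None  # mixed evidence: leave the current state unchanged


rule = {"meter_name": "cpu_util", "comparison_operator": "gt",
        "threshold": 70.0, "statistic": "avg",
        "period": 600, "evaluation_periods": 3}

print(describe(rule))                    # cpu_util > 70.0 during 3 x 600s
print(evaluate(rule, [50, 75, 80, 90]))  # alarm
print(evaluate(rule, [75, 60, 90]))      # None
```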
> >>>>> 
> >>>>> Cheers,
> >>>>> Eoghan
> >>>>> 
> >>>>> ----- Original Message -----
> >>>>>> Hi All,
> >>>>>> 
> >>>>>> I’ve recently put together a set of wireframes[1] around Alarm
> >>>>>> Management that would support the following blueprint:
> >>>>>> https://blueprints.launchpad.net/horizon/+spec/ceilometer-alarm-management-page
> >>>>>> 
> >>>>>> If you have a chance, it would be great to hear any feedback that
> >>>>>> folks have on this direction moving forward with Alarms.
> >>>>>> 
> >>>>>> Best,
> >>>>>> Liz
> >>>>>> 
> >>>>>> [1]
> >>>>>> http://people.redhat.com/~lsurette/OpenStack/Alarm%20Management%20-%202014-05-30.pdf
> >>>>>> 
> >>>>>> _______________________________________________
> >>>>>> OpenStack-dev mailing list
> >>>>>> OpenStack-dev at lists.openstack.org
> >>>>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >>>>>> 
> >>>>> 
> >>> 
> >> 
> >> 
> 
> 


