[openstack-dev] [nova][ceilometer] model for ceilo/nova interaction going forward

Eoghan Glynn eglynn at redhat.com
Fri Nov 16 11:44:03 UTC 2012



> >> We don't *have* to use rabbit to handle the notifications. The
> >> notification system could easily be extended to allow different
> >> event types to use different notifiers. For example, billing events
> >> could go to rabbit while monitoring/lifecycle events could go to a
> >> log file. Or, if we wanted to introduce a debugging event, that
> >> could send to statsd (or something) directly as it strips out the
> >> cruft from the message.
> 
> >Yes, so in ceilometer we've been making small steps towards that
> >idea with the concept of multiple publishers and transformers.
> >So a transformer would know how to distill what a publisher needs
> >from the raw data (strip out the cruft & massage into the expected
> >format) and then the publisher knows how to emit the data via
> >some conduit (over rabbit, or a UDP packet for stats, an RRD file,
> >a CloudWatch PutMetricData call, etc.).
> >
> > So I'm thinking we're not a million miles from each other on that
> > score, other than I had been assuming the publishers would live
> > in ceilo, and it would understand their requirements in terms of
> > cadence etc.
> > 
> > Were you more thinking of this logic living elsewhere?
> 
> So this should be a Ceilometer notifier that lives in the Ceilometer
> code base and is a nova.conf --notification_driver setting for
> whoever deploys it. This implies there are two ways to get
> notifications out of Nova:
> 1. via the Rabbit Notifier with an external Worker/Consumer
> (preferred for monitoring/usage/billing)
> 2. via a specific Notifier (like
> https://github.com/openstack/nova/blob/master/nova/openstack/common/notifier/log_notifier.py)
> 
> Stuff such as converting format and varying polling rates seems to be
> something external to nova, since the system will only issue
> notifications at the rate the underlying events occur. Higher
> sampling rates I think fall into the instrumentation category and
> should be dealt with separately.

OK, I think we need to draw an (admittedly blurry) line between
instrumentation and monitoring.

For me monitoring is mostly about coarse-grained observables that
allow user-oriented questions to be asked about cloud resources:

 - are my instances running hot?

 - are my volumes falling behind with queued I/O?

 - is my load balancer spitting out many 503s?

... etc.

Whereas instrumentation to me implies much more internal-facing
and fine-grained concerns such as:

 - what's the fault-rate & latency for a particular API?

 - how much time is being spent accessing the DB?

 - how many idle connections are currently in some pool? 

I'm maybe stating the obvious above, but the point is that it's
the type of question being asked that distinguishes monitoring
from instrumentation, not the sampling rate.

For certain types of monitoring, I think we do need relatively
high sampling rates (e.g. once or twice a minute) at a near-constant
interval (61s, 59s, 62s, ... as opposed to 45s, 75s, 52s, ...).
In that case, I'm not sure we can rely on the cadence of the 
notifications issued by a busy nova compute service. 
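
To make that concrete, here's a very rough sketch of what I have in
mind for a compute-side poll loop feeding the transformer/publisher
chain mentioned above (names and interfaces are purely illustrative,
not the actual ceilo code): the next deadline is fixed up-front, so
the cadence stays near-constant regardless of how long the sampling
itself takes.

    import time

    POLL_INTERVAL = 60.0  # target cadence in seconds

    class Transformer(object):
        """Distill what a publisher needs from the raw sample."""
        def transform(self, sample):
            # strip out the cruft & massage into the expected format
            return {'name': sample['name'],
                    'volume': sample['volume'],
                    'timestamp': sample['timestamp']}

    class UdpPublisher(object):
        """Emit the massaged data via some conduit (UDP here)."""
        def publish(self, data):
            pass  # socket.sendto(...) elided in this sketch

    def poll_forever(get_samples, transformer, publishers):
        # fix the next deadline up-front, so the interval stays close
        # to 60s (61s, 59s, 62s, ...) rather than drifting by however
        # long the hypervisor query happens to take
        next_run = time.time()
        while True:
            next_run += POLL_INTERVAL
            for sample in get_samples():       # e.g. query libvirt
                data = transformer.transform(sample)
                for publisher in publishers:
                    publisher.publish(data)
            delay = next_run - time.time()
            if delay > 0:
                time.sleep(delay)

The point being that the cadence is owned by the poller, rather than
inferred from whenever a busy nova-compute happens to get around to
emitting a notification.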


> >Certainly there would still be some time sensitivity, particularly
> >for metrics feeding into near-realtime monitoring. So for metering
> >feeding into non-realtime consumers (such as billing), we can
> >tolerate a bit of irregularity in the cadence and some delays in
> >the pipeline, as long as we maintain completeness. Whereas for
> >monitoring, we need to get at that data while it's still fresh and
> >ensure it's sampled at a near-constant rate.
> 
> Sounds like instrumentation to me.

See above; it could be considered instrumentation or monitoring IMO,
depending on the granularity of the question being asked.


> >> Also, they are ancillary to the task at hand (providing a cloud
> >> service) so their failure should not bring down the system. Which is
> >> why a queue-based approach seems the logical choice. Having nova
> >> call out seems wrong and if it did, it belongs as a new rabbit
> >> notifier where the person deploying that solution takes all
> >> responsibility.
> 
> >True that, we certainly need to be cognizant of the load imposed
> >on a possibly degraded system potentially making things worse.
> >Hence the leeriness about garnering the info ceilo needs from
> >the public nova-api.
> 
> The public api has no place for this stuff. I must have missed it,
> but where was that being proposed? Hitting HTTP for metrics is just
> wrong.

Yep, I think you're right about that. We kicked around some ideas
earlier on this thread that involved hitting nova-api, but I think
we're mostly agreed that's not the way to go. 


> >> The existing information gathered from the hypervisors could easily
> >> be extended with optional sections to cover all use cases. Much the
> >> same way MP3 and JPG have optional data blocks. Notifications do not
> >> use the Nova RPC protocol and should be versioned separately from
> >> it. The entire structure of the notification should be changed to
> >> allow for these "optional" blocks ... not only for flexibility, but
> >> to reduce the already monstrous payload size (do we need to do 2-3
> >> db accesses every time we send a notification?)
> 
> >So with nova-compute losing its direct database access, then 2-3 DB
> >accesses per notification is not going to be a runner - all the
> >information we're talking about extracting here will I think have to
> >be available from the hypervisor, possibly mixed in with some cached
> >data retrieved by ceilo from the nova-api (e.g. on every polling cycle
> >we wouldn't want to go back to the nova-api to figure out the instance
> >flavor name, if that's not directly exposed by nova-compute but is
> >needed for metering purposes).
> 
> The big requirement for it today is for network information, which is
> already being cached in the Quantum driver. If we can separate the
> network and the compute notifications I think we're ok. Likewise
> with storage. The downside is we wouldn't be getting these
> notifications as atomic updates and that can lead to race
> conditions. But, so long as the time stamps are accurate within
> reason (NTP), we should be ok there. If we're dropping events we've
> got bigger issues to deal with.

Yep.
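
And just to make the "optional blocks" idea above a bit more concrete,
I'd picture a trimmed-down payload along these lines (field names are
purely illustrative, not a proposed schema), with the payload versioned
separately from the RPC protocol and the per-resource blocks carried
only when the emitting service already has the data to hand:

    notification = {
        'event_type': 'compute.instance.exists',
        'timestamp': '2012-11-16T11:44:03Z',
        'payload_version': '2.0',   # versioned independently of nova RPC
        'payload': {
            'instance_id': 'some-uuid',
            'state': 'active',
            # optional blocks below: only carried when the emitting
            # service already has the data to hand, so the common case
            # stays small and avoids extra DB lookups per notification;
            # network/volume details would arrive in separate
            # notifications from those services rather than here
            'compute': {'vcpus': 1, 'memory_mb': 512},
        },
    }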


> Another possibility we're exploring is having a read-only mirror of
> the production database for this sort of stuff. That could be the
> "best practice" in these tight situations. But that's a story for
> another time :)

Interesting idea.


> So, we need to revisit the notification format wrt versioning,
> structure, payload size, content and overhead. Getting the data out
> and doing something with it is easily do-able via a worker/consumer
> or a proprietary notifier (and with no impact on nova core).

OK, there may be a terminology gap here: can you explain what you
mean by a "proprietary notifier" ... a non-standard notification_driver
that can be plugged into nova?
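
If so, just so we're picturing the same thing, I'd read that as a
module with the same shape as the log_notifier linked above, i.e.
exposing a notify() hook that nova calls with each message - a rough
sketch only, assuming that same entry point, with the module name and
endpoint invented purely for illustration:

    # ceilometer_notifier.py - wired in via the notification_driver
    # setting in nova.conf mentioned earlier
    import json
    import socket

    AGENT_ADDR = ('127.0.0.1', 4952)   # illustrative local agent endpoint

    def notify(context, message):
        """Hand each notification straight to a local ceilometer agent."""
        try:
            payload = json.dumps(message).encode('utf-8')
            sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            sock.sendto(payload, AGENT_ADDR)
        except Exception:
            # a failed notification must never take the compute
            # service down with it
            pass

i.e. your option 2, with all the responsibility on whoever configures
it, as opposed to the preferred worker/consumer on the rabbit side.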


> Next we need to be very clear on what is instrumentation and what is
> monitoring/usage/billing/lifecycle.

Yes, see above ... I think the metering use-case is clearer, whereas
there's some blurriness in how the boundary between monitoring and
instrumentation is generally understood.
 
Cheers,
Eoghan


