Open Stack

Thu Nov 15 13:50:54 UTC 2012

Hmm, I'm not sure if this post is intended to be a reply to my previous post about stacktach-ceilometer integration or not, but here goes:

My biggest concern is that this solution still doesn't differentiate between instrumentation and metering/monitoring. It sounds like we are still mixing requirements when these are two very different animals. Our solution for instrumentation (fast/frequent/small/unreliable) will quite likely have to be different from our usage/monitoring solution (large/slower/reliable).

I'm going to leave the instrumentation discussion for a different thread and focus on lifecycle/billing/usage/monitoring (call it what you like). Also, I'm not sure what the "ceilo message bus" is, so I'm going to assume it's the notification queues?

We don't *have* to use rabbit to handle the notifications. The notification system could easily be extended to allow different event types to use different notifiers. For example, billing events could go to rabbit while monitoring/lifecycle events could go to a log file. Or, if we wanted to introduce a debugging event, its notifier could strip the cruft out of the message and send the result to statsd (or something) directly.
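
To make the idea concrete, here is a rough sketch of per-event-type routing. This is not the existing nova notifier code: the RoutingNotifier class, the "billing." and "debug." prefixes, and the list-backed stand-ins for rabbit, the log file and statsd are all made up for illustration.

```python
import logging
from typing import Callable, Dict

LOG = logging.getLogger(__name__)

# A notifier is just a callable that takes the notification dict.
Notifier = Callable[[dict], None]


class RoutingNotifier:
    """Hypothetical dispatcher: picks a notifier by event_type prefix."""

    def __init__(self, default: Notifier) -> None:
        self._default = default
        self._routes: Dict[str, Notifier] = {}

    def register(self, prefix: str, notifier: Notifier) -> None:
        self._routes[prefix] = notifier

    def notify(self, message: dict) -> None:
        event_type = message.get("event_type", "")
        notifier = self._default
        # The longest matching prefix wins.
        for prefix in sorted(self._routes, key=len, reverse=True):
            if event_type.startswith(prefix):
                notifier = self._routes[prefix]
                break
        try:
            notifier(message)
        except Exception:
            # Notifications are ancillary: log the failure and carry on.
            LOG.exception("notifier failed for %s", event_type)


# Stand-ins for the real back ends (assumptions, not real drivers).
rabbit_queue, log_lines, statsd_samples = [], [], []

router = RoutingNotifier(default=log_lines.append)
router.register("billing.", rabbit_queue.append)
# The debug notifier strips the message down to a (name, value) pair.
router.register(
    "debug.",
    lambda m: statsd_samples.append((m["event_type"], m["payload"]["value"])),
)

router.notify({"event_type": "billing.usage", "payload": {"hours": 3}})
router.notify({"event_type": "lifecycle.create", "payload": {}})
router.notify({"event_type": "debug.timing", "payload": {"value": 12}})
```

Anything without a registered prefix falls through to the default notifier, so a deployer only configures the event types they care about.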

So, the messages that we are interested in are larger/less frequent/less time sensitive (within reason) and very important. Also, they are ancillary to the task at hand (providing a cloud service), so their failure should not bring down the system. That is why a queue-based approach seems the logical choice. Having nova call out seems wrong, and if it did, that call belongs in a new rabbit notifier where the person deploying that solution takes all responsibility.

The existing information gathered from the hypervisors could easily be extended with optional sections to cover all use cases, much the same way MP3 and JPG have optional data blocks. Notifications do not use the Nova RPC protocol and should be versioned separately from it. The entire structure of the notification should be changed to allow for these "optional" blocks ... not only for flexibility, but to reduce the already monstrous payload size (do we need to do 2-3 db accesses every time we send a notification?)
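
Something along these lines is what I have in mind for the "optional" blocks. It is only a sketch: the field names, the NOTIFICATION_VERSION constant and the block names are invented here and are not the current notification format.

```python
# Hypothetical version of the notification format, kept separate
# from whatever version the Nova RPC protocol is at.
NOTIFICATION_VERSION = "1.0"


def build_notification(event_type, core, optional_blocks=None):
    """Build a notification with a small core and opt-in blocks.

    optional_blocks maps a block name to a zero-argument callable.
    The callable only runs for blocks that were asked for, so an
    expensive lookup (e.g. a db access) is skipped when the block is
    not enabled.
    """
    message = {
        "version": NOTIFICATION_VERSION,
        "event_type": event_type,
        "core": dict(core),
        "blocks": {},
    }
    for name, producer in (optional_blocks or {}).items():
        message["blocks"][name] = producer()
    return message


def read_block(message, name, default=None):
    """Consumers ignore blocks they don't know and tolerate missing ones."""
    return message.get("blocks", {}).get(name, default)


db_calls = []


def load_bandwidth():
    # Stand-in for a db access; records that it was called.
    db_calls.append("bandwidth")
    return {"public": {"bw_in": 0, "bw_out": 0}}


small = build_notification("lifecycle.create", {"instance_id": "abc"})
full = build_notification(
    "billing.usage",
    {"instance_id": "abc"},
    optional_blocks={"bandwidth": load_bandwidth},
)
```

The small message never touches the db, and a consumer that asks for a block that isn't there just gets its default back.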

-S

________________________________________
From: Eoghan Glynn [eglynn at redhat.com]
Sent: Wednesday, November 14, 2012 1:28 PM
To: OpenStack Development Mailing List
Subject: [openstack-dev] [nova][ceilometer] model for ceilo/nova interaction going forward

Folks,

TL;DR: soliciting feedback on the best (most stable/supportable)
approach for ceilometer to interact with nova going forward.

Currently ceilo both consumes notifications from nova (instance
lifecycle events & the like) and also periodically polls libvirt to
extract more detailed info. This latter mechanism uses internal nova
classes, so we want to move towards a model that is more stable and
supportable into the future.

We are also currently limited to libvirt, so it would make sense to
move towards a more hypervisor-agnostic position, or at least to
provide wider support.

Now, there are at least 4 different approaches that could be followed,
each with its own advantages and disadvantages, so I just wanted to
call these out so as to solicit some feedback and guidance from the nova
domain experts ...

1. Extend the existing os-server-diagnostics API extension to expose
   any additional stats that ceilo needs.

   +  would allow the ceilo compute agent to be scaled independently
      of the nova-compute node (i.e. no need for a 1:1 correspondence)
   -  the diagnostics returned are currently hypervisor-specific
   -  the additional nova-api-->nova-compute RPC call would add lag
      and impact timeliness for metrics gathering

2. Call the nova get_diagnostics RPC directly (as per the experimental
   patch proposed by Yunhong Jiang https://review.openstack.org/15952),
   or use a new RPC message specifically designed for this purpose.

   +/- as for #1, but also removes the lag involved in an additional
       hop between nova services
   -   calling RPC directly would expose ceilo to a much less stable
       (i.e. rapidly rev'd) API than would be the case for #1

3. Have nova itself emit metering messages directly onto the ceilo
   message bus, encompassing both lifecycle events and usage stats,
   to be picked up and persisted by the ceilo collector or other agent.

   - leaks ceilo concerns into nova
   - requires message bus usage, probably inappropriate for time-
     sensitive measurements feeding into near-realtime metrics.

4. Invert control and have the nova compute service itself call into a
   ceilo-provided API that abstracts the conduit used for publication
   (could be via the message bus, or UDP, or a direct call to a CW API)

   - a loaded nova compute service may fall behind in this periodic
     task, especially if the reporting cadence is configured high

So the question is how the nova domain experts see these options sizing
up?

Personally I'm liking option #2, aside from a lingering concern about
how rapidly RPC versioning is rev'd (which suggests the more sedate
pace of API versioning would be easier to consume). Also some statement
on whether RPC is envisaged as being externally-callable would be good.

Thoughts/feedback most welcome ...

Thanks,
Eoghan

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev at lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
