[openstack-dev] [new][cloudpulse] Announcing a project to HealthCheck OpenStack deployments

Ian Wells ijw.ubuntu at cack.org.uk
Thu May 14 00:04:51 UTC 2015


On 13 May 2015 at 10:30, Vinod Pandarinathan (vpandari) <vpandari at cisco.com>
wrote:

> - Traditional monitoring tools (Nagios, Zabbix, ...) are necessary anyway
> for infrastructure monitoring (CPU, RAM, disks, operating system, RabbitMQ,
> databases and more) and diagnostic purposes. Adding OpenStack service
> checks is fairly easy if you already have the toolchain.
>
> The solution is for health-checking, which includes periodically running
> light/mid/heavy control and data plane tests and providing test data.
> The tool shall not have any dependency on one particular monitoring tool.
> If a monitoring tool is installed, then monitoring data shall be exposed to
> the applications in a consumable fashion.
> As I mentioned earlier, we are not replacing any monitoring solution
> available out there; we are leveraging those solutions
> and providing a clean interface so that the applications/tenants and
> operators know if the cloud is healthy.
>

To rephrase this:

- Zabbix and friends will monitor an operator's cloud and tell the operator
bad things are happening.  Or they can monitor an application's VMs and see
if the app is happy, and tell the app or its owner.
- Ceilometer will front cloud monitoring solutions and offer those
statistics to tenants of the cloud in ways that (ideally) make sense to the
client.  It lets tenants see stats they couldn't get for themselves.

This isn't quite what we're trying to address.  We had one specific use
case: a cloud application that needs to provide reasonably high
availability and occasionally uses the OpenStack APIs to correct
problems (VM died, app overloaded, etc.) - a pretty normal cloud
application.  If you're interested in maintaining service, you need to know
about single points of failure so you can work around them, and the cloud
control plane is one.  When it fails the APIs stop working, but the app
runs just fine until a second failure forces it to use them; if you
haven't done something by that point you get a meltdown.  The idea
of CloudPulse was to be able to say 'the cloud APIs are operating normally'
to applications that are interested.  If they're *not* normal then the
application can take corrective action; for instance, spinning up extra
capacity in another cloud and moving traffic over there.
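
As a rough sketch of the application side of this, in Python: every name
here (pick_cloud, the cloud names, the per-cloud health callables) is
hypothetical and not part of CloudPulse or any OpenStack API.  It only
assumes that each cloud offers some call that answers 'are the APIs
operating normally?'.

```python
from typing import Callable, Mapping, Optional


def _ok(is_working: Callable[[], bool]) -> bool:
    """Treat an unreachable or erroring health endpoint as 'not working'."""
    try:
        return bool(is_working())
    except Exception:
        return False


def pick_cloud(health: Mapping[str, Callable[[], bool]],
               preferred: str) -> Optional[str]:
    """Return the preferred cloud if its APIs work, else the first
    alternative whose APIs work, else None."""
    if _ok(health[preferred]):
        return preferred
    for name, is_working in health.items():
        if name != preferred and _ok(is_working):
            return name
    return None
```

The application would call something like this before it needs the APIs,
so that it can move capacity while it is still running normally.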

As you can see, that's a cross-domain sort of monitoring similar to
Ceilometer - the tenant finding out information about the infrastructure
that they can't see directly.  That said, the result is a very concise
summary ('working'), and we also had in mind that the tests would run on
demand, to freshen the results when they hadn't been run recently, rather
than looping continually.  Also, the history of the results is not really
relevant - my app cares about whether the control plane works *now*, not
whether it worked for 8 hours out of the last 24.
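
A minimal sketch of that freshen-on-demand idea, in Python.  The names
(HealthCache, run_checks, max_age) are made up for illustration and are not
CloudPulse's interface; the point is only that a recent result is reused,
a stale one triggers a rerun of the tests, and no history is kept.

```python
import time
from typing import Callable, Optional


class HealthCache:
    """Keeps only the latest result and reruns the checks when it is stale."""

    def __init__(self, run_checks: Callable[[], bool], max_age: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._run_checks = run_checks  # hypothetical control plane test
        self._max_age = max_age        # seconds a result counts as fresh
        self._clock = clock            # injectable so it can be tested
        self._result: Optional[bool] = None
        self._checked_at: Optional[float] = None

    def is_working(self) -> bool:
        """Answer 'does the control plane work now?', rerunning if stale."""
        now = self._clock()
        stale = (self._checked_at is None
                 or now - self._checked_at > self._max_age)
        if stale:
            self._result = bool(self._run_checks())
            self._checked_at = now
        return bool(self._result)
```

The max_age value is the trade-off: a smaller one gives a fresher answer at
the cost of running the tests more often.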

We're scratching an itch.  Absolutely the point of mailing everyone about
it was to see if anyone had better scratching tools, and if people would
like to chat about it at the summit.  What seems to have come out of it is
that yes, there are tools out there that might be usable for the purpose,
and we'd love to hear your opinions and what ideas you have about how we
should do this.  Apparently there are also a lot of people with slightly
different itches to scratch, and I hope you all take the opportunity to get
together at the summit too.
-- 
Ian.