[openstack-dev] [Nova][Heat] How to reliably detect VM failures? (Zane Bitter)
rw314w at att.com
Wed Mar 19 13:04:55 UTC 2014
> On 03/18/2014 07:54 AM, Qiming Teng wrote:
>> Hi, Folks,
>> I have been trying to implement a HACluster resource type in Heat. I
>> haven't created a BluePrint for this because I am not sure everything
>> will work as expected.
>> The most difficult issue here is to come up with a reliable VM failure
>> detection mechanism. The service_group feature in Nova is concerned only
>> with the OpenStack services themselves, not the VMs. Considering that
>> user-provided images can be used in our customer's cloud environment,
>> we cannot assume that agents in the VMs will send heartbeat signals.
[Roger] My response is more user-oriented than developer-oriented, but since
this was asked on dev, here goes:
When enabled, the hypervisor continuously collects basic CPU and memory
statistics and sends them to Ceilometer, where you can set alarms on them.
For external monitoring, consider setting up a Nagios or Selenium server
for agentless monitoring. You can have Nagios do the most basic heartbeat
(ping) test: if the ping fails, or stays slow for a period of, say, five
minutes, raise an alarm that you have a network problem. You can use Selenium
to execute synthetic transactions against whatever service the server is
supposed to provide; if a transaction succeeds for you, you can assume the
service is working for everyone else. If it fails, you can take action.
You can also use Selenium to re-run selected OpenStack test cases to ensure
your infrastructure is working properly.
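
The heartbeat rule described above (alarm on a failed ping, or on a ping that
stays slow for about five minutes) could be sketched in Python roughly as
follows. This is only an illustration, not what Nagios does internally: the
class name HeartbeatMonitor, the 0.5 second "slow" threshold, and the
ping_once helper (which assumes a Linux-style ping command) are all
assumptions made for the example.

```python
import subprocess
import time

SLOW_THRESHOLD_S = 0.5  # assumed: a reply slower than this counts as "slow"
WINDOW_S = 300          # "say five minutes", as in the text above


def ping_once(host, timeout_s=2):
    """Send one ping; return elapsed seconds, or None if it failed.

    Assumes a Linux-style `ping` (-c count, -W timeout in seconds).
    The elapsed time includes process start-up, so it is only a rough latency.
    """
    start = time.monotonic()
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if result.returncode != 0:
        return None
    return time.monotonic() - start


class HeartbeatMonitor:
    """Track probe results and decide when an alarm is warranted."""

    def __init__(self, slow_threshold=SLOW_THRESHOLD_S, window=WINDOW_S):
        self.slow_threshold = slow_threshold
        self.window = window
        self.slow_since = None   # start time of the current run of slow replies
        self.last_seen = None    # time of the most recent probe
        self.last_failed = False

    def record(self, timestamp, latency):
        """Store one probe result; latency is None when the probe failed."""
        self.last_seen = timestamp
        self.last_failed = latency is None
        if latency is not None and latency > self.slow_threshold:
            if self.slow_since is None:
                self.slow_since = timestamp
        else:
            # A fast reply or an outright failure ends the slow run.
            self.slow_since = None

    def status(self):
        """Return 'FAILED', 'SLOW', or 'OK' for the latest state."""
        if self.last_failed:
            return "FAILED"
        if (self.slow_since is not None
                and self.last_seen - self.slow_since >= self.window):
            return "SLOW"
        return "OK"
```

A caller would probe on a fixed interval, pass each result to
record(time.monotonic(), ping_once(host)), and alarm whenever status() returns
something other than "OK".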
>> I have checked the 'instances' table in the Nova database; it seems that
>> the 'updated_at' column is only updated when a VM state change is
>> reported. If the 'heartbeat' messages are coming in from many VMs very
>> frequently, there could be a DB query performance/scalability issue,
[Roger] For high-volume time-series collection, consider a non-relational
system such as RRDTool, PyRRD, or Graphite if you want to store the history and look
>> So, how can I detect VM failures reliably, so that I can notify Heat
>> to take the appropriate recovery action?
[Roger] When Nagios detects a problem, have it kick off an appropriate
shell script that invokes the Heat API (or another API) to fix the issue with
the cluster. I think you were hoping that Heat could be coded to automagically
fix any issue, but you may need to be more specific: develop concrete use cases
for what you mean by "VM failure", as the desired action may differ depending
on the type of failure.
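
To make the "different action per failure type" point concrete, here is a
minimal sketch of a Nagios-style event handler, written in Python rather than
shell. It assumes the Nagios command definition passes a check name followed
by the $SERVICESTATE$ and $SERVICESTATETYPE$ macros as arguments. The check
names, the action names, and the RECOVERY_ACTIONS table are hypothetical, and
the actual call into Heat is deliberately left out because it depends on how
the stack is defined.

```python
import sys

# Hypothetical mapping from the failed check to a recovery action.
# Real use cases would define their own checks and actions.
RECOVERY_ACTIONS = {
    "ping": "replace_server",   # VM unreachable: ask Heat to replace it
    "http": "restart_service",  # VM up but service down: restart in place
}


def choose_action(check, state, state_type):
    """Pick a recovery action, or None if no action should be taken.

    Only a confirmed problem (CRITICAL state of type HARD) triggers
    recovery; SOFT states are still being re-checked by Nagios.
    """
    if state != "CRITICAL" or state_type != "HARD":
        return None
    # Unknown failure types fall back to a human.
    return RECOVERY_ACTIONS.get(check, "notify_operator")


def main(argv):
    """Entry point: argv is [script, check, state, state_type]."""
    if len(argv) != 4:
        print("usage: handler.py CHECK STATE STATETYPE", file=sys.stderr)
        return 2
    action = choose_action(argv[1], argv[2], argv[3])
    if action is not None:
        # Placeholder: this is where the Heat API (or another API)
        # would be invoked to carry out the chosen action.
        print(action)
    return 0
```

Run as a script, this would end with sys.exit(main(sys.argv)). The useful part
is the table: each entry is one specific use case for what "VM failure" means
and what should happen in response.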
> Check out