[openstack-dev] [Nova][Heat] How to reliably detect VM failures? (Zane Bitter)

WICKES, ROGER rw314w at att.com
Wed Mar 19 13:04:55 UTC 2014

> On 03/18/2014 07:54 AM, Qiming Teng wrote:
>> Hi, Folks,
>>    I have been trying to implement a HACluster resource type in Heat. I
>> haven't created a BluePrint for this because I am not sure everything
>> will work as expected.
>>    The most difficult issue here is to come up with a reliable VM failure
>> detection mechanism.  The service_group feature in Nova only concerns
>> the OpenStack services themselves, not the VMs.  Considering that
>> in our customer's cloud environment, user provided images can be used,
>> we cannot assume some agents in the VMs to send heartbeat signals.

[Roger] My response is more user-oriented than developer-oriented, but
this was asked on the dev list, so here goes:

When enabled, the hypervisor continuously collects basic CPU and memory
statistics and sends them to Ceilometer, where you can set alarms on them.

For external monitoring, consider setting up a Nagios or Selenium server
for agentless monitoring. You can have it run the most basic heartbeat
(ping) test: if the ping is slow for a period of, say, five minutes, or
fails outright, raise an alarm that you have a network problem. You can use
Selenium to execute synthetic transactions against whatever service the
server is supposed to provide; if a transaction succeeds for you, you can
assume it is succeeding for everyone else. If it fails, you can take action.
You can also use Selenium to re-run selected OpenStack test cases to ensure
your infrastructure is working properly.
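
As a rough illustration of the ping rule above, here is a minimal Python
sketch. It assumes the monitor already records a timestamp and round-trip
time for each ping (None for a lost ping). The function name, the sample
format, and the 200 ms threshold are made up for illustration; none of
this is part of Nagios or Selenium.

```python
from typing import List, Optional, Tuple

# One sample is (unix_timestamp_seconds, round_trip_ms); None means no reply.
Sample = Tuple[float, Optional[float]]


def classify_heartbeat(samples: List[Sample],
                       slow_ms: float = 200.0,
                       window_s: float = 300.0) -> str:
    """Return 'down', 'slow', or 'ok' for a series of ping samples.

    'down': there is no data, or the most recent ping got no reply.
    'slow': at least window_s seconds of history exist and every ping in
            the last window_s seconds was lost or slower than slow_ms.
    'ok':   anything else.
    The default thresholds are illustrative assumptions only.
    """
    if not samples:
        return "down"
    ordered = sorted(samples, key=lambda s: s[0])
    latest_ts, latest_rtt = ordered[-1]
    if latest_rtt is None:
        return "down"
    recent = [rtt for ts, rtt in ordered if latest_ts - ts <= window_s]
    have_full_window = latest_ts - ordered[0][0] >= window_s
    all_slow = all(rtt is None or rtt > slow_ms for rtt in recent)
    if have_full_window and all_slow:
        return "slow"
    return "ok"
```

Requiring a full window of history before reporting 'slow' avoids alarming
on a single slow ping right after monitoring starts.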

>>    I have checked the 'instances' table in the Nova database; it seems
>> that the 'updated_at' column is only updated when a VM state change is
>> reported.  If 'heartbeat' messages are coming in from many VMs very
>> frequently, there could be a DB query performance/scalability issue,
>> right?

[Roger] For high-volume, time-series collection, consider moving to a
non-relational system such as RRDTool, PyRRD, or Graphite if you want to
store the history and look for trends.
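
For example, Graphite's Carbon daemon accepts one "path value timestamp"
line per sample over TCP (port 2003 by default). A minimal Python sketch,
assuming a Carbon listener is reachable; the host name and the metric
naming scheme (vms.<id>.heartbeat) are hypothetical:

```python
import socket
import time
from typing import Optional


def format_metric(path: str, value: float,
                  timestamp: Optional[int] = None) -> str:
    """Build one line of Graphite's plaintext protocol."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


def send_heartbeat(vm_id: str, alive: bool,
                   host: str = "graphite.example.com",  # hypothetical host
                   port: int = 2003) -> None:
    """Send a 1/0 heartbeat sample for one VM to a Carbon listener."""
    # The metric path layout is an assumption made for this sketch.
    line = format_metric(f"vms.{vm_id}.heartbeat", 1 if alive else 0)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

This keeps frequent heartbeat writes out of the Nova database entirely.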

>>    So, how can I detect VM failures reliably, so that I can notify Heat
>> to take the appropriate recovery action?

[Roger] When Nagios detects a problem, have it kick off an appropriate
shell script that invokes the Heat API (or another API) to fix the issue
with the cluster. I think you were hoping that Heat could be coded to
automagically fix any issue, but you may need to be more specific: develop
specific use cases for what you mean by "VM failure", as the desired action
may differ depending on the type of failure.
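
To make the "different action per failure type" point concrete, here is a
small Python sketch of the decision an event handler might make. It
assumes the handler is passed the check name plus the Nagios
$SERVICESTATE$ and $SERVICESTATETYPE$ macro values; the check names and
action names are hypothetical, and the actual Heat call is deliberately
left out.

```python
# Hypothetical mapping from a Nagios check name to a recovery action.
ACTIONS = {
    "ping": "replace_vm",       # VM unreachable on the network
    "http": "restart_service",  # VM is up but the application is not
}


def choose_action(check: str, state: str, state_type: str) -> str:
    """Pick a recovery action for one Nagios state change.

    Only confirmed (HARD) CRITICAL states trigger recovery; SOFT states
    are still being retried by Nagios and are ignored here.
    """
    if state != "CRITICAL" or state_type != "HARD":
        return "none"
    # Unknown failure types fall back to a human rather than automation.
    return ACTIONS.get(check, "notify_operator")
```

The returned action name would then be dispatched to whatever script
actually talks to Heat for that use case.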

> Qiming,
> Check out
> https://github.com/openstack/heat-templates/blob/master/cfn/F17/WordPress_Single_Instance_With_HA.template
