[openstack-dev] [Nova][Heat] How to reliably detect VM failures?

Qiming Teng tengqim at linux.vnet.ibm.com
Thu Mar 20 03:48:32 UTC 2014


On Wed, Mar 19, 2014 at 01:04:55PM +0000, WICKES, ROGER wrote:
> > On 03/18/2014 07:54 AM, Qiming Teng wrote:
> >> Hi, Folks,
> >>
> >>    I have been trying to implement a HACluster resource type in Heat. I
> >> haven't created a BluePrint for this because I am not sure everything
> >> will work as expected.
> ...
> >>    The most difficult issue here is to come up with a reliable VM failure
> >> detection mechanism.  The service_group feature in Nova is only
> >> concerned with the OpenStack services themselves, not the VMs.  Considering that
> >> in our customer's cloud environment, user provided images can be used,
> >> we cannot assume some agents in the VMs to send heartbeat signals.
> 
> [Roger] My response is more user-oriented than developer-oriented,
> but this was asked on dev, so here goes:
> 
> When enabled, the hypervisor is always collecting (and sending to 
> Ceilometer) basic cpu, memory stats that you can alarm on. 
> http://docs.openstack.org/trunk/openstack-ops/content/logging_monitoring.html

[Qiming]
We are currently looking into this, but we are not sure we can conclude
that a VM has failed just because we stop receiving CPU or memory stats
from Ceilometer.  How confident can we be?
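
To make the question concrete, here is a minimal sketch of how missing
stats could be turned into a verdict.  It does not call Ceilometer at
all; it only assumes we already know the timestamp of the last sample
seen for a VM.  The function names, the 60-second polling interval and
the threshold of 3 missed intervals are all hypothetical, chosen for
illustration.  The point is that one late sample should not be read as
a failure, only a run of missed intervals should raise suspicion.

```python
from datetime import datetime, timedelta


def missed_intervals(last_sample, now, interval):
    """Return how many whole polling intervals have elapsed since last_sample."""
    age = (now - last_sample).total_seconds()
    return max(0, int(age // interval.total_seconds()))


def classify(last_sample, now, interval=timedelta(seconds=60), threshold=3):
    """Classify a VM from the age of its most recent stats sample.

    The interval and threshold defaults are assumptions, not values
    taken from any OpenStack component.
    """
    missed = missed_intervals(last_sample, now, interval)
    if missed == 0:
        return "ok"        # a sample arrived within the current interval
    if missed < threshold:
        return "late"      # could be a slow agent or a busy message bus
    return "suspect"       # several intervals missed; still not proof of failure


if __name__ == "__main__":
    now = datetime(2014, 3, 20, 4, 0, 0)
    for seconds in (30, 90, 200):
        print(seconds, classify(now - timedelta(seconds=seconds), now))
```

Even in the "suspect" case the cause may be the collector or the
network rather than the VM, which is why the result is a suspicion and
not a failure.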

> For external monitoring, consider setting up a Nagios or Selenium server 
> for agent-less monitoring. You can have it do the most basic heartbeat 
> (ping) test; if the ping is slow for a period of say five minutes, or fails, alarm 
> that you have a network problem. You can use Selenium to execute synthetic
> transactions against whatever the server is supposed to provide; if it does it
> for you, you can assume it is doing it for everyone else. If it fails, you can take action.
> http://www.seleniumhq.org
> You can also use Selenium to re-run selected OpenStack test cases to ensure your 
> infrastructure is working properly.

[Qiming]
Thanks for the pointer.  We are actually looking into some automated
testing software, including cloudbench.  However, I am still reluctant
to introduce additional software into an OpenStack deployment just for
VM failure detection.  For operators it means a lot of extra work in
setup, upgrades and management.  If I talk to our customers about this,
they will say OpenStack does not compare well with vCenter (vSphere VM
HA), CloudStack (VM cluster) or Windows Azure (VM HA).

> >>    I have checked the 'instances' table in the Nova database; it seems
> >> that the 'updated_at' column is only updated when the VM state changes
> >> and is reported.  If 'heartbeat' messages are coming in from many VMs
> >> very frequently, there could be a DB query performance/scalability
> >> issue, right?
> 
> [Roger] For time-series, high-volume collection, consider going to a non-relational 
> system like RRDTool, PyRRD, Graphite, etc. if you want to store the history and look 
> for trends. 

[Qiming]
Good suggestion.  Will get back to this when scalability/performance
proves to be an issue.

> >>    So, how can I detect VM failures reliably, so that I can notify Heat
> >> to take the appropriate recovery action?
> 
> [Roger] When Nagios detects a problem, have it kick off the appropriate script
> (shell script) that invokes the Heat API or other to fix the issue with the cluster. 
> I think you were hoping that Heat could be coded to automagically fix any issue, 
> but I think you may need to be more specific; develop specific use cases for what 
> you mean by "VM failure", as the desired action may be different depending on 
> the type of failure. 

[Qiming]
We will work on this as a prototype, but at the end of the day, I really
want to see VM failure detection become an integral part of OpenStack
(be it nova, ceilometer, or somewhere else).  I agree that we probably
need an open design for what 'VM failure' means in different setups and
for different workloads.  Heartbeat is only one of the options; other
options may include CPU utilization, disk activity and network I/O.
So ... I doubt whether a 'reliable' way exists at all.
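
As a rough illustration of combining these options, the sketch below
takes whichever signals happen to be available for a VM and lets them
vote.  Everything in it is hypothetical: the Signals fields, the rule
that any sign of activity means "alive", and the minimum of two silent
signals before a VM is called "suspect".  It is not an existing Nova or
Ceilometer interface, only a way to show why a single signal is weak
evidence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    """Latest observations for one VM; None means the signal is unavailable."""
    heartbeat_ok: Optional[bool] = None   # e.g. ping answered (if allowed)
    cpu_util: Optional[float] = None      # percent
    disk_io: Optional[float] = None       # requests per second
    net_io: Optional[float] = None        # bytes per second


def verdict(s, min_votes=2):
    """Combine the available signals into 'alive', 'suspect' or 'unknown'."""
    votes = []
    if s.heartbeat_ok is not None:
        votes.append(s.heartbeat_ok)
    for value in (s.cpu_util, s.disk_io, s.net_io):
        if value is not None:
            votes.append(value > 0.0)     # any activity counts as a sign of life
    if any(votes):
        return "alive"
    if len(votes) >= min_votes:
        return "suspect"                  # several signals, all silent
    return "unknown"                      # too little evidence either way


if __name__ == "__main__":
    print(verdict(Signals(heartbeat_ok=False, cpu_util=0.0)))
    print(verdict(Signals(cpu_util=0.0)))
    print(verdict(Signals(heartbeat_ok=False, cpu_util=3.5)))
```

A hung guest can still burn CPU and an idle guest can show no disk or
network I/O, so the right rule depends on the workload.  That is the
part that needs an agreed definition.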



