[openstack-dev] [nova][service group]improve host state detection

John Garbutt john at johngarbutt.com
Mon Apr 28 14:33:00 UTC 2014


On 28 April 2014 13:30, Jiangying (Jenny) <jenny.jiangying at huawei.com> wrote:
> Nova can currently detect that a host is unreachable, but it cannot
> distinguish between host isolation, a dead host and a nova compute service
> that is down. When a host is reported as unreachable, users have to work out
> the exact state themselves and then take the appropriate measures to
> recover. Therefore we'd like to improve host state detection in nova.
>
> Currently the service group API factors host detection out into a set of
> abstract internal APIs with a pluggable backend implementation. The backend
> we designed is as follows:
>
> A central detection agent is introduced. When a member joins the service
> group, the member host starts sending network heartbeats to the central
> agent and periodically writes a timestamp to shared storage. When the
> central agent stops receiving network heartbeats from a member, it pings
> the member and checks the storage heartbeat before declaring the host to
> have failed.
>
> network heartbeat | network ping | storage heartbeat | state               | reason
> ------------------|--------------|-------------------|---------------------|-----------------------------------------
> OK                | -            | -                 | Running             | -
> Not OK            | Not OK       | Not OK            | Dead                | hardware failure/abnormal host shut down
> Not OK            | OK           | Not OK            | Service unreachable | service process crashed
> Not OK            | Not OK       | OK                | Isolated            | network unreachable
>
> Based on the state recognition table, nova can discern the exact host state
> and assign a reason to it.
>
> Thoughts?

I don't think Nova should try to include functionality that
re-implements other good monitoring tools (Nagios, etc.).

Having said that, having a new service group API that uses information
from external tools to decide if a host is dead or not, and describes
why, is maybe worth considering.
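
To make that concrete, here is a rough sketch of how the three signals in
the quoted table could be mapped to a state and a reason, wherever the
signals come from (the central agent described above, or an external
monitoring tool). The names HostState, Diagnosis and classify_host are made
up for illustration and are not part of Nova or the service group API. The
table does not cover the case where the heartbeat is lost but both ping and
storage are OK, so treating that as "Unknown" is my assumption.

```python
from enum import Enum
from typing import NamedTuple, Optional


class HostState(Enum):
    """States from the quoted table, plus UNKNOWN (an assumption)."""
    RUNNING = "Running"
    DEAD = "Dead"
    SERVICE_UNREACHABLE = "Service unreachable"
    ISOLATED = "Isolated"
    UNKNOWN = "Unknown"


class Diagnosis(NamedTuple):
    state: HostState
    reason: Optional[str]


# Keyed by (network ping OK, storage heartbeat OK); only consulted once the
# network heartbeat has been lost, as in the quoted table.
_TABLE = {
    (False, False): Diagnosis(HostState.DEAD,
                              "hardware failure/abnormal host shut down"),
    (True, False): Diagnosis(HostState.SERVICE_UNREACHABLE,
                             "service process crashed"),
    (False, True): Diagnosis(HostState.ISOLATED, "network unreachable"),
}


def classify_host(heartbeat_ok: bool, ping_ok: bool,
                  storage_ok: bool) -> Diagnosis:
    """Hypothetical helper: map the three signals to a state and reason."""
    if heartbeat_ok:
        # Ping and storage are not checked while heartbeats still arrive.
        return Diagnosis(HostState.RUNNING, None)
    # The (ping OK, storage OK) combination is not in the table; fall back
    # to UNKNOWN rather than guessing.
    return _TABLE.get((ping_ok, storage_ok),
                      Diagnosis(HostState.UNKNOWN, None))
```

For example, classify_host(False, True, False) gives the "Service
unreachable" state with the reason "service process crashed".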

John


