[openstack-dev] [nova][service group]improve host state detection
John Garbutt
john at johngarbutt.com
Mon Apr 28 14:33:00 UTC 2014
On 28 April 2014 13:30, Jiangying (Jenny) <jenny.jiangying at huawei.com> wrote:
> Nova can currently detect that a host is unreachable, but it cannot
> distinguish between host isolation, a dead host, and the nova compute
> service being down. When a host is reported as unreachable, users have to
> work out the exact state themselves and then take the appropriate recovery
> action. We’d therefore like to improve host state detection in Nova.
>
> Currently the service group API factors out the host detection and makes it
> a set of abstract internal APIs with a pluggable backend implementation. The
> backend we designed is as follows:
>
> A central detection agent is introduced. When a member joins the service
> group, the member host starts sending network heartbeats to the central
> agent and periodically writes a timestamp to shared storage. When the
> central agent stops receiving network heartbeats from a member, it pings
> the member and checks the storage heartbeat before declaring that the host
> has failed.
>
> --------------------------------------------------------------------------------------------------------------------
> network heartbeat | network ping | storage heartbeat | state               | reason
> ------------------|--------------|-------------------|---------------------|----------------------------------------
> OK                | -            | -                 | Running             | -
> Not OK            | Not OK       | Not OK            | Dead                | hardware failure/abnormal host shutdown
> Not OK            | OK           | Not OK            | Service unreachable | service process crashed
> Not OK            | Not OK       | OK                | Isolated            | network unreachable
> --------------------------------------------------------------------------------------------------------------------
>
> Based on the state recognition table, Nova can determine the exact host
> state and the reason for it.
>
> Thoughts?
I don't think Nova should try to include functionality that
re-implements other good monitoring tools (Nagios, etc.).
That said, a new service group API that uses information from
external tools to decide whether a host is dead, and describes
why, may be worth considering.
John
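
For reference, the state recognition table in the quoted proposal could be
expressed roughly as the Python sketch below. This is only an illustration
of the table's logic, not code from Nova or from the proposed backend. The
names HostState, Diagnosis and classify_host are hypothetical. The table
does not say what happens when the network heartbeat is lost but both the
ping and the storage heartbeat succeed, so the sketch assumes an extra
"Unknown" state for that case.

```python
from enum import Enum
from typing import NamedTuple, Optional


class HostState(Enum):
    """Host states from the proposal's table (names are hypothetical)."""
    RUNNING = "Running"
    DEAD = "Dead"
    SERVICE_UNREACHABLE = "Service unreachable"
    ISOLATED = "Isolated"
    # Assumption: not in the original table; covers the unlisted combination.
    UNKNOWN = "Unknown"


class Diagnosis(NamedTuple):
    state: HostState
    reason: Optional[str]


def classify_host(network_heartbeat_ok: bool,
                  network_ping_ok: bool,
                  storage_heartbeat_ok: bool) -> Diagnosis:
    """Map the three probe results to a state and reason, row by row."""
    # Row 1: heartbeats still arrive, so the other probes are not consulted.
    if network_heartbeat_ok:
        return Diagnosis(HostState.RUNNING, None)
    # Row 2: nothing responds at all.
    if not network_ping_ok and not storage_heartbeat_ok:
        return Diagnosis(HostState.DEAD,
                         "hardware failure/abnormal host shutdown")
    # Row 3: the host answers pings but no longer writes timestamps.
    if network_ping_ok and not storage_heartbeat_ok:
        return Diagnosis(HostState.SERVICE_UNREACHABLE,
                         "service process crashed")
    # Row 4: the host still writes timestamps but is unreachable by network.
    if not network_ping_ok and storage_heartbeat_ok:
        return Diagnosis(HostState.ISOLATED, "network unreachable")
    # Heartbeat lost, ping OK, storage heartbeat OK: not covered by the table.
    return Diagnosis(HostState.UNKNOWN, None)
```

A caller would pass in the three probe results it has collected, for example
classify_host(False, True, False) for a host that answers pings but has
stopped sending both kinds of heartbeat.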