[openstack-dev] [nova][service group]improve host state detection
Jay Pipes
jaypipes at gmail.com
Mon Apr 28 14:44:47 UTC 2014
cc'ing Intel and Ericsson engineers who are interested in a similar
plan...
On Mon, 2014-04-28 at 15:33 +0100, John Garbutt wrote:
> On 28 April 2014 13:30, Jiangying (Jenny) <jenny.jiangying at huawei.com> wrote:
> > Nova can now detect that a host is unreachable, but it cannot distinguish
> > between host isolation, a dead host, and a nova compute service that is
> > down. When a host is reported unreachable, users have to work out the exact
> > state themselves and then take the appropriate measures to recover.
> > Therefore we'd like to improve host state detection in nova.
> >
> > Currently the service group API factors out the host detection and makes it
> > a set of abstract internal APIs with a pluggable backend implementation. The
> > backend we designed is as follows:
> >
> > A central detection agent is introduced. When a member joins the
> > service group, the member host starts sending network heartbeats to the
> > central agent and periodically writes a timestamp to shared storage. When
> > the central agent stops receiving network heartbeats from a member, it pings
> > the member and checks the storage heartbeat before declaring the host to
> > have failed.
> >
> > network heartbeat | network ping | storage heartbeat | state               | reason
> > ------------------|--------------|-------------------|---------------------|-----------------------------------------
> > OK                | -            | -                 | Running             | -
> > Not OK            | Not OK       | Not OK            | Dead                | hardware failure/abnormal host shut down
> > Not OK            | OK           | Not OK            | Service unreachable | service process crashed
> > Not OK            | Not OK       | OK                | Isolated            | network unreachable
> >
> > Based on the state recognition table, nova can discern the exact host state
> > and assign the reasons.
> >
> > Thoughts?
>
> I don't think Nova should try to include functionality that
> re-implements other good monitoring tools (Nagios, etc.).
Agreed.
> Having said that, having a new service group API that uses information
> from external tools to decide if a host is dead or not, and describes
> why, is maybe worth considering.
Also agreed.
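To make the "decide if a host is dead or not, and describe why" part
concrete, here is a rough sketch of the state table quoted above as a pure
function over three boolean signals, wherever those signals come from
(an external monitoring tool or otherwise). The names HostState, Diagnosis
and classify_host are made up for illustration and are not part of the
servicegroup API. The UNKNOWN state is also my assumption: the table does not
say what to report when the heartbeat is missing but both ping and storage
heartbeat are OK.

```python
from enum import Enum
from typing import NamedTuple, Optional


class HostState(Enum):
    RUNNING = "Running"
    DEAD = "Dead"
    SERVICE_UNREACHABLE = "Service unreachable"
    ISOLATED = "Isolated"
    # Assumption: not in the proposed table; used for the uncovered combination.
    UNKNOWN = "Unknown"


class Diagnosis(NamedTuple):
    state: HostState
    reason: Optional[str]


def classify_host(heartbeat_ok: bool, ping_ok: bool, storage_ok: bool) -> Diagnosis:
    """Map the three signals from the proposed table to a state and a reason."""
    if heartbeat_ok:
        # Ping and storage heartbeat are not consulted while heartbeats arrive.
        return Diagnosis(HostState.RUNNING, None)
    if not ping_ok and not storage_ok:
        return Diagnosis(HostState.DEAD, "hardware failure/abnormal host shut down")
    if ping_ok and not storage_ok:
        return Diagnosis(HostState.SERVICE_UNREACHABLE, "service process crashed")
    if not ping_ok and storage_ok:
        return Diagnosis(HostState.ISOLATED, "network unreachable")
    # Heartbeat missing, but ping OK and storage OK: the table has no row for this.
    return Diagnosis(HostState.UNKNOWN, None)
```

Keeping the classification separate from signal collection means the
inputs could be fed from whatever tool already gathers them, without Nova
re-implementing the monitoring itself.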
FYI, related blueprint from Ericsson:
https://review.openstack.org/#/c/87978/
I am -1 on the above blueprint, not because I don't see the value in
having NIC state play a part in service group management, but because I
don't see a reason to have the resource tracker (which manages resource
usage, not state) or the scheduler implement agent state checks.
Best,
-jay