[openstack-dev] [nova][service group]improve host state detection

Jay Pipes jaypipes at gmail.com
Mon Apr 28 14:44:47 UTC 2014


cc'ing Intel and Ericsson engineers who are interested in a similar
plan...

On Mon, 2014-04-28 at 15:33 +0100, John Garbutt wrote:
> On 28 April 2014 13:30, Jiangying (Jenny) <jenny.jiangying at huawei.com> wrote:
> > Nova can now detect that a host is unreachable, but it cannot tell apart
> > host isolation, a dead host, and a nova-compute service that is down. When
> > a host is reported as unreachable, users have to work out the exact state
> > themselves and then take the appropriate measures to recover. We would
> > therefore like to improve host state detection in Nova.
> >
> > Currently the service group API factors out the host detection and makes it
> > a set of abstract internal APIs with a pluggable backend implementation. The
> > backend we designed is as follows:
> >
> > A central detection agent is introduced. When a member joins the service
> > group, the member host starts sending network heartbeats to the central
> > agent and periodically writes a timestamp to shared storage. When the
> > central agent stops receiving network heartbeats from a member, it pings
> > the member and checks the storage heartbeat before declaring the host to
> > have failed.
> >
> > network heartbeat | network ping | storage heartbeat | state               | reason
> > ------------------|--------------|-------------------|---------------------|-----------------------------------------
> > OK                | -            | -                 | Running             | -
> > Not OK            | Not OK       | Not OK            | Dead                | hardware failure/abnormal host shut down
> > Not OK            | OK           | Not OK            | Service unreachable | service process crashed
> > Not OK            | Not OK       | OK                | Isolated            | network unreachable
> >
> > Based on the state recognition table, nova can discern the exact host state
> > and assign the reasons.
> >
> > Thoughts?
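
For reference, the state recognition table in the quoted proposal can be read
as a small lookup keyed on the three probe results. The following is only a
minimal Python sketch of that reading: the names HostState and classify_host
are hypothetical and are not part of Nova's service group API, and the
UNKNOWN state is an assumption added here for the one combination the table
does not list (heartbeat lost, but ping and storage heartbeat both OK).

```python
from enum import Enum


class HostState(Enum):
    """Host states from the proposal's table (UNKNOWN is an added assumption)."""
    RUNNING = "Running"
    DEAD = "Dead"
    SERVICE_UNREACHABLE = "Service unreachable"
    ISOLATED = "Isolated"
    UNKNOWN = "Unknown"  # combination not covered by the table


# (ping_ok, storage_heartbeat_ok) -> state.
# Consulted only after the network heartbeat has been lost.
_STATE_TABLE = {
    (False, False): HostState.DEAD,                # hardware failure / abnormal shut down
    (True, False): HostState.SERVICE_UNREACHABLE,  # service process crashed
    (False, True): HostState.ISOLATED,             # network unreachable
}


def classify_host(heartbeat_ok, ping_ok, storage_heartbeat_ok):
    """Map the three probe results to a host state, following the table."""
    if heartbeat_ok:
        # The table ignores ping and storage while heartbeats still arrive.
        return HostState.RUNNING
    return _STATE_TABLE.get((ping_ok, storage_heartbeat_ok), HostState.UNKNOWN)


if __name__ == "__main__":
    # Heartbeat and ping lost, storage timestamp still fresh.
    print(classify_host(False, False, True).value)
```

Keeping the mapping in a dictionary rather than nested conditionals makes the
uncovered combination explicit instead of silently folding it into one of the
listed states.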
> 
> I don't think Nova should try to include functionality that
> re-implements other good monitoring tools (Nagios, etc.).

Agreed.

> Having said that, having a new service group API that uses information
> from external tools to decide if a host is dead or not, and describes
> why, is maybe worth considering.

Also agreed.

FYI, related blueprint from Ericsson: 

https://review.openstack.org/#/c/87978/

I am -1 on the above blueprint not because I don't see the value in
having NIC state play a part in service group management, but because I
don't see a reason to have the resource tracker (which manages resource
usage, not state) or scheduler implement agent state checks.

Best,
-jay



