Open Stack

Mon Jun 13 10:52:29 UTC 2016

Hi,

I'm working on the OPNFV Doctor project, which covers fault management and maintenance for NFV. The goal of the project is to build a fault management and maintenance framework for high availability of Network Services on top of virtualized infrastructure.
https://wiki.opnfv.org/display/doctor

Work has already landed in OpenStack that makes it possible to detect failures quickly, change states in OpenStack (Nova), add state information that was missing, and expose that information to the owner of a VM. An alarm is also triggered. With all this, one can now rely on the states and be notified of faults in a split second. Naturally, this assumes the system is configured to monitor the different faults and to act based on configured policies, or to leave some actions to the consumers of the raised alarms.
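
To make the "act based on configured policies, or leave it to the consumer" idea concrete, here is a minimal sketch in Python. It is only an illustration: the names FaultAlarm, POLICIES and decide_action, as well as the fault types and actions, are hypothetical and do not come from Nova, Doctor or any other OpenStack project.

```python
from dataclasses import dataclass

# Hypothetical policy table mapping a fault type to an action.
# These keys and values are illustrative assumptions only.
POLICIES = {
    "host_down": "evacuate",
    "nic_failure": "migrate",
}

# Fallback when no policy is configured: hand the alarm to its consumer.
DEFAULT_ACTION = "notify_consumer"


@dataclass
class FaultAlarm:
    """A raised alarm, reduced to the two fields this sketch needs."""
    host: str
    fault_type: str


def decide_action(alarm, policies=POLICIES):
    """Return the configured action, or leave the decision to the consumer."""
    return policies.get(alarm.fault_type, DEFAULT_ACTION)


if __name__ == "__main__":
    print(decide_action(FaultAlarm("compute-1", "host_down")))
    print(decide_action(FaultAlarm("compute-2", "disk_warning")))
```

The point of the fallback is that a fault without a configured policy is still reported, and the consumer of the alarm decides what to do.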

For maintenance, I had a session in Austin to talk with Ops and Nova core about the maintenance part. It turned out that Nova did not want more specific information about host maintenance (maintenance state, maintenance window, ...), so as a result of the discussion there is a spec, which has now been moved to Ocata: https://review.openstack.org/310510/
The spec proposes a link in Nova to an external tool that provides more specific information about host (compute) maintenance, and according to the latest comments the same place could hold any other host-specific information (for example the event history you mentioned). Still, when considering this kind of tool, why not make it configurable for whatever suits different operator scenarios, such as automated operations if so desired? In any case, a project like Nova does not want big new functionality, so all "more complex flows" should reside somewhere outside it.
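
As a rough sketch of what such an external tool might keep per host, the Python below models a maintenance record with a state, a maintenance window and a free-form slot for extra information such as event history. This is an assumption made for illustration, not the design in the spec: HostMaintenance, its fields, the state values and in_maintenance_window are all hypothetical names.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class HostMaintenance:
    """Hypothetical per-host record held outside Nova."""
    host: str
    state: str                  # e.g. "planned", "in_progress", "done" (assumed values)
    window_start: datetime
    window_end: datetime
    extra: dict = field(default_factory=dict)   # e.g. event history


def in_maintenance_window(record, now):
    """True if 'now' falls inside the record's maintenance window."""
    return record.window_start <= now < record.window_end


if __name__ == "__main__":
    record = HostMaintenance(
        host="compute-1",
        state="planned",
        window_start=datetime(2016, 6, 20, 22, 0, tzinfo=timezone.utc),
        window_end=datetime(2016, 6, 21, 2, 0, tzinfo=timezone.utc),
    )
    print(in_maintenance_window(record, datetime(2016, 6, 20, 23, 0, tzinfo=timezone.utc)))
```

Automated operations could then be built on top of such a check, for example by skipping automatic recovery for a host while it is inside its window.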

Br,
Tomi

> -----Original Message-----
> From: Adam Spiers [mailto:aspiers at suse.com]
> Sent: Monday, June 13, 2016 12:19 PM
> To: openstack-operators mailing list <openstack-operators at lists.openstack.org>
> Subject: [Openstack-operators] [HA] RFC: user story including hypervisor
> reservation / host maintenance / storage AZs / event history (fwd)
> 
> Hi all,
> 
> Apologies for not thinking to Cc this openstack-operators list first
> time round when I sent the below mail!  It concerns four usage
> scenarios which all principally involve cloud operators, so with
> hindsight that was a really stupid omission :-/
> 
> I would be very interested to hear both:
> 
>   a) whether you think our proposal to create four new user stories
>      for each of these makes sense, and
> 
>   b) feedback on any of the individual usage scenarios.
> 
> Thanks a lot!
> Adam
> 
> ----- Forwarded message from Adam Spiers <aspiers at suse.com> -----
> 
> Date: Wed, 8 Jun 2016 00:19:52 +0100
> From: Adam Spiers <aspiers at suse.com>
> To: openstack-dev mailing list <openstack-dev at lists.openstack.org>
> Cc: OpenStack Product Working Group list <product-wg at lists.openstack.org>
> Subject: [openstack-dev] [HA] RFC: user story including hypervisor
> reservation / host maintenance / storage AZs / event history
> Reply-To: "OpenStack Development Mailing List (not for usage questions)"
> <openstack-dev at lists.openstack.org>
> 
> [Cc'ing product-wg@ - when replying, first please consider whether
> cross-posting is appropriate]
> 
> Hi all,
> 
> Currently the OpenStack HA community is putting a lot of effort into
> converging on a single upstream solution for high availability of VMs
> and hypervisors[0], and we had a lot of very productive discussions in
> Austin on this topic[1].
> 
> One of the first areas of focus is the high level user story:
> 
>    http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html
> 
> In particular, there is an open review on which we could use some
> advice from the wider community.  The review proposes adding four
> extra usage scenarios to the existing user story.  All of these
> scenarios are to some degree related to HA of VMs and hypervisors,
> however none of them exclusively - they all have scope extending to
> other areas beyond HA.  Here's a very brief summary of all four, as
> they relate to HA:
> 
> 1. "Sticky" shared storage zones
> 
>    Scenario: all compute hosts have access to exactly one shared
>    storage "availability zone" (potentially independent of the normal
>    availability zones).  For example, there could be multiple NFS
>    servers, and every compute host has /var/lib/nova/instances mounted
>    to one of them.  On first boot, each VM is *implicitly* assigned to
>    a zone, depending on which compute host nova-scheduler picks for it
>    (so this could be more or less random).  Subsequent operations such
>    as "nova evacuate" would need to ensure the VM only ever moves to
>    other hosts in the same zone.
> 
> 2. Hypervisor reservation
> 
>    The operator wants a mechanism for reserving some compute hosts
>    exclusively for use as failover hosts on which to automatically
>    resurrect VMs from other failed compute nodes.
> 
> 3. Host maintenance
> 
>    The operator wants a mechanism for flagging hosts as undergoing
>    maintenance, so that the HA mechanisms for automatic recovery are
>    temporarily disabled during the maintenance window.
> 
> 4. Event history
> 
>    The operator wants a way to retrieve the history of what, when,
>    where and how the HA automatic recovery mechanism is performed.
> 
> And here's the review in question:
> 
>    https://review.openstack.org/#/c/318431/
> 
> My first instinct was that all of these scenarios are sufficiently
> independent, complex, and extend far enough outside HA scope, that
> they deserve to live in four separate user stories, rather than adding
> them to our existing "HA for VMs" user story.  This could also
> maximise the chances of converging on a single upstream solution for
> each which works both inside and outside HA contexts.  (Please read
> the review's comments for much more detail on these arguments.)
> 
> However, others made the very valid point that since there are
> elements of all these stories which are indisputably related to HA for
> VMs, we still need the existing user story for HA VMs to cover them,
> so that it can provide "the big picture" which will tie together all
> the different strands of work it requires.
> 
> So we are currently proposing to take the following steps:
> 
>  - Propose four new user stories for each of the above scenarios.
> 
>  - Link to the new stories from the "Related User Stories" section of
>    the existing HA VMs story.
> 
>  - Extend the existing story so that it covers the HA-specific aspects of
>    the four cases, leaving any non-HA aspects to be covered by the newly
>    linked stories.
> 
> Then each story would go through the standard workflow defined by the PWG:
> 
>    https://wiki.openstack.org/wiki/ProductTeam/User_Stories
> 
> Does this sound reasonable, or is there a better way?
> 
> BTW, whilst this email is primarily asking for advice on the process,
> feedback on each story is also welcome, whether it's "good idea", "you
> can already do that", or "terrible idea!" ;-)  However please first
> read the comments on the above review, as the obvious points have
> probably already been covered :-)
> 
> Thanks a lot!
> 
> Adam
> 
> [0] A complete description of the problem area and existing solutions
>     was given in this talk:
> 
>       https://www.openstack.org/videos/video/high-availability-for-pets-and-hypervisors-state-of-the-nation
> 
> [1] https://etherpad.openstack.org/p/newton-instance-ha
> 
> 
> ----- End forwarded message -----
> 
