[Openstack-operators] [HA] RFC: user story including hypervisor reservation / host maintenance / storage AZs / event history (fwd)

Adam Spiers aspiers at suse.com
Mon Jun 13 09:18:56 UTC 2016


Hi all,

Apologies for not thinking to Cc this openstack-operators list first
time round when I sent the below mail!  It concerns four usage
scenarios which all principally involve cloud operators, so with
hindsight that was a really stupid omission :-/

I would be very interested to hear both:

  a) whether you think our proposal to create a new user story for
     each of these (four in total) makes sense, and

  b) feedback on any of the individual usage scenarios.

Thanks a lot!
Adam

----- Forwarded message from Adam Spiers <aspiers at suse.com> -----

Date: Wed, 8 Jun 2016 00:19:52 +0100
From: Adam Spiers <aspiers at suse.com>
To: openstack-dev mailing list <openstack-dev at lists.openstack.org>
Cc: OpenStack Product Working Group list <product-wg at lists.openstack.org>
Subject: [openstack-dev] [HA] RFC: user story including hypervisor reservation / host maintenance / storage AZs / event history
Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>

[Cc'ing product-wg@ - when replying, first please consider whether
cross-posting is appropriate]

Hi all,

Currently the OpenStack HA community is putting a lot of effort into
converging on a single upstream solution for high availability of VMs
and hypervisors[0], and we had a lot of very productive discussions in
Austin on this topic[1].

One of the first areas of focus is the high-level user story:

   http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html

In particular, there is an open review on which we could use some
advice from the wider community.  The review proposes adding four
extra usage scenarios to the existing user story.  All of these
scenarios are to some degree related to HA of VMs and hypervisors,
but none of them exclusively: they all have scope extending to
other areas beyond HA.  Here's a very brief summary of all four, as
they relate to HA:

1. "Sticky" shared storage zones

   Scenario: all compute hosts have access to exactly one shared
   storage "availability zone" (potentially independent of the normal
   availability zones).  For example, there could be multiple NFS
   servers, and every compute host has /var/lib/nova/instances mounted
   to one of them.  On first boot, each VM is *implicitly* assigned to
   a zone, depending on which compute host nova-scheduler picks for it
   (so this could be more or less random).  Subsequent operations such
   as "nova evacuate" would need to ensure the VM only ever moves to
   other hosts in the same zone.

2. Hypervisor reservation

   The operator wants a mechanism for reserving some compute hosts
   exclusively for use as failover hosts on which to automatically
   resurrect VMs from other failed compute nodes.

3. Host maintenance

   The operator wants a mechanism for flagging hosts as undergoing
   maintenance, so that the HA mechanisms for automatic recovery are
   temporarily disabled during the maintenance window.

4. Event history

   The operator wants a way to retrieve the history of automatic HA
   recovery actions: what was done, when, where and how.
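
To make the four scenarios a little more concrete, here is a minimal
Python sketch of how they might interact in a single recovery
decision.  It is purely illustrative: Host, RecoveryEvent,
RecoveryController and all of their fields are hypothetical names, not
part of Nova or any existing API.  It also assumes that reserved hosts
are preferred as failover targets, with a fallback to ordinary hosts
in the same storage zone, which is only one possible policy.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class Host:
    name: str
    storage_zone: str          # scenario 1: hypothetical label, e.g. the NFS server in use
    reserved: bool = False     # scenario 2: set aside as a failover host
    maintenance: bool = False  # scenario 3: operator has flagged the host
    up: bool = True


@dataclass
class RecoveryEvent:
    """Scenario 4: one entry in the recovery history."""
    when: datetime
    vm: str
    source: str
    target: Optional[str]
    outcome: str


@dataclass
class RecoveryController:
    hosts: List[Host]
    history: List[RecoveryEvent] = field(default_factory=list)

    def pick_target(self, failed: Host) -> Optional[Host]:
        # Only healthy hosts in the same storage zone can see the VM's disks.
        candidates = [
            h for h in self.hosts
            if h is not failed and h.up and not h.maintenance
            and h.storage_zone == failed.storage_zone
        ]
        # Assumed policy: reserved hosts first (stable sort keeps list order).
        candidates.sort(key=lambda h: not h.reserved)
        return candidates[0] if candidates else None

    def recover(self, vm: str, failed: Host) -> Optional[Host]:
        if failed.maintenance:
            # Automatic recovery is disabled during the maintenance window.
            target, outcome = None, "skipped: source host in maintenance"
        else:
            target = self.pick_target(failed)
            outcome = "evacuated" if target else "failed: no usable host in zone"
        self.history.append(RecoveryEvent(
            when=datetime.now(timezone.utc), vm=vm, source=failed.name,
            target=target.name if target else None, outcome=outcome))
        return target
```

In this sketch, a VM on a failed host in zone "nfs-a" would only ever
be moved to another "nfs-a" host, and every decision (including a
skipped recovery) leaves an entry in history for the operator to
query later.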

And here's the review in question:

   https://review.openstack.org/#/c/318431/

My first instinct was that all of these scenarios are sufficiently
independent and complex, and extend far enough outside HA scope, that
they deserve to live in four separate user stories, rather than adding
them to our existing "HA for VMs" user story.  This could also
maximise the chances of converging on a single upstream solution for
each which works both inside and outside HA contexts.  (Please read
the review's comments for much more detail on these arguments.)

However, others made the very valid point that since there are
elements of all these stories which are indisputably related to HA for
VMs, we still need the existing user story for HA VMs to cover them,
so that it can provide "the big picture" which will tie together all
the different strands of work it requires.

So we are currently proposing to take the following steps:

 - Propose a new user story for each of the four scenarios above.

 - Link to the new stories from the "Related User Stories" section of
   the existing HA VMs story.

 - Extend the existing story so that it covers the HA-specific aspects of
   the four cases, leaving any non-HA aspects to be covered by the newly
   linked stories.

Then each story would go through the standard workflow defined by the PWG:

   https://wiki.openstack.org/wiki/ProductTeam/User_Stories

Does this sound reasonable, or is there a better way?

BTW, whilst this email is primarily asking for advice on the process,
feedback on each story is also welcome, whether it's "good idea", "you
can already do that", or "terrible idea!" ;-)  However please first
read the comments on the above review, as the obvious points have
probably already been covered :-)

Thanks a lot!

Adam

[0] A complete description of the problem area and existing solutions
    was given in this talk:

      https://www.openstack.org/videos/video/high-availability-for-pets-and-hypervisors-state-of-the-nation

[1] https://etherpad.openstack.org/p/newton-instance-ha


----- End forwarded message -----