[Product] [HA] RFC: user story including hypervisor reservation / host maintenance / storage AZs / event history

Adam Spiers aspiers at suse.com
Tue Jun 7 23:19:52 UTC 2016


[Cc'ing product-wg@ - when replying, first please consider whether
cross-posting is appropriate]

Hi all,

Currently the OpenStack HA community is putting a lot of effort into
converging on a single upstream solution for high availability of VMs
and hypervisors[0], and we had a lot of very productive discussions in
Austin on this topic[1].

One of the first areas of focus is the high level user story:

   http://specs.openstack.org/openstack/openstack-user-stories/user-stories/proposed/ha_vm.html

In particular, there is an open review on which we could use some
advice from the wider community.  The review proposes adding four
extra usage scenarios to the existing user story.  All of these
scenarios are to some degree related to HA of VMs and hypervisors,
but none of them exclusively: each has scope extending to areas
beyond HA.  Here's a very brief summary of all four, as they relate
to HA:

1. "Sticky" shared storage zones

   Scenario: each compute host has access to exactly one shared
   storage "availability zone" (potentially independent of the normal
   availability zones).  For example, there could be multiple NFS
   servers, with every compute host mounting /var/lib/nova/instances
   from one of them.  On first boot, each VM is *implicitly* assigned
   to a zone, depending on which compute host nova-scheduler picks for
   it (so this could be more or less random).  Subsequent operations
   such as "nova evacuate" would need to ensure the VM only ever moves
   to other hosts in the same zone.

2. Hypervisor reservation

   The operator wants a mechanism for reserving some compute hosts
   exclusively for use as failover hosts on which to automatically
   resurrect VMs from other failed compute nodes.

3. Host maintenance

   The operator wants a mechanism for flagging hosts as undergoing
   maintenance, so that the HA mechanisms for automatic recovery are
   temporarily disabled during the maintenance window.

4. Event history

   The operator wants a way to retrieve the history of automatic HA
   recovery actions: what was done, when, where, and how.
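
To make the HA-specific side of these four scenarios a bit more
concrete, here is a rough Python sketch of how they might interact
when a compute host fails.  Everything in it is hypothetical and for
illustration only: Host, RecoveryLog, pick_failover_host() and
recover() are made-up names, not Nova APIs, and preferring reserved
hosts over ordinary ones is just one assumed policy.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Host:
    # Hypothetical model of a compute host; not a Nova object.
    name: str
    storage_zone: str          # scenario 1: shared storage it mounts
    reserved: bool = False     # scenario 2: kept free for failover
    maintenance: bool = False  # scenario 3: recovery suspended
    alive: bool = True


@dataclass
class RecoveryLog:
    # Scenario 4: in-memory history of recovery actions.
    events: list = field(default_factory=list)

    def record(self, vm, source, target, action):
        self.events.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "what": action, "vm": vm, "from": source, "to": target,
        })


def pick_failover_host(failed, hosts):
    """Pick a live host in the failed host's storage zone."""
    candidates = [h for h in hosts
                  if h is not failed and h.alive and not h.maintenance
                  and h.storage_zone == failed.storage_zone]
    # Assumed policy: reserved hosts sort first, then by name.
    candidates.sort(key=lambda h: (not h.reserved, h.name))
    return candidates[0] if candidates else None


def recover(vm, failed, hosts, log):
    """Choose a recovery target for vm, logging the decision."""
    if failed.maintenance:
        log.record(vm, failed.name, None,
                   "skipped: host under maintenance")
        return None
    target = pick_failover_host(failed, hosts)
    if target is None:
        log.record(vm, failed.name, None,
                   "failed: no eligible host in zone")
        return None
    log.record(vm, failed.name, target.name, "evacuated")
    return target
```

For example, with hosts c1 and c2 in zone "nfs-a", a reserved host c3
also in "nfs-a", and a reserved host c4 in "nfs-b", a failure of c1
would pick c3 and never c4, and flagging c1 as under maintenance
beforehand would skip recovery entirely while still logging why.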

And here's the review in question:

   https://review.openstack.org/#/c/318431/

My first instinct was that all of these scenarios are sufficiently
independent and complex, and extend far enough outside HA scope, that
they deserve to live in four separate user stories rather than being
added to our existing "HA for VMs" user story.  This could also
maximise the chances of converging on a single upstream solution for
each which works both inside and outside HA contexts.  (Please read
the review's comments for much more detail on these arguments.)

However, others made the very valid point that since there are
elements of all these stories which are indisputably related to HA for
VMs, we still need the existing user story for HA VMs to cover them,
so that it can provide "the big picture" which will tie together all
the different strands of work it requires.

So we are currently proposing to take the following steps:

 - Propose four new user stories, one for each of the above scenarios.

 - Link to the new stories from the "Related User Stories" section of
   the existing HA VMs story.

 - Extend the existing story so that it covers the HA-specific aspects of
   the four cases, leaving any non-HA aspects to be covered by the newly
   linked stories.

Then each story would go through the standard workflow defined by the PWG:

   https://wiki.openstack.org/wiki/ProductTeam/User_Stories

Does this sound reasonable, or is there a better way?

BTW, whilst this email is primarily asking for advice on the process,
feedback on each story is also welcome, whether it's "good idea", "you
can already do that", or "terrible idea!" ;-)  However, please first
read the comments on the above review, as the obvious points have
probably already been covered :-)

Thanks a lot!

Adam

[0] A complete description of the problem area and existing solutions
    was given in this talk:

      https://www.openstack.org/videos/video/high-availability-for-pets-and-hypervisors-state-of-the-nation

[1] https://etherpad.openstack.org/p/newton-instance-ha


