[openstack-dev] [ptl][tc] Accessible upgrade support
sean at dague.net
Thu Oct 5 11:42:16 UTC 2017
On 10/05/2017 07:08 AM, Graham Hayes wrote:
> On Thu, 5 Oct 2017, at 09:50, Thierry Carrez wrote:
>> Matt Riedemann wrote:
>>> What's the difference between this tag and the zero-impact-upgrades tag?
>>> I guess the accessible one is, can a user still ssh into their VM while
>>> the nova compute service is being upgraded. The zero-impact-upgrade one
>>> is more to do with performance degradation during an upgrade. I'm not
>>> entirely sure what that might look like, probably need operator input.
>>> For example, while upgrading, you're live migrating VMs all over the
>>> place which is putting extra strain on the network.
>> The zero-impact-upgrade tag means no API downtime and no measurable
>> impact on performance, while the accessible-upgrade means that while
>> there can be API downtime, the resources provisioned are still
>> accessible (you can use the VM even if nova-api is down).
>> I still think we have too many of those upgrade tags, and the amount of
>> information they provide does not compensate for the confusion they create.
>> If you're not clear on what they mean, imagine a new user looking at the
>> Software Navigator...
>> In particular, we created two paths in the graph:
>> * upgrade < accessible-upgrade
>> * upgrade < rolling-upgrade < zero-downtime < zero-impact
>> I personally would get rid of zero-impact (not sure there is that much
>> additional information it conveys beyond zero-downtime).
>> If we could make the requirements of accessible-upgrade a part of
>> rolling-upgrade, that would also help (single path in the graph, only 3
>> "levels"). Is there any of the current rolling-upgrade things (cinder,
>> neutron, nova, swift) that would not qualify for accessible-upgrade as
>> well ?
> Well, there are projects (like designate) that qualify for accessible
> upgrade, but not rolling upgrade.
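
The two paths quoted above form a partial order rather than a single ladder, which is why a project can hold accessible-upgrade without rolling-upgrade. A minimal sketch of that structure follows. It assumes the shorthand tag names used in this thread (not the full governance tag names), and the IMPLIES table and helper functions are hypothetical, for illustration only:

```python
# Hypothetical encoding of the two tag paths from the thread:
#   upgrade < accessible-upgrade
#   upgrade < rolling-upgrade < zero-downtime < zero-impact
# Each tag maps to the tags it directly builds on.
IMPLIES = {
    "upgrade": set(),
    "accessible-upgrade": {"upgrade"},
    "rolling-upgrade": {"upgrade"},
    "zero-downtime": {"rolling-upgrade"},
    "zero-impact": {"zero-downtime"},
}


def implied_tags(tag):
    """Return every tag that holding `tag` implies, including itself."""
    seen = set()
    stack = [tag]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(IMPLIES[current])
    return seen


def comparable(a, b):
    """True if one tag sits above the other on the same path."""
    return a in implied_tags(b) or b in implied_tags(a)
```

With this encoding, comparable("accessible-upgrade", "rolling-upgrade") is False: neither tag says anything about the other, so knowing one tells an operator nothing about the other.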
The neutron story is mixed on accessible upgrade, because at least in
some cases, like ovs, upgrade might trigger a network tear down /
rebuild that generates an outage (though typically a pretty small one).
I still think it's hard to describe to folks what is going on without
pictures. And the tag structure might just be the wrong way to describe
the world, because they are a set of positive assertions, and upgrade
expectations are really about: "how terrible will this be".
If I were an operator, the questions I might have are:
1) Really basic, will my db roll forward?
2) When my db rolls forward, is it going to take a giant table lock that
is effectively an outage?
3) Is whatever data I created (computes, networks) going to stay up when
I do all this? (i.e. no customer workload interruption)
4) If the service is more than 1 process, can they arbitrarily work with
N-1 so I won't have a closet outage when services restart?
5) If the service runs on more than 1 host, can I mix host levels, or
will there be an outage as I upgrade nodes?
6) If the service talks to other openstack services, is there a strict
version lock-in, which means I've got to coordinate with those for
upgrade? If so, what order is that and is it clear?
7) Can I seamlessly hide my API upgrade behind HA-Proxy / Istio / (or
similar) so that there is no API service interruption?
8) Is there any substantial degradation in running "mixed mode" even if
it's supported, so that I know whether I can do this over a longer
window of time when time permits?
9) What level of validation exists to ensure that any of these "should
work" do work?
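
As a rough illustration of how these nine questions could be recorded per project, here is a minimal sketch. The UpgradeCard class, its field names, and the concerns() helper are all hypothetical and not part of any OpenStack tooling. Each field is phrased so that True is the answer an operator hopes for, and None means nobody has answered yet:

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class UpgradeCard:
    """Hypothetical per-project record of the nine operator questions."""
    db_rolls_forward: Optional[bool] = None                   # question 1
    migration_avoids_long_table_locks: Optional[bool] = None  # question 2
    workloads_stay_up: Optional[bool] = None                  # question 3
    processes_work_with_n_minus_1: Optional[bool] = None      # question 4
    hosts_can_mix_versions: Optional[bool] = None             # question 5
    no_cross_service_version_lock: Optional[bool] = None      # question 6
    api_upgrade_hidden_behind_proxy: Optional[bool] = None    # question 7
    mixed_mode_without_degradation: Optional[bool] = None     # question 8
    claims_validated_by_tests: Optional[bool] = None          # question 9


def concerns(card):
    """Names of questions answered 'no' or not answered at all."""
    return [f.name for f in fields(card) if getattr(card, f.name) is not True]


# Made-up example card, not a statement about any real project.
example = UpgradeCard(db_rolls_forward=True, workloads_stay_up=True)
for name in concerns(example):
    print("needs attention:", name)
```

Unlike a tag, which is a single positive assertion, a record like this keeps "no" and "unknown" visible, which is closer to the "how terrible will this be" framing.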
The tags were really built around grouping a few of these, but even with
folks that are near the problem, they got confusing quickly. I really
think that some more pictorial upgrade safety cards or something
explaining the things you need to consider, and what parts projects
handle for you would be really useful. And then revisit whatever the
tagging structure is going to be later.