[Openstack-operators] [upgrades][skip-level][leapfrog] - RFC - Skipping releases when upgrading

Matteo Panella matteo.panella at cnaf.infn.it
Fri May 26 07:15:01 UTC 2017


Warning: wall of text incoming :-)

On 26/05/2017 03:55, Carter, Kevin wrote:
> If you've taken on an adventure like this how did you approach
> it? Did it work? Any known issues, gotchas, or things folks should be
> generally aware of?

We're fresh out of a Juno-to-Mitaka upgrade. It worked, but it required
significant downtime of the user VMs for an OS upgrade on all compute
nodes (we had fallen behind the CentOS update schedule due to some code
requiring specific kernel versions, so we could not perform a
no-downtime upgrade even though we're using LinuxBridge for the data
plane).

We took a significant amount of time to automate almost everything (OS
updates, OpenStack updates and configuration management), but the
control plane migration was performed manually with a lot of
verification steps to ensure the databases would not end up in shambles
(the procedure was carefully written in a runbook and tested on a
separate testbed and on a snapshot of all production databases).

As I said, the upgrade worked but we hit a few snags:
1. the glance and neutron DBs had been created with latin1 as the
default charset, so we had to convert both to UTF8 (dump, iconv, fix the
definition, restore) - this is an operational issue on our side, though
2. on the testbed we found that nova created duplicate entries for all
hypervisors after starting all services; we traced that down to
compute_nodes.host being NULL for all HVs (see the sketch after this list)
3. [cache]/enabled in nova.conf *must* be set to true if there are
multiple instances of nova-consoleauth/nova-novncproxy; in previous
releases we'd just point nova at our memcache servers and it would work
(we probably overlooked something in the docs)

> During our chat today we generally landed on an in-place upgrade with
> known API service downtime and little (at least as little as possible)
> data plane downtime. The process discussed was basically:
> a1. Create utility "thing-a-me" (container, venv, etc) which contains
> the required code to run a service through all of the required upgrades.
> a2. Stop service(s).
> a3. Run migration(s)/upgrade(s) for all releases using the utility
> "thing-a-me".
> a4. Repeat for all services.
> 
> b1. Once all required migrations are complete run a deployment using the
> target release.
> b2. Ensure all services are restarted.
> b3. Ensure cloud is functional.
> b4. profit!

That was our basic workflow, except the "thing-a-me" was myself :-)

Joking aside, we kept one controller host out of the "mass upgrade" loop
and carefully performed single-version upgrades of the packages, running
all required DB migrations for each version.
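
For the archives, here is roughly what that loop looked like, written
down as a small Python driver. This is a sketch rather than our actual
runbook: the release list, the package-upgrade step and the exact
db-sync sub-commands are assumptions, so check them against the docs
of each intermediate release before trusting them.

import subprocess

# Releases to walk through, one at a time.
RELEASES = ["kilo", "liberty", "mitaka"]

# Standard per-project schema migration commands.
DB_SYNC_COMMANDS = [
    ["keystone-manage", "db_sync"],
    ["glance-manage", "db_sync"],
    ["nova-manage", "db", "sync"],
    ["neutron-db-manage", "upgrade", "heads"],
    ["cinder-manage", "db", "sync"],
]

def upgrade_packages(release):
    # Placeholder: point the package repos at `release` and update the
    # OpenStack packages on the migration host.
    subprocess.check_call(["echo", "upgrading packages to", release])

for release in RELEASES:
    upgrade_packages(release)
    for cmd in DB_SYNC_COMMANDS:
        # Fail fast so the databases never end up at mixed schema
        # versions across projects.
        subprocess.check_call(cmd)

The important property is the fail-fast behaviour: if a single
migration blows up you want to stop right there, restore the snapshot
and figure out why before going any further.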

> Also, the tooling is not very general purpose or portable outside of OSA
> but it could serve as a guide or just a general talking point.
>
> Are there other tools out there that solve for the multi-release upgrade?

Not that I know of. AFAIR, the BlueBox guys (now IBM) had some
Ansible-based tooling for automating single-version upgrades, but I
don't know if they ever considered skip-level upgrades.

> Best practices?

1. automate as much as possible
2. use a configuration management tool to deploy the final configuration
to all nodes (Puppet, Ansible, Chef...)
3. have a testing environment which resembles *as closely as possible*
the production environment
4. simulate all migrations on a snapshot of all production databases to
catch any issues early (a rough sketch of this follows below)
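
A minimal version of point 4, assuming a plain mysqldump snapshot and a
throwaway database name (both made up for the example, as is the
alternate nova config file), looks roughly like this:

import subprocess

SNAPSHOT = "/backups/nova-prod.sql"   # hypothetical dump file
TEST_DB = "nova_upgrade_test"         # scratch database

# Restore the production dump into a scratch database.
subprocess.check_call(["mysqladmin", "create", TEST_DB])
with open(SNAPSHOT) as dump:
    subprocess.check_call(["mysql", TEST_DB], stdin=dump)

# Run the migrations against the scratch copy only: the alternate
# config file (hypothetical path) points nova's connection string at
# TEST_DB, not at the production database.
subprocess.check_call(
    ["nova-manage", "--config-file", "/etc/nova/nova-upgrade-test.conf",
     "db", "sync"]
)

We ran the whole migration chain this way for every project before ever
touching the production control plane.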

> Do folks believe tools are the right way to solve this or would
> comprehensive upgrade documentation be better for the general community?

Both, actually. A generic upgrade tool would need to cover *a lot* of
deployment scenarios, so it would probably end up being a "reference
implementation" only.

Comprehensive skip-level upgrade documentation would be ideal (in our
case we had to rebuild the Kilo and Liberty docs from source).

> As most of the upgrade issues center around database migrations, we
> discussed some of the potential pitfalls at length. One approach was to
> roll-up all DB migrations into a single repository and run all upgrades
> for a given project in one step. Another was to simply have multiple
> python virtual environments and just run in-line migrations from a
> version specific venv (this is what the OSA tooling does). Does one way
> work better than the other? Any thoughts on how this could be better?
> Would having N+2/3 migrations addressable within the projects, even if
> they're not tested any longer, be helpful?

Some projects apparently keep shipping all their migrations, even
though the older ones are no longer tested or supported.
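
For what it's worth, the venv-per-release approach from the question
above is simple to prototype. The sketch below is illustrative only:
the version pins and paths are made up, and depending on the release
you may have to install from tarballs or git tags instead of PyPI.

import subprocess
import venv

# One venv per release, each pinning nova to a single release so that
# only that release's slice of the migrations is run.
STEPS = [
    ("kilo",    "nova>=2015.1,<2015.2"),
    ("liberty", "nova>=12.0,<13.0"),
    ("mitaka",  "nova>=13.0,<14.0"),
]

for release, pin in STEPS:
    env_dir = "/opt/upgrade-venvs/%s" % release
    venv.EnvBuilder(with_pip=True).create(env_dir)

    # Install that release's code into its own venv...
    subprocess.check_call([env_dir + "/bin/pip", "install", pin])
    # ...and run just its schema migrations against the database
    # referenced in the (shared) config file.
    subprocess.check_call(
        [env_dir + "/bin/nova-manage",
         "--config-file", "/etc/nova/nova.conf", "db", "sync"]
    )

Whether that is nicer than rolling all migrations into one repository
is mostly a matter of taste, but it does keep each step attributable to
a specific release.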

> It was our general thought that folks would be interested in having the
> ability to skip releases so we'd like to hear from the community to
> validate our thinking.

That's good to know :-)

-- 
Matteo Panella
INFN CNAF
Via Ranzani 13/2 c - 40127 Bologna, Italy
Phone: +39 051 609 2903
