[Openstack] My diablo to essex upgrade process (was: A plea from an OpenStack user)

Ryan Lane rlane at wikimedia.org
Tue Aug 28 23:52:18 UTC 2012


> It would be fascinating (for me at least :)) to know the upgrade
> process you use - how many stages you use, do you have multiple
> regions and use one/some as canaries? Does the downtime required to do
> an upgrade affect you? Do you run skewed versions (e.g. folsom nova,
> essex glance) or do you do lock-step upgrades of all the components?
>

This was a particularly difficult upgrade, since we needed to change
so many things at once.

We did a lock-step upgrade this time around. Keystone basically
required that: as far as I could tell, if you enable keystone for
nova, you must enable it for glance as well. Also, I know the
components are well tested for compatibility within the same release,
so I thought it best not to introduce any extra complications.
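
To make the coupling concrete: once nova authenticates requests with
keystone, the user's keystone token is what gets handed to glance for
image operations, so glance has to be able to validate it too. Below is
a minimal sketch of that flow from a client's point of view (essex
exposes the v2.0 keystone API and the v1 glance API); the endpoints,
tenant, and credentials are placeholders, not our actual values.

import requests

# Placeholder endpoints -- adjust to your deployment.
KEYSTONE = "http://controller:5000/v2.0"
GLANCE = "http://controller:9292/v1"

# Get a token from keystone (v2.0 password auth).
resp = requests.post(
    KEYSTONE + "/tokens",
    json={"auth": {"tenantName": "demo",
                   "passwordCredentials": {"username": "demo",
                                           "password": "secret"}}})
resp.raise_for_status()
token = resp.json()["access"]["token"]["id"]

# The same token ends up in the X-Auth-Token header on requests to
# glance (nova does the equivalent internally when it fetches images),
# which is why glance must also be configured to validate keystone
# tokens.
resp = requests.get(GLANCE + "/images", headers={"X-Auth-Token": token})
resp.raise_for_status()
for image in resp.json()["images"]:
    print("%s  %s" % (image["id"], image["name"]))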

I did my initial testing in a project within my infrastructure (hooray
for inception). After everything worked in the test setup and was
puppetized, I tested on production hardware. I'm preparing a region in
a new datacenter, so this time I used that hardware for
production-level testing. In the future we're going to set aside a
small amount of cheap-ish hardware for production-level testing.

This upgrade required an operating system upgrade as well. I took the
following steps for the actual upgrade:

1. Backed up all databases and LDAP (see the sketch after this list)
2. Disabled the OpenStackManager extension in the controller's wiki
(we have a custom interface integrated with MediaWiki)
3. Turned off all OpenStack services
4. Made the LDAP changes required for Keystone's backend
5. Upgraded the controller to precise, then made the required changes
(via puppet), which included installing and configuring keystone
6. Upgraded the glance and nova databases (also covered in the sketch
after this list)
7. Upgraded the network node to precise, then made required changes
(via puppet) - this caused network downtime for a few minutes during
the reboot and puppet run
8. Upgraded a compute node that wasn't in use to precise, made
required changes (via puppet), and tested instance creation and
networking
9. Upgraded a compute node that was in use, rebooted a couple of
instances to make sure they would start properly and come back with
working networking, then rebooted all instances on the node
10. Upgraded the remaining compute nodes and rebooted their instances
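
For what it's worth, here's a rough sketch of the mechanical parts of
steps 1 and 6. The backup destinations and the use of mysqldump and
slapcat are assumptions about a MySQL-plus-OpenLDAP setup like ours;
nova-manage db sync and glance-manage db_sync are the standard schema
migration commands once the essex packages are installed.

#!/usr/bin/env python
"""Pre-upgrade backups (step 1) and schema migrations (step 6)."""
import subprocess
import time

STAMP = time.strftime("%Y%m%d-%H%M%S")


def run(cmd, outfile=None):
    """Run a command, optionally capturing stdout into a backup file."""
    print("running: " + " ".join(cmd))
    if outfile:
        with open(outfile, "w") as fh:
            subprocess.check_call(cmd, stdout=fh)
    else:
        subprocess.check_call(cmd)


# Step 1: dump all MySQL databases and the LDAP tree. The paths, and the
# assumption that MySQL (credentials in ~/.my.cnf) and OpenLDAP run on
# this host, are placeholders.
run(["mysqldump", "--all-databases", "--single-transaction"],
    outfile="/var/backups/openstack-dbs-%s.sql" % STAMP)
run(["slapcat", "-l", "/var/backups/ldap-%s.ldif" % STAMP])

# Step 6, after the essex packages are in place: run the nova and glance
# schema migrations.
run(["nova-manage", "db", "sync"])
run(["glance-manage", "db_sync"])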

I had notes on how to roll back during various phases of the upgrade;
these mostly involved moving services to different nodes.

Downtime was required because of the need to change OS releases. That
said, my environment is mostly test and development, plus some
semi-production uses that can tolerate downtime, so I didn't put a
large amount of effort into avoiding downtime completely.

> For Launchpad we've been moving more and more to a model of permitting
> temporary skew so that we can do rolling upgrades of the component
> services. That seems in-principle doable here - and could make it
> easier to smoothly transition between versions, at the cost of a
> (small) amount of attention to detail while writing changes to the
> various apis.
>

Right now it's not possible to run multiple versions of OpenStack
services, as far as I know. It would be ideal to be able to run all
folsom and grizzly services (for instance) side by side while the
upgrade is occurring. At a minimum it would be nice for the next
release to be able to use the old release's schema, so that upgrades
can be attempted in a way that's much easier to roll back.

- Ryan



