Open Stack

Tue Aug 28 21:26:12 UTC 2012

Yesterday I spent the day finally upgrading my nova infrastructure
from diablo to essex. I've upgraded from bexar to cactus, and cactus
to diablo, and now diablo to essex. Every single upgrade is becoming
more and more difficult. It's not getting easier, at all. Here are
some of the issues I ran into:

1. Glance changed from using image numbers to uuids for images. Nova's
references to these weren't updated, and there was no automated way to
do so. I had to map the old values to the new values from glance's
database, then update them in nova.

2. Instance hostnames have changed in every single release. In bexar
and cactus the hostname was the ec2-style id. In diablo it was changed
and hardcoded to instance-<ec2-style-id>. In essex it is hardcoded to
the instance name; the instance's ID is configurable (with a default
of instance-<ec2-style-id>), but it only affects the name used in
virsh and on the filesystem. I put a hack into diablo (thanks to Vish
for that hack) to fix the naming convention so as not to break our
production deployment, but it only affected the hostnames in the
database; instances in virsh and on the filesystem were still named
instance-<ec2-style-id>. Since our naming convention is the ec2-style
format, I had to fix all the libvirt definitions and rename a ton of
files during this upgrade. The hostname change still affected our
deployment, though, because it's hardcoded. I decided to simply
switch hostnames to the instance name in production, since our
hostnames are required to be unique globally; however, that changes
how our puppet infrastructure works too, since the certname is by
default based on fqdn (I changed this to use the ec2-style id). Small
changes like this have giant rippling effects in infrastructures.

3. There used to be global groups in nova. In keystone there are no
global groups. This makes performing actions on sets of instances
across tenants incredibly difficult; for instance, I did an in-place
ubuntu upgrade from lucid to precise on a compute node, and needed to
reboot all instances on that host. There's no way to do that without
database queries fed into a custom script. Also, I have to have a
management user added to every single tenant and every single
tenant-role.

4. Keystone's LDAP implementation in stable was broken. It returned no
roles, many values were hardcoded, etc. The LDAP implementation in
nova worked, and it looks like its code was simply ignored when auth
was moved into keystone.

My plea is for the developers to think about how their changes are
going to affect production deployments when upgrade time comes.

It's fine that glance changed its id structure, but the upgrade should
have handled that. If a user needs to go into the database in their
deployment to fix your change, it's broken.
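
As a rough illustration of the manual fix described in item 1, the
remapping step could be sketched like this. It is only a sketch under
stated assumptions: the dictionary keys ("id", "image_ref"), the
function name, and the shape of the old-id-to-uuid mapping are
hypothetical and are not taken from the actual nova or glance schemas.

```python
def remap_image_refs(instances, id_to_uuid):
    """Swap old numeric image references for their new uuids.

    instances:  list of dicts with a hypothetical "image_ref" key
                (rows read out of the compute database by hand).
    id_to_uuid: dict mapping the old image id (as a string) to the
                new uuid (built by hand from the image database).

    Returns (remapped, unmapped) so that rows with no known mapping
    are reported instead of being silently dropped.
    """
    remapped, unmapped = [], []
    for inst in instances:
        old_ref = str(inst["image_ref"])
        if old_ref in id_to_uuid:
            # Copy the row so the caller's input is left untouched.
            remapped.append(dict(inst, image_ref=id_to_uuid[old_ref]))
        else:
            unmapped.append(inst)
    return remapped, unmapped
```

Returning the unmapped rows separately makes it obvious which
instances still need attention before anything is written back.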

The constant hardcoded hostname changes are totally unacceptable; if
you change something like this it *must* be configurable, and there
should be a warning that the default is changing.

The removal of global groups was a major usability killer for users.
The removal itself wasn't necessarily the problem, though. The problem
is that no alternative management methods were added. There's
currently no reasonable way to manage the infrastructure.
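
To make the per-host case from item 3 concrete, here is a minimal
sketch of the kind of custom script that ends up being needed. It
assumes the (host, tenant, instance id) rows have already been pulled
out of the database by hand; the row layout and function names are
hypothetical, and the actual reboot call is deliberately left out.

```python
from collections import defaultdict


def instances_by_host(rows):
    """Group (host, tenant, instance_id) rows by compute host."""
    grouped = defaultdict(list)
    for host, tenant, instance_id in rows:
        grouped[host].append((tenant, instance_id))
    return dict(grouped)


def reboot_plan(rows, host):
    """List every (tenant, instance_id) on one host, across tenants.

    The result is sorted so the plan is stable between runs; an
    unknown host yields an empty plan rather than an error.
    """
    return sorted(instances_by_host(rows).get(host, []))
```

Each entry in the plan still has to be acted on with credentials
valid for that tenant, which is why the management user has to exist
in every tenant.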

I understand that bugs will crop up when a stable branch is released,
but the LDAP implementation in keystone was missing basic
functionality. Keystone simply doesn't work without roles. I believe
this was likely due to the fact that the LDAP backend has basically no
tests and that Keystone Light was rushed in for this release. It's
imperative that new required services, when released, at least handle
the functionality they are replacing.

That said, excluding the above issues, my upgrade went fairly smoothly
and this release is *way* more stable and performs *way* better, so
kudos to the community for that. Keep up the good work!

- Ryan

[Openstack] A plea from an OpenStack user