[Openstack-operators] How do you even test for that?

Clint Byrum clint at fewbar.com
Tue Oct 18 17:03:15 UTC 2016


Excerpts from Jonathan Proulx's message of 2016-10-17 14:49:13 -0400:
> Hi All,
> 
> Just on the other side of a Kilo->Mitaka upgrade (with a very brief
> transit through Liberty in the middle).
> 
> As usual I've caught a few problems in production that I have no idea
> how I could possibly have tested for, because they relate to older
> running instances and remnants of older package versions on the
> production side which wouldn't have existed in test unless I'd
> installed the test server with Havana, done incremental upgrades, and
> started a fairly wide suite of test instances along the way.
> 

In general, modifying _anything_ in place is hard to test.

You're much better off with as much immutable content as possible on all
of your nodes. If you've been wondering what this whole Docker nonsense
is about, well, that's what it's about. You run docker build once per
software release attempt, then mount data read/write and configs
read-only. Both openstack-ansible and kolla are deployment projects that
try to do some of this via LXC or Docker, IIRC.
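
For example, a rough sketch of that pattern (the image name, tag, and
paths here are made up):

    # build one immutable image per software release attempt
    docker build -t mycloud/nova-compute:mitaka-1 .

    # run it with state mounted read/write and config mounted read-only
    docker run -d --name nova-compute \
        -v /var/lib/nova:/var/lib/nova \
        -v /etc/nova:/etc/nova:ro \
        mycloud/nova-compute:mitaka-1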

This way, once you've tested your container image in test, you copy it
out to prod, start the new containers, and stop the old ones, and you
know that _at least_ you don't have older stuff running anymore. Data
and config are still likely sources of issues, but there are other ways
to help test for those.
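
The cutover itself can be as simple as this sketch (again with made-up
names; a registry push/pull would do the same job as the save/load):

    # ship the tested image from test to prod
    docker save mycloud/nova-compute:mitaka-1 | ssh prod docker load

    # on prod: start the new container, then stop the old one
    docker run -d --name nova-compute-mitaka-1 \
        -v /var/lib/nova:/var/lib/nova \
        -v /etc/nova:/etc/nova:ro \
        mycloud/nova-compute:mitaka-1
    docker stop nova-compute-kilo-3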

> First thing that bit me was neutron-db-manage being confused because
> my production system still had migrations from Havana hanging around.
> I'm calling this a packaging bug
> https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1633576 but I
> also feel like remembering release names forever might be a good
> thing.
> 

Ouch. Indeed, one of the first things to do _before_ an upgrade is to
run the migrations of the current version to make sure your schema is
up to date. It's also best to make sure you have _all_ of the stable
updates before you do that, since fixes may have landed in the
migrations that are meant to smooth the upgrade process.
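
Concretely, something along these lines before starting the upgrade
(config file paths vary by distro, so treat this as a sketch):

    # first pull the latest stable point release of the current series
    apt-get update && apt-get install neutron-server

    # then confirm the schema is at the current series' head
    neutron-db-manage --config-file /etc/neutron/neutron.conf \
        --config-file /etc/neutron/plugins/ml2/ml2_conf.ini current
    neutron-db-manage --config-file /etc/neutron/neutron.conf \
        --config-file /etc/neutron/plugins/ml2/ml2_conf.ini upgrade heads
    # (older releases with a single alembic branch use "upgrade head")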

> Later I discovered that during the Juno release (maybe earlier ones
> too) making snapshots of running instances populated the snapshot's
> metadata with "instance_type_vcpu_weight: none".  Currently (Mitaka)
> this value must be an integer if it is set, or boot fails.  This has
> the interesting side effect of putting your instance into
> shutdown/error state if you try a hard reboot of a formerly working
> instance.  I 'fixed' this by manually frobbing the DB to mark rows
> where instance_type_vcpu_weight was set to none as deleted.
> 

This one is tough because it is clearly data- and state-related. It's
hard to say how you got the 'none' values in there instead of ints.
Somebody else suggested taking DB snapshots and loading them into a test
control plane. That seems like an easy-ish way to do some surface-level
checking, but it could also be super dangerous if not isolated well, and
the more isolation you add, the less of a real simulation it is.
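
If you do go the snapshot route, something like this into a throwaway
database on an isolated host is a start (the table and column names
below are from memory, so verify them against your actual schema):

    # dump the production nova DB and load it into an isolated test DB
    mysqldump nova > nova-prod.sql
    mysql -h test-db nova_test < nova-prod.sql

    # look for suspect flavor metadata like the vcpu_weight case; the
    # bad rows carried the string 'None' as their value
    mysql -h test-db nova_test <<'EOF'
    SELECT instance_uuid, value
      FROM instance_system_metadata
     WHERE `key` = 'instance_type_vcpu_weight'
       AND deleted = 0;
    EOF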

> Does anyone have strategies on how to actually test for problems with
> "old" artifacts like these?
> 
> Yes, having things running from 18-24 month old snapshots is "bad",
> and yes, it would be cleaner to install a fresh control plane at each
> upgrade and cut over rather than doing an actual in-place upgrade.
> But neither of these sub-optimal patterns is going away anytime soon.
>

In-place upgrades must work. If they don't, please file bugs and
complain loudly. :)


