[Openstack-operators] How do you even test for that?
Jonathan D. Proulx
jon at csail.mit.edu
Tue Oct 18 14:00:32 UTC 2016
On Mon, Oct 17, 2016 at 05:45:07PM -0600, Matt Fischer wrote:
:This does not cover all your issues, but after seeing MySQL bugs going
:from I to J and again from J to K, we now export and restore production
:control-plane data into a dev environment to test the upgrades. If we
:hit issues we destroy this environment and run it again.
Yeah, I learned that one the hard way a while back (maybe Havana?): you
never revert a production OpenStack upgrade ;)
A copy of the production DB goes into test pretty much immediately
prior to upgrade tests.
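The dump-and-restore step can be sketched roughly as below; the host names, database list, and mysqldump flags are illustrative assumptions on my part, not details from the thread.

```python
# Sketch of cloning the production control-plane DBs into a throwaway dev
# environment before an upgrade test. Hosts and the DB list are hypothetical.

def clone_commands(databases, prod_host, dev_host):
    """Build (dump, restore) shell command pairs for each control-plane DB."""
    pairs = []
    for db in databases:
        dump = (f"mysqldump --single-transaction -h {prod_host} {db} "
                f"> /tmp/{db}.sql")
        restore = (f"mysql -h {dev_host} -e "
                   f"'DROP DATABASE IF EXISTS {db}; CREATE DATABASE {db}' "
                   f"&& mysql -h {dev_host} {db} < /tmp/{db}.sql")
        pairs.append((dump, restore))
    return pairs

# Example: the services most likely to carry old artifacts across upgrades.
cmds = clone_commands(["nova", "neutron", "glance"],
                      "prod-db.example.com", "dev-db.example.com")
```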
:For longer running instances that's tough but we try to catch those in our
:shared dev environment or staging with regression tests. This is also where
:we catch issues with outside hardware interactions like load balancers and
:storage.
:
:For your other issue, was there a warning or deprecation notice in the
:logs for that? That's always at the top of our checklist.
Not that I saw, or could find post facto, in the nova and glance logs on
the controllers or hypervisors.
But it's not so much the specific issue (which is dealt with) as the
class of issue: compatibility of artifacts created under Latest-N,
where N>1.
It's entirely possible there isn't a good way to test. I mean, I can't
think of one, but I know some of you out there are smarter than me, so
hope springs eternal.
Perhaps designing better post-upgrade validation that focuses on the
oldest artifacts, or on various generations of them, is the best I can
hope for. At least then Ops would catch these issues and start working
on a fix ASAP.
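The "validate the oldest artifacts" idea might look something like this; the function and the sampling-by-year scheme are my own sketch, not an existing tool.

```python
from datetime import datetime

# Sketch: given (id, created_at) records pulled from nova or glance, pick a
# few artifacts from each yearly generation so a post-upgrade smoke test
# (hard reboot, boot-from-snapshot) exercises old metadata, not just
# freshly created resources.
def sample_generations(artifacts, per_generation=2):
    by_year = {}
    for art_id, created in sorted(artifacts, key=lambda a: a[1]):
        bucket = by_year.setdefault(created.year, [])
        if len(bucket) < per_generation:
            bucket.append(art_id)
    return [a for year in sorted(by_year) for a in by_year[year]]

picks = sample_generations([
    ("snap-old", datetime(2014, 6, 1)),   # Juno-era snapshot
    ("snap-mid", datetime(2015, 3, 1)),
    ("snap-new", datetime(2016, 9, 1)),
])
```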
-Jon
:On Oct 17, 2016 12:51 PM, "Jonathan Proulx" <jon at csail.mit.edu> wrote:
:
:> Hi All,
:>
:> Just on the other side of a Kilo->Mitaka upgrade (with a very brief
:> transit through Liberty in the middle).
:>
:> As usual I've caught a few problems in production that I have no idea
:> how I could possibly have tested for, because they relate to older
:> running instances and to remnants of older package versions on the
:> production side which wouldn't have existed in test unless I'd
:> installed the test server with Havana, done incremental upgrades, and
:> started a fairly wide suite of test instances along the way.
:>
:> First thing that bit me was neutron-db-manage being confused because
:> my production system still had migrations from Havana hanging around.
:> I'm calling this a packaging bug
:> https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1633576 but I
:> also feel like remembering release names forever might be a good
:> thing.
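One defensive pre-flight check for this class of problem, sketched under the assumption that you can list both the revisions stamped in the DB (e.g. via neutron-db-manage / alembic) and the revisions shipped by the installed package; the revision strings below are made up for illustration.

```python
# Flag any revision recorded in the database that the installed migration
# tree no longer knows about (here, leftover Havana-era revisions that the
# Mitaka-era packages had dropped).
def unknown_revisions(db_revisions, packaged_revisions):
    """Return DB-stamped revisions missing from the installed migration tree."""
    return sorted(set(db_revisions) - set(packaged_revisions))

missing = unknown_revisions(
    db_revisions=["havana_rev_1", "mitaka_rev_9"],
    packaged_revisions=["mitaka_rev_9", "mitaka_rev_8"],
)
```

If `missing` is non-empty, stop and reconcile before running the upgrade.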
:>
:> Later I discovered that during the Juno release (maybe earlier ones
:> too) making a snapshot of a running instance populated the snapshot's
:> metadata with "instance_type_vcpu_weight: none". Currently (Mitaka)
:> this value must be an integer if it is set, or boot fails. This has
:> the interesting side effect of putting your instance into a
:> shutdown/error state if you try a hard reboot of a formerly working
:> instance. I 'fixed' this by manually frobbing the DB to mark the rows
:> where instance_type_vcpu_weight was set to none as deleted.
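That manual fix can be sketched as follows; this uses an in-memory SQLite stand-in for nova's instance_system_metadata table (production would be MySQL, with more columns), and assumes nova's convention of soft-deleting rows by setting deleted = id.

```python
import sqlite3

# In-memory stand-in for nova's instance_system_metadata table; column
# details are simplified for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE instance_system_metadata (
                    id INTEGER PRIMARY KEY,
                    instance_uuid TEXT,
                    key TEXT,
                    value TEXT,
                    deleted INTEGER DEFAULT 0)""")
conn.executemany(
    "INSERT INTO instance_system_metadata (instance_uuid, key, value) "
    "VALUES (?, ?, ?)",
    [("uuid-1", "instance_type_vcpu_weight", "none"),  # bad Juno-era value
     ("uuid-2", "instance_type_vcpu_weight", "4"),     # legitimate integer
     ("uuid-3", "instance_type_memory_mb", "2048")])

# Soft-delete only the rows where the weight was stored as the string
# 'none'; nova marks deleted rows with deleted = id, not a boolean.
conn.execute("""UPDATE instance_system_metadata
                   SET deleted = id
                 WHERE key = 'instance_type_vcpu_weight'
                   AND value = 'none'""")
live = conn.execute("SELECT instance_uuid FROM instance_system_metadata "
                    "WHERE deleted = 0").fetchall()
```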
:>
:> Does anyone have strategies on how to actually test for problems with
:> "old" artifacts like these?
:>
:> Yes, having things running from 18-24 month old snapshots is "bad",
:> and yes, it would be cleaner to install a fresh control plane at each
:> upgrade and cut over rather than doing an actual in-place upgrade.
:> But neither of these sub-optimal patterns is going away anytime soon.
:>
:> -Jon
:>
:> --
:>
:> _______________________________________________
:> OpenStack-operators mailing list
:> OpenStack-operators at lists.openstack.org
:> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
:>