<p dir="ltr">This does not cover all of your issues, but after hitting MySQL bugs going from Icehouse to Juno and again from Juno to Kilo, we now export production control plane data and restore it into a dev environment to test upgrades. If we hit issues, we destroy that environment and run it again. </p>
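
<p dir="ltr">A rough sketch of what that export and restore step could look like, in case it helps. The database names and host names here are placeholders rather than our actual setup, and it assumes credentials come from a MySQL option file. It only builds the commands so you can review them before running anything. </p>

```python
import shlex

# Hypothetical database names; adjust to the services you actually run.
CONTROL_PLANE_DBS = ["keystone", "glance", "nova", "neutron", "cinder"]


def dump_command(prod_host, databases=CONTROL_PLANE_DBS):
    """Build a mysqldump invocation that exports the given databases from production."""
    return [
        "mysqldump",
        "--host=" + prod_host,
        "--single-transaction",  # consistent InnoDB snapshot without locking tables
        "--databases", *databases,
    ]


def restore_command(dev_host):
    """Build the mysql invocation that replays the dump (read from stdin) into dev."""
    return ["mysql", "--host=" + dev_host]


def as_pipeline(prod_host, dev_host):
    """Render both steps as a single shell pipeline, for review before running."""
    dump = " ".join(shlex.quote(arg) for arg in dump_command(prod_host))
    restore = " ".join(shlex.quote(arg) for arg in restore_command(dev_host))
    return f"{dump} | {restore}"
```

<p dir="ltr">Calling as_pipeline with your production and dev database hosts gives a one-liner to paste into a shell. Because the dev side is thrown away after each run, a failed upgrade just means restoring again and retrying. </p>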

<p dir="ltr">Long-running instances are tougher to cover, but we try to catch those problems in our shared dev environment or in staging with regression tests. This is also where we catch issues with external hardware interactions such as load balancers and storage. </p>

<p dir="ltr">For your other issue, was there a warning or deprecation notice in the logs for that? Checking for those is always at the top of our checklist. <br>

</p>
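
<p dir="ltr">For the log check, something as simple as the following sketch would do. The function name is made up for illustration, and it only does a case-insensitive match on "deprecat", so treat the hits as a starting point for reading the logs rather than a complete audit. </p>

```python
import re

# Matches "deprecated", "deprecation", "DeprecationWarning", etc., case-insensitively.
DEPRECATION = re.compile(r"deprecat", re.IGNORECASE)


def find_deprecations(lines):
    """Return (line_number, text) pairs for lines that mention a deprecation."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if DEPRECATION.search(line)
    ]
```

<p dir="ltr">It takes any iterable of lines, so you can pass an open log file straight in and review the matches before the upgrade. </p>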

<div class="gmail_extra"><br><div class="gmail_quote">On Oct 17, 2016 12:51 PM, "Jonathan Proulx" &lt;<a href="mailto:jon@csail.mit.edu">jon@csail.mit.edu</a>&gt; wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi All,<br>

<br>

Just on the other side of a Kilo->Mitaka upgrade (with a very brief<br>

transit through Liberty in the middle).<br>

<br>

As usual I've caught a few problems in production that I have no idea<br>

how I could possibly have tested for because they relate to older<br>

running instances and some remnants of older package versions on the<br>

production side which wouldn't have existed in test unless I'd<br>

installed the test server with Havana and done incremental upgrades<br>

starting a fairly wide suite of test instances along the way.<br>

<br>

First thing that bit me was neutron-db-manage being confused because<br>

my production system still had migrations from Havana hanging around.<br>

I'm calling this a packaging bug<br>

<a href="https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1633576" rel="noreferrer" target="_blank">https://bugs.launchpad.net/<wbr>ubuntu/+source/neutron/+bug/<wbr>1633576</a> but I<br>

also feel like remembering release names forever might be a good<br>

thing.<br>

<br>

Later I discovered that during the Juno release (maybe earlier ones too)<br>

making a snapshot of a running instance populated the snapshot's<br>

metadata with "instance_type_vcpu_weight: none".  Currently (Mitaka) this<br>

value must be an integer if it is set or boot fails.  This has the<br>

interesting side effect of putting your instance into shutdown/error<br>

state if you try a hard reboot of a formerly working instance.  I<br>

'fixed' this by manually frobbing the DB so that rows where<br>

instance_type_vcpu_weight was set to none are marked deleted.<br>

<br>

Does anyone have strategies on how to actually test for problems with<br>

"old" artifacts like these?<br>

<br>

Yes, having things running from 18-24 month old snapshots is "bad" and<br>

yes it would be cleaner to install a fresh control plane at each<br>

upgrade and cut over rather than doing an actual in place upgrade.  But<br>

neither of these sub-optimal patterns is going all the way away<br>

anytime soon.<br>

<br>

-Jon<br>

<br>

--<br>

<br>

______________________________<wbr>_________________<br>

OpenStack-operators mailing list<br>

<a href="mailto:OpenStack-operators@lists.openstack.org">OpenStack-operators@lists.<wbr>openstack.org</a><br>

<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators" rel="noreferrer" target="_blank">http://lists.openstack.org/<wbr>cgi-bin/mailman/listinfo/<wbr>openstack-operators</a><br>

</blockquote></div></div>