[openstack-dev] [nova] Rocky RC time regression analysis

melanie witt melwittt at gmail.com
Fri Oct 5 23:59:28 UTC 2018


Hey everyone,

During our Rocky retrospective discussion at the PTG [1], we talked
about the spec freeze deadline (milestone 2; historically it had been
milestone 1) and whether or not it was related to the hectic RC period,
full of late-breaking regressions, that we had last cycle. I had an
action item to go through the list of RC time bugs [2] and dig into each
one, examining when the patch that introduced the bug landed vs when the
bug was reported and why it wasn't caught sooner, and then to report
back so we can take a look together and determine whether the
regressions were related to the spec freeze deadline.

I used this etherpad to make notes [3], which I will [mostly] copy-paste 
here. These are all after RC1 and I'll paste them in chronological order 
of when the bug was reported.

Milestone 1 (r-1) was on 2018-04-19.
Spec freeze was at milestone 2 (r-2), on 2018-06-07.
Feature freeze (FF) was on 2018-07-26.
RC1 was on 2018-08-09.

1) Broken live migration bandwidth minimum => maximum based on neutron 
event https://bugs.launchpad.net/nova/+bug/1786346

- Bug was reported on 2018-08-09, the day of RC1
- The patch that caused the regression landed on 2018-03-30 
https://review.openstack.org/497457
- Unrelated to a blueprint, the regression was part of a bug fix
- Was found because prometheanfire was doing live migrations and noticed 
they seemed to be stuck at 1MiB/s for linuxbridge VMs
- The bug was due to a race, so the gate didn't hit it
- Comment on the regression bug from dansmith: "The few hacked up gate 
jobs we used to test this feature at merge time likely didn't notice the 
race because the migrations finished before the potential timeout and/or 
are on systems so loaded that the neutron event came late enough for us 
to win the race repeatedly."

2) Docs for the zvm driver missing

- All zvm driver code changes were merged by 2018-07-17, but the
documentation was overlooked and its absence was not noticed until near
RC time
- Blueprint was approved on 2018-02-12

3) Volume status remains "detaching" after a failure to detach a volume 
due to DeviceDetachFailed https://bugs.launchpad.net/nova/+bug/1786318

- Bug was reported on 2018-08-09, the day of RC1
- The change that introduced the regression landed on 2018-02-21 
https://review.openstack.org/546423
- Unrelated to a blueprint, the regression was part of a bug fix
- Question: why wasn't this caught earlier?
- Answer: Unit tests were not asserting the call to the roll_detaching 
volume API. Coverage has since been added along with the bug fix 
https://review.openstack.org/590439
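
To illustrate the kind of assertion that was missing, here is a minimal
sketch using unittest.mock. It is not the actual nova code or the test
added in the fix: detach_volume(), the stand-in DeviceDetachFailed class,
and the argument names are all hypothetical. Only the idea of asserting
the roll_detaching call comes from the notes above.

```python
import unittest
from unittest import mock


class DeviceDetachFailed(Exception):
    """Stand-in for the driver error named in the bug title."""


def detach_volume(driver, volume_api, context, volume_id):
    """Hypothetical detach helper that rolls the status back on failure."""
    try:
        driver.detach(volume_id)
    except DeviceDetachFailed:
        # Without this call the volume would be left in "detaching".
        volume_api.roll_detaching(context, volume_id)
        raise


class DetachRollbackTest(unittest.TestCase):
    def test_roll_detaching_called_on_failure(self):
        driver = mock.Mock()
        driver.detach.side_effect = DeviceDetachFailed()
        volume_api = mock.Mock()

        with self.assertRaises(DeviceDetachFailed):
            detach_volume(driver, volume_api, mock.sentinel.ctx, "vol-1")

        # This is the assertion that catches a missing rollback.
        volume_api.roll_detaching.assert_called_once_with(
            mock.sentinel.ctx, "vol-1")
```

If the roll_detaching call is removed from the except block, the last
assertion fails, which is the signal the original unit tests did not
give.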

4) OVB overcloud deploy fails on nova placement errors 
https://bugs.launchpad.net/nova/+bug/1787910

- Bug was reported on 2018-08-20
- Change that caused the regression landed on 2018-07-26, FF day 
https://review.openstack.org/517921
- Blueprint was approved on 2018-05-16
- Was found because of a failure in the 
legacy-periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master 
CI job. The ironic-inspector CI upstream also failed because of this, as 
noted by dtantsur.
- Question: why did it take nearly a month for the failure to be 
noticed? Is there any way we can cover this in our 
ironic-tempest-dsvm-ipa-wholedisk-bios-agent_ipmitool-tinyipa job?

5) when live migration fails due to a internal error rollback is not 
handled correctly https://bugs.launchpad.net/nova/+bug/1788014

- Bug was reported on 2018-08-20
- The change that caused the regression landed on 2018-07-26, FF day 
https://review.openstack.org/434870
- Unrelated to a blueprint, the regression was part of a bug fix
- Was found because sean-k-mooney was doing live migrations and found 
that when a LM failed because of a QEMU internal error, the VM remained 
ACTIVE but the VM no longer had network connectivity.
- Question: why wasn't this caught earlier?
- Answer: We would need a live migration job scenario that intentionally 
initiates and fails a live migration, then verify network connectivity 
after the rollback occurs.
- Question: can we add something like that?

6) nova-manage db online_data_migrations hangs on instances with no host 
set https://bugs.launchpad.net/nova/+bug/1788115

- Bug was reported on 2018-08-21
- The patch that introduced the bug landed on 2018-05-30 
https://review.openstack.org/567878
- Unrelated to a blueprint, the regression was part of a bug fix
- Question: why wasn't this caught earlier?
- Answer: To hit the bug, you had to have had instances with no host set
(that failed to schedule) in your database during an upgrade. This does
not happen during the grenade job.
- Question: could we add anything to the grenade job that would leave 
some instances with no host set to cover cases like this?

7) release notes erroneously say that nova-consoleauth doesn't have to 
run in Rocky https://bugs.launchpad.net/nova/+bug/1788470

- Bug was reported on 2018-08-22
- The patches that conveyed the wrong information for the docs landed on 
2018-05-07 https://review.openstack.org/565367
- Blueprint was approved on 2018-03-12
- Question: why wasn't this caught earlier?
- Answer: The patches should have been tested by a devstack change that 
runs without the nova-consoleauth service, to verify the system can work 
without the service.
- Question: can we add test coverage for that?
- Answer: Yes, it's proposed as a WIP https://review.openstack.org/607070

8) libvirt driver rx_queue_size changes break SR-IOV 
https://bugs.launchpad.net/nova/+bug/1789074

- Bug was reported on 2018-08-25
- The change that caused the regression landed on 2018-04-23 
https://review.openstack.org/484997
- Blueprint was approved on 2018-03-23
- Was found because moshele tried to create a server with an SRIOV 
interface (PF or VF) with rx_queue_size and tx_queue_size set in 
nova.conf and it failed.
- Question: why wasn't this caught earlier?
- Answer: Exposing the bug required both setting rx/tx queue sizes and 
booting a server with a SRIOV interface. We don't have hardware for 
testing SRIOV in the gate.
- Question: is there any other way to test this via functional tests?
From what I understand, there isn't.

Based on all of this, I don't believe the spec freeze at milestone 2 was 
related to the late-breaking regressions we had around RC time. We 
approved 10 additional blueprints between r-1 2018-04-19 and r-2 
2018-06-07 [4]. Half of the regressions were unrelated to feature work 
and were introduced as part of bug fixes. Of the other half, 3 out of 4 
had blueprints approved before r-1. Only one involved a blueprint 
approved after r-1.
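
The blueprint tally above can be double-checked with a few lines of
Python. This is only a sketch: the dates are the ones listed in this
email, and the dictionary keys are informal labels made up here for
readability, not official blueprint names.

```python
from datetime import date

R1 = date(2018, 4, 19)  # milestone 1 (r-1)

# Blueprint approval dates for the four feature-related items (2, 4, 7, 8).
blueprint_approved = {
    "2 zvm driver docs": date(2018, 2, 12),
    "4 OVB placement errors": date(2018, 5, 16),
    "7 nova-consoleauth release notes": date(2018, 3, 12),
    "8 rx_queue_size and SR-IOV": date(2018, 3, 23),
}

before_r1 = [k for k, d in blueprint_approved.items() if d < R1]
after_r1 = [k for k, d in blueprint_approved.items() if d >= R1]

print(len(before_r1), "approved before r-1")
print(len(after_r1), "approved after r-1:", after_r1)
```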

In two cases (4 and 5), the patch that introduced the regression landed
on feature freeze day and the bug was reported 25 days later, which was
11 days after RC1. In most of the other cases, the regression landed
months before the bug was found and went undetected because of a lack of
test coverage.
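
For anyone who wants to check the timing, the gap between each
regression landing and its bug report can be computed from the dates
listed above. This is a rough sketch, not part of the original analysis.
Item 2 is left out because no landing or report date is given for it,
and the variable names are my own.

```python
from datetime import date

RC1 = date(2018, 8, 9)

# item number -> (date the regression landed, date the bug was reported)
regressions = {
    1: (date(2018, 3, 30), date(2018, 8, 9)),
    3: (date(2018, 2, 21), date(2018, 8, 9)),
    4: (date(2018, 7, 26), date(2018, 8, 20)),
    5: (date(2018, 7, 26), date(2018, 8, 20)),
    6: (date(2018, 5, 30), date(2018, 8, 21)),
    7: (date(2018, 5, 7), date(2018, 8, 22)),
    8: (date(2018, 4, 23), date(2018, 8, 25)),
}

# Days between the regression landing and the bug being reported.
lag_days = {
    item: (reported - landed).days
    for item, (landed, reported) in regressions.items()
}

# Days between RC1 and the bug report (0 means reported on RC1 day).
days_after_rc1 = {
    item: (reported - RC1).days
    for item, (_landed, reported) in regressions.items()
}

for item in sorted(regressions):
    print(f"item {item}: found {lag_days[item]} days after landing, "
          f"{days_after_rc1[item]} days after RC1")
```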

If there is anything we could do about this, it would be to move feature
freeze day earlier so we have more time between FF and RC. But I have a
feeling that some of the bugs get found when people take the RC and try
it out.

Based on what I've found here, I think we are fine to stick with using
milestone 2 as the spec freeze date. We might also want to consider
moving feature freeze day earlier to give more time between feature
freeze and RC. This cycle, even though it is a longer cycle, we still
have only 2 weeks between s-3 and rc1 [5].

I'm ambivalent about changing the usual milestone 3 feature freeze date 
because I have a feeling that people tend to try things out once RC is 
released, but maybe I'm wrong on that. What are your thoughts?

Finally, please do jump in and reply if you have info to share or 
questions to ask about the regressions listed above. I'm especially 
interested in getting answers to the questions I posed earlier inline 
about whether there's anything we can do to cover some of the cases with 
our CI.

Cheers,
-melanie


[1] https://etherpad.openstack.org/p/nova-rocky-retrospective
[2] https://etherpad.openstack.org/p/nova-rocky-release-candidate-todo
[3] https://etherpad.openstack.org/p/nova-rocky-rc-regression-analysis
[4] https://docs.google.com/spreadsheets/d/e/2PACX-1vQicKStmnQFcOdnZU56ynJmn8e0__jYsr4FWXs3GrDsDzg1hwHofvJnuSieCH3ExbPngoebmEeY0waH/pubhtml?gid=128173249&single=true
[5] https://wiki.openstack.org/wiki/Nova/Stein_Release_Schedule


