[openstack-dev] [nova] Rocky RC time regression analysis
melanie witt
melwittt at gmail.com
Fri Oct 5 23:59:28 UTC 2018
Hey everyone,
During our Rocky retrospective discussion at the PTG [1], we talked
about the spec freeze deadline (milestone 2, historically it had been
milestone 1) and whether it was related to the hectic, late-breaking
regression-filled RC period we had last cycle. I had an action item
to go through the list of RC time bugs [2] and dig into each one,
examining: when the patch that introduced the bug landed vs when the bug
was reported, why it wasn't caught sooner, and report back so we can
take a look together and determine whether they were related to the spec
freeze deadline.
I used this etherpad to make notes [3], which I will [mostly] copy-paste
here. These are all after RC1 and I'll paste them in chronological order
of when the bug was reported.
Milestone 1 (r-1) was 2018-04-19.
Spec freeze was at milestone 2 (r-2), 2018-06-07.
Feature freeze (FF) was on 2018-07-26.
RC1 was on 2018-08-09.
1) Broken live migration bandwidth minimum => maximum based on neutron
event https://bugs.launchpad.net/nova/+bug/1786346
- Bug was reported on 2018-08-09, the day of RC1
- The patch that caused the regression landed on 2018-03-30
https://review.openstack.org/497457
- Unrelated to a blueprint, the regression was part of a bug fix
- Was found because prometheanfire was doing live migrations and noticed
they seemed to be stuck at 1MiB/s for linuxbridge VMs
- The bug was due to a race, so the gate didn't hit it
- Comment on the regression bug from dansmith: "The few hacked up gate
jobs we used to test this feature at merge time likely didn't notice the
race because the migrations finished before the potential timeout and/or
are on systems so loaded that the neutron event came late enough for us
to win the race repeatedly."
2) Docs for the zvm driver missing
- All zvm driver code changes were merged by 2018-07-17, but the
documentation was overlooked and only noticed near RC time
- Blueprint was approved on 2018-02-12
3) Volume status remains "detaching" after a failure to detach a volume
due to DeviceDetachFailed https://bugs.launchpad.net/nova/+bug/1786318
- Bug was reported on 2018-08-09, the day of RC1
- The change that introduced the regression landed on 2018-02-21
https://review.openstack.org/546423
- Unrelated to a blueprint, the regression was part of a bug fix
- Question: why wasn't this caught earlier?
- Answer: Unit tests were not asserting the call to the roll_detaching
volume API. Coverage has since been added along with the bug fix
https://review.openstack.org/590439
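To illustrate the kind of coverage that was missing, here is a minimal,
self-contained sketch (hypothetical names, not nova's actual code) of a
unit test asserting that the volume API's roll_detaching is called when
a device detach fails:

```python
# Minimal sketch of the missing assertion: when a detach fails, the
# rollback call (roll_detaching) must happen so the volume doesn't get
# stuck in "detaching". Names here are illustrative stand-ins.
from unittest import mock


class DeviceDetachFailed(Exception):
    """Stand-in for the driver exception raised on detach failure."""


def detach_volume(driver, volume_api, context, volume_id):
    # Simplified stand-in for the compute-manager detach path.
    try:
        driver.detach(volume_id)
    except DeviceDetachFailed:
        # The regression effectively dropped this rollback call.
        volume_api.roll_detaching(context, volume_id)
        raise


def test_roll_detaching_called_on_failure():
    driver = mock.Mock()
    driver.detach.side_effect = DeviceDetachFailed()
    volume_api = mock.Mock()
    try:
        detach_volume(driver, volume_api, 'ctxt', 'vol-1')
    except DeviceDetachFailed:
        pass
    # This is the assertion the original unit tests lacked.
    volume_api.roll_detaching.assert_called_once_with('ctxt', 'vol-1')
```

The point is that mocking the volume API without asserting the rollback
call let the regression slip through even with the test passing.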
4) OVB overcloud deploy fails on nova placement errors
https://bugs.launchpad.net/nova/+bug/1787910
- Bug was reported on 2018-08-20
- Change that caused the regression landed on 2018-07-26, FF day
https://review.openstack.org/517921
- Blueprint was approved on 2018-05-16
- Was found because of a failure in the
legacy-periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master
CI job. The ironic-inspector CI upstream also failed because of this, as
noted by dtantsur.
- Question: why did it take nearly a month for the failure to be
noticed? Is there any way we can cover this in our
ironic-tempest-dsvm-ipa-wholedisk-bios-agent_ipmitool-tinyipa job?
5) when live migration fails due to an internal error rollback is not
handled correctly https://bugs.launchpad.net/nova/+bug/1788014
- Bug was reported on 2018-08-20
- The change that caused the regression landed on 2018-07-26, FF day
https://review.openstack.org/434870
- Unrelated to a blueprint, the regression was part of a bug fix
- Was found because sean-k-mooney was doing live migrations and found
that when a live migration failed because of a QEMU internal error, the
VM remained ACTIVE but no longer had network connectivity.
- Question: why wasn't this caught earlier?
- Answer: We would need a live migration job scenario that intentionally
initiates and fails a live migration, then verify network connectivity
after the rollback occurs.
- Question: can we add something like that?
6) nova-manage db online_data_migrations hangs on instances with no host
set https://bugs.launchpad.net/nova/+bug/1788115
- Bug was reported on 2018-08-21
- The patch that introduced the bug landed on 2018-05-30
https://review.openstack.org/567878
- Unrelated to a blueprint, the regression was part of a bug fix
- Question: why wasn't this caught earlier?
- Answer: To hit the bug, you had to have had instances with no host set
(that failed to schedule) in your database during an upgrade. This does
not happen during the grenade job.
- Question: could we add anything to the grenade job that would leave
some instances with no host set to cover cases like this?
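For context on how such a hang can happen, here is a hedged, simplified
sketch (not nova's actual code) of the batched online-migration loop
pattern: if a migration step keeps "finding" rows it never actually
migrates, the caller's loop never sees found == 0 and never terminates.

```python
# Sketch of the online_data_migrations loop pattern and its hang mode.
# migrate_batch(max_count) -> (found, done). The real loop has no
# max_rounds guard; it is added here only so this sketch terminates.
def run_migrations(migrate_batch, max_count=50, max_rounds=10):
    rounds = 0
    while rounds < max_rounds:
        found, done = migrate_batch(max_count)
        if found == 0:
            return True   # nothing left to migrate: loop exits
        rounds += 1
    return False  # would hang: rows keep being found but never done


def buggy_batch(max_count):
    # Instances with no host set are matched by the query ("found") but
    # skipped by the migration ("done" stays 0), so no progress is made.
    return (1, 0)
```

Running run_migrations(buggy_batch) never reaches found == 0, which is
the shape of the hang described in the bug.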
7) release notes erroneously say that nova-consoleauth doesn't have to
run in Rocky https://bugs.launchpad.net/nova/+bug/1788470
- Bug was reported on 2018-08-22
- The patches that conveyed the wrong information for the docs landed on
2018-05-07 https://review.openstack.org/565367
- Blueprint was approved on 2018-03-12
- Question: why wasn't this caught earlier?
- Answer: The patches should have been tested by a devstack change that
runs without the nova-consoleauth service, to verify the system can work
without the service.
- Question: can we add test coverage for that?
- Answer: Yes, it's proposed as a WIP https://review.openstack.org/607070
8) libvirt driver rx_queue_size changes break SR-IOV
https://bugs.launchpad.net/nova/+bug/1789074
- Bug was reported on 2018-08-25
- The change that caused the regression landed on 2018-04-23
https://review.openstack.org/484997
- Blueprint was approved on 2018-03-23
- Was found because moshele tried to create a server with an SR-IOV
interface (PF or VF) with rx_queue_size and tx_queue_size set in
nova.conf, and it failed.
- Question: why wasn't this caught earlier?
- Answer: Exposing the bug required both setting rx/tx queue sizes and
booting a server with an SR-IOV interface. We don't have hardware for
testing SR-IOV in the gate.
- Question: is there any other way to test this via functional tests?
From what I understand, there isn't.
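For reference, the trigger was roughly this kind of configuration. The
[libvirt] rx_queue_size and tx_queue_size options in nova.conf are real;
the values shown here are just an example:

```ini
[libvirt]
# Setting queue sizes here, combined with booting a server that has an
# SR-IOV (PF or VF) interface, exposed the regression.
rx_queue_size = 1024
tx_queue_size = 1024
```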
Based on all of this, I don't believe the spec freeze at milestone 2 was
related to the late-breaking regressions we had around RC time. We
approved 10 additional blueprints between r-1 2018-04-19 and r-2
2018-06-07 [4]. Half of the regressions were unrelated to feature work
and were introduced as part of bug fixes. Of the other half, 3 out of 4
had blueprints approved before r-1. Only one involved a blueprint
approved after r-1.
In a couple of cases, the patch that introduced the regression landed on
feature freeze day, with the bugs being found about a month later, which
was about two weeks after RC1. In most cases, the regression landed
months before the bug was found, because of lack of test coverage.
Based on what I've found here, I think we are fine to stick with
milestone 2 as the spec freeze date. If we could do anything about this,
it would be to move feature freeze day sooner to give more time between
FF and RC: this cycle, even though it is a longer cycle, we still have
only 2 weeks between s-3 and rc1 [5]. That said, I'm ambivalent about
changing the usual milestone 3 feature freeze date because I have a
feeling that many of these bugs only get found once people take the RC
and try it out, but maybe I'm wrong on that. What are your thoughts?
Finally, please do jump in and reply if you have info to share or
questions to ask about the regressions listed above. I'm especially
interested in getting answers to the questions I posed earlier inline
about whether there's anything we can do to cover some of the cases with
our CI.
Cheers,
-melanie
[1] https://etherpad.openstack.org/p/nova-rocky-retrospective
[2] https://etherpad.openstack.org/p/nova-rocky-release-candidate-todo
[3] https://etherpad.openstack.org/p/nova-rocky-rc-regression-analysis
[4]
https://docs.google.com/spreadsheets/d/e/2PACX-1vQicKStmnQFcOdnZU56ynJmn8e0__jYsr4FWXs3GrDsDzg1hwHofvJnuSieCH3ExbPngoebmEeY0waH/pubhtml?gid=128173249&single=true
[5] https://wiki.openstack.org/wiki/Nova/Stein_Release_Schedule