[openstack-dev] [nova] Rocky RC time regression analysis
Eric Fried
openstack at fried.cc
Mon Oct 8 13:27:51 UTC 2018
Mel-
I don't have much of anything useful to add here, but wanted to say
thanks for this thorough analysis. It must have taken a lot of time and
work.
Musings inline.
On 10/05/2018 06:59 PM, melanie witt wrote:
> Hey everyone,
>
> During our Rocky retrospective discussion at the PTG [1], we talked
> about the spec freeze deadline (milestone 2, historically it had been
> milestone 1) and whether or not it was related to the hectic
> late-breaking regression RC time we had last cycle. I had an action item
> to go through the list of RC time bugs [2] and dig into each one,
> examining: when the patch that introduced the bug landed vs when the bug
> was reported, why it wasn't caught sooner, and report back so we can
> take a look together and determine whether they were related to the spec
> freeze deadline.
>
> I used this etherpad to make notes [3], which I will [mostly] copy-paste
> here. These are all after RC1 and I'll paste them in chronological order
> of when the bug was reported.
>
> Milestone 1 (r-1) was 2018-04-19.
> Spec freeze was at milestone 2 (r-2), 2018-06-07.
> Feature freeze (FF) was on 2018-07-26.
> RC1 was on 2018-08-09.
>
> 1) Broken live migration bandwidth minimum => maximum based on neutron
> event https://bugs.launchpad.net/nova/+bug/1786346
>
> - Bug was reported on 2018-08-09, the day of RC1
> - The patch that caused the regression landed on 2018-03-30
> https://review.openstack.org/497457
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Was found because prometheanfire was doing live migrations and noticed
> they seemed to be stuck at 1MiB/s for linuxbridge VMs
> - The bug was due to a race, so the gate didn't hit it
> - Comment on the regression bug from dansmith: "The few hacked up gate
> jobs we used to test this feature at merge time likely didn't notice the
> race because the migrations finished before the potential timeout and/or
> are on systems so loaded that the neutron event came late enough for us
> to win the race repeatedly."
>
> 2) Docs for the zvm driver missing
>
> - All zvm driver code changes were merged by 2018-07-17, but the
> documentation was overlooked and only noticed near RC time
> - Blueprint was approved on 2018-02-12
>
> 3) Volume status remains "detaching" after a failure to detach a volume
> due to DeviceDetachFailed https://bugs.launchpad.net/nova/+bug/1786318
>
> - Bug was reported on 2018-08-09, the day of RC1
> - The change that introduced the regression landed on 2018-02-21
> https://review.openstack.org/546423
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Question: why wasn't this caught earlier?
> - Answer: Unit tests were not asserting the call to the roll_detaching
> volume API. Coverage has since been added along with the bug fix
> https://review.openstack.org/590439
>
> 4) OVB overcloud deploy fails on nova placement errors
> https://bugs.launchpad.net/nova/+bug/1787910
>
> - Bug was reported on 2018-08-20
> - Change that caused the regression landed on 2018-07-26, FF day
> https://review.openstack.org/517921
> - Blueprint was approved on 2018-05-16
> - Was found because of a failure in the
> legacy-periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master
> CI job. The ironic-inspector CI upstream also failed because of this, as
> noted by dtantsur.
> - Question: why did it take nearly a month for the failure to be
> noticed? Is there any way we can cover this in our
> ironic-tempest-dsvm-ipa-wholedisk-bios-agent_ipmitool-tinyipa job?
>
> 5) when live migration fails due to a internal error rollback is not
> handled correctly https://bugs.launchpad.net/nova/+bug/1788014
>
> - Bug was reported on 2018-08-20
> - The change that caused the regression landed on 2018-07-26, FF day
> https://review.openstack.org/434870
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Was found because sean-k-mooney was doing live migrations and found
> that when a LM failed because of a QEMU internal error, the VM remained
> ACTIVE but the VM no longer had network connectivity.
> - Question: why wasn't this caught earlier?
> - Answer: We would need a live migration job scenario that intentionally
> initiates and fails a live migration, then verify network connectivity
> after the rollback occurs.
> - Question: can we add something like that?
>
> 6) nova-manage db online_data_migrations hangs on instances with no host
> set https://bugs.launchpad.net/nova/+bug/1788115
>
> - Bug was reported on 2018-08-21
> - The patch that introduced the bug landed on 2018-05-30
> https://review.openstack.org/567878
> - Unrelated to a blueprint, the regression was part of a bug fix
> - Question: why wasn't this caught earlier?
> - Answer: To hit the bug, you had to have had instances with no host set
> (that failed to schedule) in your database during an upgrade. This does
> not happen during the grenade job
> - Question: could we add anything to the grenade job that would leave
> some instances with no host set to cover cases like this?
>
> 7) release notes erroneously say that nova-consoleauth doesn't have to
> run in Rocky https://bugs.launchpad.net/nova/+bug/1788470
>
> - Bug was reported on 2018-08-22
> - The patches that conveyed the wrong information for the docs landed on
> 2018-05-07 https://review.openstack.org/565367
> - Blueprint was approved on 2018-03-12
> - Question: why wasn't this caught earlier?
> - Answer: The patches should have been tested by a devstack change that
> runs without the nova-consoleauth service, to verify the system can work
> without the service.
> - Question: can we add test coverage for that?
> - Answer: Yes, it's proposed as a WIP https://review.openstack.org/607070
>
> 8) libvirt driver rx_queue_size changes break SR-IOV
> https://bugs.launchpad.net/nova/+bug/1789074
>
> - Bug was reported on 2018-08-25
> - The change that caused the regression landed on 2018-04-23
> https://review.openstack.org/484997
> - Blueprint was approved on 2018-03-23
> - Was found because moshele tried to create a server with an SRIOV
> interface (PF or VF) with rx_queue_size and tx_queue_size set in
> nova.conf and it failed.
> - Question: why wasn't this caught earlier?
> - Answer: Exposing the bug required both setting rx/tx queue sizes and
> booting a server with a SRIOV interface. We don't have hardware for
> testing SRIOV in the gate.
> - Question: is there any other way to test this via functional tests?
> From what I understand, there isn't.
>
> Based on all of this, I don't believe the spec freeze at milestone 2 was
> related to the late-breaking regressions we had around RC time.
Agree.
> We
> approved 10 additional blueprints between r-1 2018-04-19 and r-2
> 2018-06-07 [4]. Half of the regressions were unrelated to feature work
> and were introduced as part of bug fixes. Of the other half, 3 out of 4
> had blueprints approved before r-1. Only one involved a blueprint
> approved after r-1.
>
> In a couple of cases, the patch that introduced the regression landed on
> feature freeze day, with the bugs being found about a month later, which
> was about two weeks after RC1. In most cases, the regression landed
> months before the bug was found, because of lack of test coverage.
>
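As a rough sanity check on the counts and lags quoted above, here is a
minimal Python sketch that tallies them. Only the dates come from the
analysis; the record layout and the helper names (Regression, lag_days,
days_after_rc1) are made up for illustration. Item 2 (zvm docs) has no
landing or report date in the notes, so it only counts in the blueprint
tally.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Cycle dates quoted in the analysis.
R1 = date(2018, 4, 19)   # milestone 1
FF = date(2018, 7, 26)   # feature freeze
RC1 = date(2018, 8, 9)


@dataclass(frozen=True)
class Regression:
    item: int                    # numbering used in the analysis
    landed: Optional[date]       # when the offending change merged
    reported: Optional[date]     # when the bug was reported
    bp_approved: Optional[date]  # None means "part of a bug fix"


REGRESSIONS = [
    Regression(1, date(2018, 3, 30), date(2018, 8, 9), None),
    Regression(2, None, None, date(2018, 2, 12)),  # docs gap, no bug dates
    Regression(3, date(2018, 2, 21), date(2018, 8, 9), None),
    Regression(4, date(2018, 7, 26), date(2018, 8, 20), date(2018, 5, 16)),
    Regression(5, date(2018, 7, 26), date(2018, 8, 20), None),
    Regression(6, date(2018, 5, 30), date(2018, 8, 21), None),
    Regression(7, date(2018, 5, 7), date(2018, 8, 22), date(2018, 3, 12)),
    Regression(8, date(2018, 4, 23), date(2018, 8, 25), date(2018, 3, 23)),
]


def lag_days(r: Regression) -> int:
    """Days between the change landing and the bug being reported."""
    return (r.reported - r.landed).days


def days_after_rc1(r: Regression) -> int:
    """Days between RC1 and the bug report (0 means RC1 day itself)."""
    return (r.reported - RC1).days


# Split by origin: bug fix vs. blueprint approved before/after r-1.
bugfix = [r for r in REGRESSIONS if r.bp_approved is None]
bp_before_r1 = [r for r in REGRESSIONS if r.bp_approved and r.bp_approved < R1]
bp_after_r1 = [r for r in REGRESSIONS if r.bp_approved and r.bp_approved >= R1]
landed_on_ff = [r for r in REGRESSIONS if r.landed == FF]

for r in REGRESSIONS:
    if r.landed and r.reported:
        print(f"item {r.item}: {lag_days(r)} days to report, "
              f"{days_after_rc1(r)} days after RC1")

print(f"bug fixes: {len(bugfix)}, blueprint before r-1: {len(bp_before_r1)}, "
      f"blueprint after r-1: {len(bp_after_r1)}, "
      f"landed on FF day: {len(landed_on_ff)}")
```

With the dates as given, the two changes that landed on FF day (items 4
and 5) were reported 25 days after landing and 11 days after RC1.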
> It seems like if we could do anything about this, it would be to move
> feature freeze day sooner so we have more time between FF and RC.
Yes, I think this would be a thing worth trying.
Or perhaps...
> But I
> have a feeling that some of the bugs get found when people take the RC
> and try it out.
...perhaps moving both FF and RC1 earlier relative to the release date.
Not sure it's worth the pain of having a longer RC window.
> Based on what I've found here, I think we are fine to stick with using
> milestone 2 as the spec freeze date.
Agree.
> And we might want to consider
> moving feature freeze day sooner to give more time between feature
> freeze and RC. This cycle, even though it is a longer cycle, we still
> have only 2 weeks between s-3 and rc1 [5].
>
> I'm ambivalent about changing the usual milestone 3 feature freeze date
> because I have a feeling that people tend to try things out once RC is
> released, but maybe I'm wrong on that. What are your thoughts?
>
> Finally, please do jump in and reply if you have info to share or
Random thoughts, in no particular order:
- The number of regressions due to bug fixes may indicate that we were
rushing, and/or weren't reviewing as deeply/thoroughly as we should have
been. That may very well be a result of the code being complex,
convoluted, or otherwise incomprehensible to almost everyone. If so, is
it worth taking some time away from feature work to catch up on tech
debt, refactoring, simplification, etc.?
- The bugfix regressions could also be due to improper prioritization.
Do we have a sense of whether the cure was worse than the disease in any
of these cases? Would we have been better off leaving the original bugs
alone? Can we use that to inform our triaging process, especially late
in the cycle?
- We may want to consider the possibility that this was an anomaly, on
the edge of the bell curve of normal, not related to anything we did or
didn't do. By the same token, if we change something and Stein goes
better, can we really pat ourselves on the back and say we made the
difference by whatever we changed? I'm not suggesting we take no action,
only that if we do something speculative like moving dates around, I'm
not sure we can really measure success or failure. OTOH, if we do
something that'll be beneficial regardless (like
refactoring/simplification) then we win even if the RC storm is repeated.
> questions to ask about the regressions listed above. I'm especially
> interested in getting answers to the questions I posed earlier inline
> about whether there's anything we can do to cover some of the cases with
> our CI.
>
> Cheers,
> -melanie
>
>
> [1] https://etherpad.openstack.org/p/nova-rocky-retrospective
> [2] https://etherpad.openstack.org/p/nova-rocky-release-candidate-todo
> [3] https://etherpad.openstack.org/p/nova-rocky-rc-regression-analysis
> [4]
> https://docs.google.com/spreadsheets/d/e/2PACX-1vQicKStmnQFcOdnZU56ynJmn8e0__jYsr4FWXs3GrDsDzg1hwHofvJnuSieCH3ExbPngoebmEeY0waH/pubhtml?gid=128173249&single=true
>
> [5] https://wiki.openstack.org/wiki/Nova/Stein_Release_Schedule
>