[openstack-dev] Unwedging the gate

Joe Gordon joe.gordon0 at gmail.com
Mon Nov 25 06:25:55 UTC 2013


On Sun, Nov 24, 2013 at 9:58 PM, Robert Collins
<robertc at robertcollins.net>wrote:

> I have a proposal - I think we should mark all recheck bugs critical,
> and the respective project PTLs should actively shop around amongst
> their contributors to get them fixed before other work: we should
> drive the known set of nondeterministic issues down to 0 and keep it
> there.
>


Yes! In fact we are already working towards that. See
http://lists.openstack.org/pipermail/openstack-dev/2013-November/020048.html


>
> -Rob
>
> On 25 November 2013 18:00, Joe Gordon <joe.gordon0 at gmail.com> wrote:
> > Hi All,
> >
> > TL;DR Last week the gate got wedged on nondeterministic failures.
> > Unwedging the gate required drastic action to fix bugs.
> >
> > Starting on November 15th, gate jobs became progressively less stable,
> > with not enough attention given to fixing the issues, until the gate was
> > almost fully wedged. No single bug caused this; it was a collection of
> > bugs that got us here. The gate protects us from code that fails 100% of
> > the time, but a patch that fails 10% of the time can slip through. Add a
> > few of these bugs together and the gate becomes fully wedged, and fixing
> > it without circumventing the gate (something we never want to do) is very
> > hard. It took just two new nondeterministic bugs to take us from a gate
> > that mostly worked to a gate that was almost fully wedged. Last week we
> > found out Jeremy Stanley (fungi) was right when he said,
> > "nondeterministic failures breed more nondeterministic failures, because
> > people are so used to having to reverify their patches to get them to
> > merge that they are doing so even when it's their patch which is
> > introducing a nondeterministic bug."
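> >
> > To get a rough feel for how quickly a handful of "small" nondeterministic
> > bugs compound, here is a back-of-the-envelope sketch in Python (the
> > failure rates and queue depth are illustrative assumptions, not measured
> > values):
> >
> >     # Each independent bug that fails ~10% of the time multiplies the
> >     # pass rate of a single gate run by 0.9.
> >     bug_pass_rates = [0.9, 0.9, 0.9]   # three hypothetical 10% bugs
> >
> >     run_pass = 1.0
> >     for rate in bug_pass_rates:
> >         run_pass *= rate
> >     print("single gate run pass rate: %.2f" % run_pass)      # ~0.73
> >
> >     # A change deep in the gate queue is retested whenever something
> >     # ahead of it fails, so the chance that a window of N changes all
> >     # pass in one go drops off quickly.
> >     queue_depth = 10
> >     print("chance a %d-deep window passes cleanly: %.3f"
> >           % (queue_depth, run_pass ** queue_depth))          # ~0.04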
> >
> > Side note: this is not the first time we have wedged the gate. The first
> > time was around September 26th, right when we were cutting Havana release
> > candidates. In response we wrote elastic-recheck
> > (http://status.openstack.org/elastic-recheck/) to better track which bugs
> > we were seeing.
> >
> > Gate stability according to Graphite:
> > http://paste.openstack.org/show/53765/ (the Graphite URLs are huge
> > because they encode entire queries, so they are included as a pastebin).
> >
> > After sending out an email asking for help fixing the top known gate bugs
> > (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html),
> > we had a few possible fixes. But with the gate wedged, the merge queue
> > was 145 patches long and could take days to process: in the worst case,
> > with none of the patches merging, it would take about an hour per patch,
> > or roughly six days for the whole queue. So on November 20th we asked for
> > a freeze on any non-critical bug fixes
> > (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019941.html),
> > kicked everything out of the merge queue, and put our possible bug fixes
> > at the front. Even with these drastic measures it still took 26 hours to
> > finally unwedge the gate. In those 26 hours we got the check queue
> > failure rate (always higher than the gate failure rate) down from around
> > 87% to below 10%. And we still have many more bugs to track down and fix
> > in order to improve gate stability.
> >
> >
> > Eight major bug fixes later, we have the gate back to a reasonable
> > failure rate. But how did things get so bad? I'm glad you asked; here is
> > a blow-by-blow account.
> >
> > The gate has not been completely stable for a very long time, and it
> > only took two new bugs to wedge it. Starting with the list of bugs we
> > identified via elastic-recheck, we fixed four bugs that had already been
> > in the gate for a few weeks.
> >
> >
> > https://bugs.launchpad.net/bugs/1224001 "test_network_basic_ops fails
> > waiting for network to become available"
> >
> > https://review.openstack.org/57290 was the fix, which depended on
> > https://review.openstack.org/53188 and https://review.openstack.org/57475.
> >
> > This fixed a race condition where the IP address from DHCP was not
> > received by the VM in time. Minimize polling is now enabled by default on
> > the agent, which should consistently reduce the time needed to configure
> > an interface on br-int.
> >
> > https://bugs.launchpad.net/bugs/1252514 "Swift returning errors when
> > setup using devstack"
> >
> > Fix https://review.openstack.org/#/c/57373/
> >
> > A few swift-related problems were sorted out as well. Most had to do with
> > tuning swift properly for its use as a glance backend in the gate and
> > ensuring that timeout values were appropriate for the devstack test
> > slaves: in resource-constrained environments the default swift timeouts
> > could be tripped frequently, even though the logs showed the requests
> > would have finished successfully given enough time. Swift also had a race
> > condition in how it constructed its sqlite3 files for containers and
> > accounts, where it was not retrying operations when the database was
> > locked.
> >
> > https://bugs.launchpad.net/swift/+bug/1243973 "Simultaneous PUT requests
> > for the same account..."
> >
> > Fix https://review.openstack.org/#/c/57019/
> >
> > This one was not on our original list of bugs, but while in bug-fixing
> > mode we got it fixed as well.
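> >
> > For context, the usual way to cope with sqlite's "database is locked"
> > errors is to retry the operation for a bounded amount of time instead of
> > failing immediately. A minimal sketch of that pattern (illustrative only,
> > not the actual swift code):
> >
> >     import sqlite3
> >     import time
> >
> >     def execute_with_retry(conn, sql, params=(), timeout=10.0):
> >         """Retry a statement while another writer holds the sqlite lock."""
> >         deadline = time.time() + timeout
> >         while True:
> >             try:
> >                 return conn.execute(sql, params)
> >             except sqlite3.OperationalError as e:
> >                 # Only retry lock contention; re-raise anything else,
> >                 # and give up once the deadline has passed.
> >                 if "locked" not in str(e) or time.time() > deadline:
> >                     raise
> >                 time.sleep(0.1)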
> >
> > https://bugs.launchpad.net/bugs/1251784 "nova+neutron scheduling error:
> > Connection to neutron failed: Maximum attempts reached"
> >
> > Fix https://review.openstack.org/#/c/57509/
> >
> > Uncovered on the mailing list
> > (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019906.html)
> >
> > Nova had a very old version of oslo's local.py, which is used for
> > managing references to local variables in coroutines. The old version had
> > a pretty significant bug that basically meant non-weak references to
> > variables were not managed properly. This fix has made the nova-neutron
> > interactions much more reliable.
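> >
> > As a rough sketch of the kind of helper local.py provides (using the
> > standard library's threading.local here; the real code is
> > greenthread-aware, and everything below is illustrative rather than the
> > actual oslo code):
> >
> >     import threading
> >     import weakref
> >
> >     class WeakLocal(threading.local):
> >         """Per-thread/coroutine storage that only keeps weak references,
> >         so stored objects can be garbage collected once the code that
> >         set them lets go of them."""
> >
> >         def __getattribute__(self, attr):
> >             ref = super(WeakLocal, self).__getattribute__(attr)
> >             # What is stored is a weak reference; dereference it on the
> >             # way out so callers get the original object (or None if it
> >             # has already been collected).
> >             return ref() if isinstance(ref, weakref.ref) else ref
> >
> >         def __setattr__(self, attr, value):
> >             super(WeakLocal, self).__setattr__(attr, weakref.ref(value))
> >
> >     weak_store = WeakLocal()
> >     # The strong store simply holds normal references; the old nova copy
> >     # of local.py mismanaged these non-weak references, which is what
> >     # the fix above addressed.
> >     strong_store = threading.local()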
> >
> > This fixed the number 2 bug on our list of top gate bugs
> > (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html)!
> >
> >
> > In addition to fixing those four old bugs, we fixed two new bugs that
> > were introduced or exposed this week.
> >
> > https://bugs.launchpad.net/bugs/1251920 "Tempest failures due to failure
> > to return console logs from an instance"
> >
> > Introduced by: https://review.openstack.org/#/c/54363/ [Tempest]
> >
> > Fix (workaround): https://review.openstack.org/#/c/57193/
> >
> > After many false starts and banging our heads against the wall, we
> > identified a change to tempest, https://review.openstack.org/54363, that
> > added a new test around the same time as bug 1251920 became a problem.
> > Forcing tempest to skip this test had a very high incidence of success
> > without any 1251920-related failures. As a result we are working around
> > this bug by skipping that test until it can be run without major impact
> > to the gate.
> >
> > The change that introduced this problematic test had to go through the
> > gate four times before it would merge, though only one of the three
> > failed attempts appears to have triggered 1251920. Or as Jeremy Stanley
> > (fungi) said, "nondeterministic failures breed more nondeterministic
> > failures, because people are so used to having to reverify their patches
> > to get them to merge that they are doing so even when it's their patch
> > which is introducing a nondeterministic bug."
> >
> > https://bugs.launchpad.net/bugs/1252170 "tempest.scenario
> > test_resize_server_confirm failed in grenade"
> >
> > Fix https://review.openstack.org/#/c/57357/
> >
> > Fix https://review.openstack.org/#/c/57572/
> >
> > First, we started running post-Grenade upgrade tests in parallel (to fix
> > another bug), which would normally be fine, but Grenade wasn't
> > configuring the small flavors typically used by tempest, so the devstack
> > Jenkins slaves could run out of memory when starting many larger VMs in
> > parallel. To fix this, devstack's lib/tempest has been updated to create
> > the flavors only if they don't already exist, and Grenade now allows
> > tempest to use its default instance flavors.
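> >
> > The actual change is in devstack's shell code, but the "create only if
> > missing" idea looks roughly like this illustrative Python sketch using
> > novaclient (the credentials, flavor names, and sizes below are
> > placeholder assumptions, not the exact values devstack uses):
> >
> >     from novaclient import client
> >
> >     # Placeholder credentials/endpoint; devstack wires these up itself.
> >     USERNAME, PASSWORD = "admin", "secret"
> >     TENANT, AUTH_URL = "demo", "http://127.0.0.1:5000/v2.0"
> >     nova = client.Client("2", USERNAME, PASSWORD, TENANT, AUTH_URL)
> >
> >     # Tiny flavors keep parallel tempest runs from exhausting memory on
> >     # a resource-constrained test slave.
> >     wanted = {
> >         "m1.nano": dict(ram=64, vcpus=1, disk=0),
> >         "m1.micro": dict(ram=128, vcpus=1, disk=0),
> >     }
> >
> >     existing = set(f.name for f in nova.flavors.list())
> >     for name, spec in wanted.items():
> >         if name not in existing:  # only create flavors that are missing
> >             nova.flavors.create(name, spec["ram"], spec["vcpus"],
> >                                 spec["disk"])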
> >
> >
> >
> > Now that we have the gate back in working order, we are working on the
> > next steps to prevent this from happening again. The two most immediate
> > changes are:
> >
> > Doing a better job of triaging gate bugs
> > (http://lists.openstack.org/pipermail/openstack-dev/2013-November/020048.html).
> >
> > In the next few days we will remove 'reverify no bug' (although you will
> > still be able to run 'reverify bug x').
> >
> >
> > Best,
> > Joe Gordon
> > Clark Boylan
> >
>
>
>
> --
> Robert Collins <rbtcollins at hp.com>
> Distinguished Technologist
> HP Converged Cloud
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>