[openstack-dev] Unwedging the gate

Joe Gordon joe.gordon0 at gmail.com
Mon Nov 25 05:00:58 UTC 2013


Hi All,

TL;DR: Last week the gate got wedged on nondeterministic failures. Unwedging
it required drastic action to land the bug fixes.

Starting on November 15th, gate jobs became progressively less stable, with
not enough attention given to fixing the issues, until we got to the point
where the gate was almost fully wedged.  No single bug caused this; it was a
collection of bugs that got us here. The gate protects us from code that
fails 100% of the time, but a patch that fails 10% of the time can slip
through.  Add a few of these bugs together and the gate ends up fully
wedged, and fixing it without circumventing the gate (something we never
want to do) is very hard.  It took just two new nondeterministic bugs to
take us from a gate that mostly worked to a gate that was almost fully
wedged.  Last week we found out Jeremy Stanley (fungi) was right when he
said, "nondeterministic failures breed more nondeterministic failures,
because people are so used to having to reverify their patches to get them
to merge that they are doing so even when it's their patch which is
introducing a nondeterministic bug."
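
To put rough numbers on how quickly a few "small" bugs add up, here is a
back-of-the-envelope sketch (the failure rates and queue depth below are
made up for illustration, not measured from our jobs):

    # Illustrative arithmetic only: a few bugs that each fail "only" 10% of
    # the time compound badly once gate runs depend on each other.
    flaky_bug_rates = [0.10, 0.10, 0.10]   # hypothetical per-run failure rates

    single_run_pass = 1.0
    for rate in flaky_bug_rates:
        single_run_pass *= (1.0 - rate)
    print("one gate run passes: ~%.0f%%" % (single_run_pass * 100))   # ~73%

    # A patch deep in the merge queue is retested whenever a patch ahead of
    # it fails, so very roughly the whole column of runs ahead of it has to
    # pass too before it merges.
    queue_depth = 20
    print("a full column of %d runs passes: ~%.2f%%"
          % (queue_depth + 1, single_run_pass ** (queue_depth + 1) * 100))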

Side note: This is not the first time we have wedged the gate; the first
time was around September 26th, right when we were cutting Havana release
candidates.  In response we wrote elastic-recheck (
http://status.openstack.org/elastic-recheck/) to better track which bugs we
were seeing.
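
The core idea behind elastic-recheck is simply "give each known gate bug a
log-search signature and count how often failed runs match it." A minimal
sketch of that idea, assuming a hypothetical Elasticsearch endpoint and an
example query string (this is not elastic-recheck's actual code or
configuration):

    import json
    import urllib2  # 2013-era stdlib HTTP client

    ES_URL = "http://logstash.example.org/elasticsearch/_search"  # hypothetical

    # One bug, one log signature (illustrative query text).
    bug_signature = {
        "query": {
            "query_string": {
                "query": ('message:"Connection to neutron failed: Maximum '
                          'attempts reached" AND filename:"logs/screen-n-cpu.txt"')
            }
        },
        "size": 0,  # we only need the hit count, not the documents
    }

    request = urllib2.Request(ES_URL, json.dumps(bug_signature),
                              {"Content-Type": "application/json"})
    hits = json.load(urllib2.urlopen(request))["hits"]["total"]
    print("failed runs matching this bug's signature: %d" % hits)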

Gate stability according to Graphite:
http://paste.openstack.org/show/53765/ (the URLs are huge because they
encode entire queries, so they are included as a pastebin).
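
The failure-rate graphs boil down to "FAILURE events as a percentage of all
completed job events." A sketch of how such a Graphite query might be built
(the metric path below is an assumption; the real dashboards encode much
longer target expressions, hence the pastebin):

    import urllib

    GRAPHITE = "http://graphite.openstack.org/render/"
    # Assumed statsd metric path for a single gate job; real names may differ.
    job = "stats_counts.zuul.pipeline.gate.job.gate-tempest-devstack-vm-full"

    params = {
        "from": "-7days",
        "format": "json",
        "target": "asPercent(sumSeries(%s.FAILURE),"
                  " sumSeries(%s.{SUCCESS,FAILURE}))" % (job, job),
    }
    print(GRAPHITE + "?" + urllib.urlencode(params))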

After sending out an email asking for help fixing the top known gate bugs (
http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html),
we had a few possible fixes. But with the gate wedged, the merge queue was
145 patches long and could take days to be processed; in the worst case,
with none of the patches merging, it would take about one hour per patch.
So on November 20th we asked for a freeze on any non-critical bug fixes (
http://lists.openstack.org/pipermail/openstack-dev/2013-November/019941.html
), kicked everything out of the merge queue, and put our possible bug fixes
at the front. Even with these drastic measures it still took 26 hours to
finally unwedge the gate. In those 26 hours we got the check queue failure
rate (always higher than the gate failure rate) down from around 87% to
below 10%. And we still have many more bugs to track down and fix in order
to improve gate stability.
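
For a sense of scale, the worst-case math on that queue looked roughly like
this (numbers taken from the paragraph above):

    queue_length = 145        # patches sitting in the merge queue
    hours_per_patch = 1.0     # worst case: nothing merges and every failure
                              # resets the queue, costing about an hour each

    worst_case_hours = queue_length * hours_per_patch
    print("worst-case drain time: ~%d hours (~%.1f days)"
          % (worst_case_hours, worst_case_hours / 24.0))   # ~145 hours, ~6 days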


Eight major bug fixes later, we have the gate back to a reasonable failure
rate. But how did things get so bad? I'm glad you asked; here is a
blow-by-blow account.

The gate has not been completely stable for a very long time, and it only
took two new bugs to wedge it. Starting with the list of bugs we identified
via elastic-recheck, we fixed four bugs that had been in the gate for a few
weeks already.



   -  https://bugs.launchpad.net/bugs/1224001 "test_network_basic_ops fails
   waiting for network to become available"


   - https://review.openstack.org/57290 was the fix, which depended on
   https://review.openstack.org/53188 and https://review.openstack.org/57475.


   - This fixed a race condition where the IP address from DHCP was not
   received by the VM in time. Minimize polling on the agent now defaults to
   True, which should reduce the time needed to configure an interface on
   br-int consistently.


   - https://bugs.launchpad.net/bugs/1252514 "Swift returning errors when
   setup using devstack"


   - Fix https://review.openstack.org/#/c/57373/


   - There were a few swift-related problems that were sorted out as well.
   Most had to do with tuning swift properly for its use as a glance backend
   in the gate, ensuring that timeout values were appropriate for the
   devstack test slaves (in resource-constrained environments the default
   swift timeouts could be tripped frequently; logs showed the requests
   would have finished successfully given enough time). Swift also had a
   race condition in how it constructed its sqlite3 files for containers and
   accounts, where it was not retrying operations when the database was
   locked (a generic sketch of that retry pattern follows this list).


   - https://bugs.launchpad.net/swift/+bug/1243973 "Simultaneous PUT
   requests for the same account..."


   - Fix https://review.openstack.org/#/c/57019/


   - This was not on our original list of bugs, but while we were in
   bug-fix mode we got this one fixed as well.


   - https://bugs.launchpad.net/bugs/1251784 "nova+neutron scheduling
   error: Connection to neutron failed: Maximum attempts reached"


   - Fix https://review.openstack.org/#/c/57509/


   - Uncovered on the mailing list (
   http://lists.openstack.org/pipermail/openstack-dev/2013-November/019906.html)


   - Nova had a very old version of oslo's local.py, which is used for
   managing references to local variables in coroutines. The old version had
   a pretty significant bug that meant non-weak references to variables were
   not managed properly. This fix has made the nova-neutron interactions
   much more reliable.


   - This fixed the number 2 bug on our list of top gate bugs (
   http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html
    )!
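
As mentioned in the swift item above, the sqlite race boiled down to not
retrying when another writer held the lock. A generic sketch of that retry
pattern (illustrative only, not Swift's actual code):

    import sqlite3
    import time

    def execute_with_retry(db_path, statement, params=(), attempts=5, delay=0.1):
        """Run one statement, retrying briefly if the database is locked."""
        for attempt in range(attempts):
            try:
                conn = sqlite3.connect(db_path)
                try:
                    conn.execute(statement, params)
                    conn.commit()
                    return
                finally:
                    conn.close()
            except sqlite3.OperationalError as e:
                # Another writer holds the lock; back off and try again.
                if "database is locked" not in str(e) or attempt == attempts - 1:
                    raise
                time.sleep(delay * (attempt + 1))

    execute_with_retry("/tmp/example.db",
                       "CREATE TABLE IF NOT EXISTS stats (puts INTEGER)")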


In addition to fixing four old bugs, we fixed two new bugs that were
introduced or exposed this week.


   - https://bugs.launchpad.net/bugs/1251920 "Tempest failures due to
   failure to return console logs from an instance"


   - Introduced by: https://review.openstack.org/#/c/54363/ [Tempest]


   - Fix (workaround): https://review.openstack.org/#/c/57193/


   - After many false starts and banging our heads against the wall, we
   identified a change to tempest, https://review.openstack.org/54363, that
   added a new test around the same time that bug 1251920 became a problem.
   Forcing tempest to skip this test had a very high incidence of success
   without any 1251920-related failures. As a result we are working around
   this bug by skipping that test until it can be run without major impact
   to the gate.


   - The change that introduced this problematic test had to go through the
   gate four times before it would merge, though only one of the three
   failed attempts appears to have triggered 1251920.  Or, as Jeremy Stanley
   (fungi) said, "nondeterministic failures breed more nondeterministic
   failures, because people are so used to having to reverify their patches
   to get them to merge that they are doing so even when it's their patch
   which is introducing a nondeterministic bug."


   - https://bugs.launchpad.net/bugs/1252170 "tempest.scenario
   test_resize_server_confirm failed in grenade"


   - Fix https://review.openstack.org/#/c/57357/


   - Fix https://review.openstack.org/#/c/57572/


   - First we started running post-Grenade upgrade tests in parallel (to
   fix another bug), which would normally be fine, but Grenade wasn't
   configuring the small flavors typically used by tempest, so it was
   possible for the devstack Jenkins slaves to run out of memory when
   starting many larger VMs in parallel. To fix this, devstack's lib/tempest
   has been updated to create the flavors only if they don't exist (a rough
   sketch of that check follows this list), and Grenade now allows tempest
   to use its default instance flavors.
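
The devstack side of that fix is essentially "create the flavor only if it
isn't already there." A rough Python equivalent of that check (devstack does
this in shell; the client setup and flavor values below are illustrative
assumptions, not the exact devstack code):

    from novaclient.v1_1 import client as nova_client  # era-appropriate client

    # Assumed admin credentials/endpoint, for illustration only.
    nova = nova_client.Client("admin", "secret", "admin",
                              "http://127.0.0.1:5000/v2.0/")

    def ensure_flavor(name, ram_mb, vcpus, disk_gb, flavor_id):
        """Create the flavor only if no flavor with that name exists yet."""
        if name not in [f.name for f in nova.flavors.list()]:
            nova.flavors.create(name, ram_mb, vcpus, disk_gb, flavorid=flavor_id)

    # Small flavors of the sort tempest expects by default.
    ensure_flavor("m1.nano", 64, 1, 0, 42)
    ensure_flavor("m1.micro", 128, 1, 0, 84)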



Now that we have the gate back in working order, we are working on the next
steps to prevent this from happening again.  The two most immediate changes
are:

   - Doing a better job of triaging gate bugs  (
   http://lists.openstack.org/pipermail/openstack-dev/2013-November/020048.html
    ).


   - In the next few days we will remove 'reverify no bug' (although you
   will still be able to run 'reverify bug x').


Best,
Joe Gordon
Clark Boylan