[openstack-dev] Unwedging the gate

Joshua Harlow harlowja at yahoo-inc.com
Mon Nov 25 17:33:59 UTC 2013


+2

Sent from my really tiny device...

> On Nov 25, 2013, at 5:02 AM, "Davanum Srinivas" <davanum at gmail.com> wrote:
> 
> Many thanks to everyone who helped with the many fixes. Kudos to
> Joe/Clark for spear heading the effort!
> 
> -- dims
> 
>> On Mon, Nov 25, 2013 at 12:00 AM, Joe Gordon <joe.gordon0 at gmail.com> wrote:
>> Hi All,
>> 
>> TL;DR Last week the gate got wedged on nondeterministic failures. Unwedging
>> the gate required drastic actions to fix bugs.
>> 
>> Starting on November 15th, gate jobs got progressively less stable, with not
>> enough attention given to fixing the issues, until the gate was almost fully
>> wedged.  No one bug caused this; it was a collection of bugs that got us
>> here. The gate protects us from code that fails 100% of the time, but a
>> patch that fails 10% of the time can slip through.  Add a few of these bugs
>> together and the gate reaches a point where it is fully wedged, and fixing
>> it without circumventing the gate (something we never want to do) is very
>> hard.  It took just two new nondeterministic bugs to take us from a gate
>> that mostly worked to a gate that was almost fully wedged.  Last week we
>> found out Jeremy Stanley (fungi) was right when he said, "nondeterministic
>> failures breed more nondeterministic failures, because people are so used to
>> having to reverify their patches to get them to merge that they are doing so
>> even when it's their patch which is introducing a nondeterministic bug."
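>> 
>> To put rough numbers on that compounding (the 10% figure is from above; the
>> bug counts are illustrative assumptions, not measurements): with several
>> independent flaky bugs each tripping a given run 10% of the time, the share
>> of fully passing gate runs falls off quickly.
>> 
>>     failure_rate = 0.10
>>     for flaky_bugs in (1, 2, 5, 10):
>>         p_pass = (1 - failure_rate) ** flaky_bugs
>>         print("%2d flaky bugs -> %2.0f%% of runs pass"
>>               % (flaky_bugs, 100 * p_pass))
>>     # 1 -> 90%, 2 -> 81%, 5 -> 59%, 10 -> 35%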
>> 
>> Side note: this is not the first time the gate has been wedged; the first
>> time was around September 26th, right when we were cutting Havana release
>> candidates. In response we wrote elastic-recheck
>> (http://status.openstack.org/elastic-recheck/) to better track which bugs we
>> were seeing.
>> 
>> Gate stability according to Graphite: http://paste.openstack.org/show/53765/
>> (the links are huge because they encode entire queries, so they are included
>> as a pastebin).
>> 
>> After sending out an email to ask for help fixing the top known gate bugs
>> (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html),
>> we had a few possible fixes. But with the gate wedged, the merge queue was
>> 145 patches long and could take days to process. In the worst case, with
>> none of the patches merging, it would take about 1 hour per patch, roughly
>> 145 hours, or about six days. So on November 20th we asked for a freeze on
>> any non-critical bug fixes
>> (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019941.html),
>> kicked everything out of the merge queue, and put our possible bug fixes at
>> the front. Even with these drastic measures it still took 26 hours to
>> finally unwedge the gate. In those 26 hours we got the check queue failure
>> rate (always higher than the gate failure rate) down from around 87% to
>> below 10%. And we still have many more bugs to track down and fix in order
>> to improve gate stability.
>> 
>> 
>> Eight major bug fixes later, we have the gate back to a reasonable failure
>> rate. But how did things get so bad? I'm glad you asked; here is a
>> blow-by-blow account.
>> 
>> The gate has not been completely stable for a very long time, and it only
>> took two new bugs to wedge it. Starting with the list of bugs we identified
>> via elastic-recheck, we fixed four bugs that had already been in the gate
>> for a few weeks.
>> 
>> 
>> https://bugs.launchpad.net/bugs/1224001 "test_network_basic_ops fails
>> waiting for network to become available"
>> 
>> https://review.openstack.org/57290 was the fix which depended on
>> https://review.openstack.org/53188 and https://review.openstack.org/57475.
>> 
>> This fixed a race condition where the IP address from DHCP was not received
>> by the VM at the right time. Minimize polling on the agent now defaults to
>> True, which should consistently reduce the time needed to configure an
>> interface on br-int.
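>> 
>> For reference, the shape of that change in oslo.config terms is roughly the
>> following; this is a minimal sketch only, with the option name taken from
>> the change under discussion, while the group name and help text are my
>> assumptions:
>> 
>>     from oslo.config import cfg
>> 
>>     agent_opts = [
>>         cfg.BoolOpt('minimize_polling',
>>                     default=True,  # previously False; flipped by the fix
>>                     help='Use ovsdb monitoring instead of interval polling.'),
>>     ]
>> 
>>     CONF = cfg.CONF
>>     CONF.register_opts(agent_opts, 'AGENT')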
>> 
>> https://bugs.launchpad.net/bugs/1252514 "Swift returning errors when setup
>> using devstack"
>> 
>> Fix https://review.openstack.org/#/c/57373/
>> 
>> There were a few Swift-related problems that were sorted out as well. Most
>> had to do with tuning Swift properly for its use as the Glance backend in
>> the gate and ensuring that timeout values were appropriate for the devstack
>> test slaves: in resource-constrained environments the Swift default timeouts
>> could be tripped frequently, even though the logs showed the requests would
>> have finished successfully given enough time. Swift also had a race
>> condition in how it constructed its sqlite3 files for containers and
>> accounts, where it was not retrying operations when the database was locked.
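>> 
>> The retry part of the fix boils down to a familiar sqlite pattern; here is a
>> minimal, generic sketch (not the actual Swift code) of retrying an operation
>> when sqlite3 reports that the database is locked:
>> 
>>     import sqlite3
>>     import time
>> 
>>     def execute_with_retry(conn, statement, args=(), attempts=5, delay=0.1):
>>         """Run a statement, retrying briefly while the database is locked."""
>>         for attempt in range(attempts):
>>             try:
>>                 return conn.execute(statement, args)
>>             except sqlite3.OperationalError as exc:
>>                 locked = 'database is locked' in str(exc)
>>                 if not locked or attempt == attempts - 1:
>>                     raise
>>                 time.sleep(delay)  # another writer holds the lock; back off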
>> 
>> https://bugs.launchpad.net/swift/+bug/1243973 "Simultaneous PUT requests for
>> the same account..."
>> 
>> Fix https://review.openstack.org/#/c/57019/
>> 
>> This was not on our original list of bugs, but while in bug-fix mode we got
>> this one fixed as well.
>> 
>> https://bugs.launchpad.net/bugs/1251784 "nova+neutron scheduling error:
>> Connection to neutron failed: Maximum attempts reached"
>> 
>> Fix https://review.openstack.org/#/c/57509/
>> 
>> Uncovered on the mailing list
>> (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019906.html)
>> 
>> Nova had a very old copy of oslo's local.py, which is used for managing
>> references to local variables in coroutines. The old version had a pretty
>> significant bug that basically meant non-weak references to variables were
>> not managed properly. This fix has made the nova-neutron interactions much
>> more reliable.
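>> 
>> For those who haven't looked at local.py, the idea is roughly this; the
>> sketch below is a conceptual stand-in that uses threading.local instead of
>> eventlet's coroutine-local class, so the base class and names are
>> illustrative only:
>> 
>>     import threading
>>     import weakref
>> 
>>     class WeakLocal(threading.local):
>>         """A local store that keeps only weak references to its values."""
>> 
>>         def __setattr__(self, attr, value):
>>             # store a weak reference so the store never keeps objects alive
>>             super(WeakLocal, self).__setattr__(attr, weakref.ref(value))
>> 
>>         def __getattribute__(self, attr):
>>             value = super(WeakLocal, self).__getattribute__(attr)
>>             # dereference on the way out; None once the object is gone
>>             return value() if isinstance(value, weakref.ref) else value
>> 
>>     strong_store = threading.local()  # keeps assigned objects alive
>>     weak_store = WeakLocal()          # lets assigned objects be collected
>> 
>> Getting the strong/weak distinction wrong means request-local objects are
>> either kept alive too long or dropped too early, which is the kind of subtle
>> breakage that showed up in the nova-neutron interactions.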
>> 
>> This fixed the number 2 bug on our list of top gate bugs
>> (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html)!
>> 
>> 
>> In addition to fixing four old bugs, we fixed two new bugs that were
>> introduced or exposed this week.
>> 
>> https://bugs.launchpad.net/bugs/1251920 "Tempest failures due to failure to
>> return console logs from an instance"
>> 
>> Introduced by: https://review.openstack.org/#/c/54363/ [Tempest]
>> 
>> Fix (workaround): https://review.openstack.org/#/c/57193/
>> 
>> After many false starts and banging our heads against the wall, we
>> identified a change to tempest, https://review.openstack.org/54363, that
>> added a new test around the same time as bug 1251920 became a problem.
>> Forcing tempest to skip this test had a very high incidence of success
>> without any 1251920-related failures. As a result we are working around this
>> bug by skipping that test until it can be run without major impact to the
>> gate.
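>> 
>> The workaround itself is just a skip that names the bug, so it is easy to
>> find and revert once the underlying issue is fixed. A generic illustration
>> using the standard library decorator (tempest has its own skip helpers, and
>> the class and method names here are made up):
>> 
>>     import unittest
>> 
>>     class ExampleServerTest(unittest.TestCase):  # hypothetical test class
>> 
>>         @unittest.skip("Skipped until bug 1251920 is fixed; see change 57193")
>>         def test_that_triggers_the_bug(self):
>>             pass  # real test body elided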
>> 
>> The change that introduced this problematic test had to go through the gate
>> four times before it would merge, though only one of the three failed
>> attempts appears to have triggered 1251920. Or as Jeremy Stanley (fungi)
>> said, "nondeterministic failures breed more nondeterministic failures,
>> because people are so used to having to reverify their patches to get them
>> to merge that they are doing so even when it's their patch which is
>> introducing a nondeterministic bug."
>> 
>> https://bugs.launchpad.net/bugs/1252170 "tempest.scenario
>> test_resize_server_confirm failed in grenade"
>> 
>> Fix https://review.openstack.org/#/c/57357/
>> 
>> Fix https://review.openstack.org/#/c/57572/
>> 
>> First, we started running post-Grenade upgrade tests in parallel (to fix
>> another bug), which would normally be fine, but Grenade wasn't configuring
>> the small flavors typically used by tempest, so the devstack Jenkins slaves
>> could run out of memory when starting many larger VMs in parallel. To fix
>> this, devstack's lib/tempest has been updated to create the flavors only if
>> they don't exist, and Grenade now allows tempest to use its default instance
>> flavors.
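>> 
>> The "create the flavors only if they don't exist" part looks roughly like
>> this; the real devstack change is shell in lib/tempest, so the Python below
>> (novaclient, devstack's usual nano/micro flavor sizes, placeholder
>> credentials) is only an illustrative sketch:
>> 
>>     from novaclient import client
>> 
>>     nova = client.Client('2', 'admin', 'secret', 'admin',
>>                          'http://127.0.0.1:5000/v2.0')  # placeholder creds
>> 
>>     wanted = {
>>         'm1.nano':  dict(flavorid='42', ram=64, vcpus=1, disk=0),
>>         'm1.micro': dict(flavorid='84', ram=128, vcpus=1, disk=0),
>>     }
>> 
>>     existing = set(f.name for f in nova.flavors.list())
>>     for name, spec in wanted.items():
>>         if name not in existing:  # create only if it is not already there
>>             nova.flavors.create(name, spec['ram'], spec['vcpus'],
>>                                 spec['disk'], flavorid=spec['flavorid'])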
>> 
>> 
>> 
>> Now that we have the gate back in working order, we are working on the
>> next steps to prevent this from happening again.  The two most immediate
>> changes are:
>> 
>> Doing a better job of triaging gate bugs
>> (http://lists.openstack.org/pipermail/openstack-dev/2013-November/020048.html).
>> 
>> In the next few days we will remove 'reverify no bug' (although you will
>> still be able to run 'reverify bug x').
>> 
>> 
>> Best,
>> Joe Gordon
>> Clark Boylan
>> 
>> _______________________________________________
>> OpenStack-dev mailing list
>> OpenStack-dev at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 
> 
> 
> -- 
> Davanum Srinivas :: http://davanum.wordpress.com
> 
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


