<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Sun, Nov 24, 2013 at 9:58 PM, Robert Collins <span dir="ltr"><<a href="mailto:robertc@robertcollins.net" target="_blank">robertc@robertcollins.net</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">I have a proposal - I think we should mark all recheck bugs critical,<br>


and the respective project PTLs should actively shop around amongst<br>

their contributors to get them fixed before other work: we should<br>

drive the known set of nondeterministic issues down to 0 and keep it<br>

there.<br></blockquote><div><br></div><div><br></div><div>Yes! In fact we are already working towards that. See <a href="http://lists.openstack.org/pipermail/openstack-dev/2013-November/020048.html">http://lists.openstack.org/pipermail/openstack-dev/2013-November/020048.html</a> </div>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

-Rob<br>

<div><div class="h5"><br>

On 25 November 2013 18:00, Joe Gordon <<a href="mailto:joe.gordon0@gmail.com">joe.gordon0@gmail.com</a>> wrote:<br>

> Hi All,<br>

><br>

> TL;DR Last week the gate got wedged on nondeterministic failures. Unwedging<br>

> the gate required drastic actions to fix bugs.<br>

><br>

> Starting on November 15th, gate jobs have been getting progressively less<br>

> stable with not enough attention given to fixing the issues, until we got to<br>

> the point where the gate was almost fully wedged.  No single bug caused this;<br>

> it was a collection of bugs that got us here. The gate protects us from code<br>

> that fails 100% of the time, but if a patch fails 10% of the time it can<br>

> slip through.  Add a few of these bugs together and we get the gate to a<br>

> point where the gate is fully wedged and fixing it without circumventing the<br>

> gate (something we never want to do) is very hard.  It took just 2 new<br>

> nondeterministic bugs to take us from a gate that mostly worked, to a gate<br>

> that was almost fully wedged.  Last week we found out Jeremy Stanley (fungi)<br>

> was right when he said, "nondeterministic failures breed more<br>

> nondeterministic failures, because people are so used to having to reverify<br>

> their patches to get them to merge that they are doing so even when it's<br>

> their patch which is introducing a nondeterministic bug."<br>

><br>
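As a rough illustration of the arithmetic above: assuming the nondeterministic bugs fail independently of each other (an assumption, as are the function names and the sample rates below, none of which come from the gate data), the chance of a clean gate run shrinks multiplicatively, while a patch that fails 10% of the time almost always gets through after a reverify.<br>
<pre>
```python
def pass_probability(failure_rates):
    """Chance that a single gate run passes when every flaky bug stays quiet.

    Assumes the bugs trigger independently, which is a simplification.
    """
    probability = 1.0
    for rate in failure_rates:
        probability *= 1.0 - rate
    return probability


def merge_probability(patch_failure_rate, attempts):
    """Chance that a flaky patch passes at least once in `attempts` gate runs."""
    return 1.0 - patch_failure_rate ** attempts


if __name__ == "__main__":
    # Hypothetical numbers: five unrelated bugs, each failing 10% of runs.
    print(pass_probability([0.10] * 5))
    # A patch that itself fails 10% of the time, submitted and reverified once.
    print(merge_probability(0.10, 2))
```
</pre>
Under these hypothetical numbers, five bugs at 10% each leave only about 59% of runs clean, and a patch carrying a 10% failure of its own passes within two attempts about 99% of the time.<br>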

> Side note: This is not the first time we have wedged the gate; the first time was<br>

> around September 26th, right when we were cutting Havana release candidates.<br>

> In response we wrote elastic-recheck<br>

> (<a href="http://status.openstack.org/elastic-recheck/" target="_blank">http://status.openstack.org/elastic-recheck/</a>) to better track what bugs we<br>

> were seeing.<br>

><br>

> Gate stability according to Graphite: <a href="http://paste.openstack.org/show/53765/" target="_blank">http://paste.openstack.org/show/53765/</a><br>

> (they are huge because they encode entire queries, so including as a<br>

> pastebin).<br>

><br>

> After sending out an email to ask for help fixing the top known gate bugs<br>

> (<a href="http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html" target="_blank">http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html</a>),<br>

> we had a few possible fixes. But with the gate wedged, the merge queue was<br>

> 145 patches long and could take days to be processed. In the worst case,<br>

> with none of the patches merging, it would take about 1 hour per patch. So on<br>

> November 20th we asked for a freeze on any non-critical bug fixes (<br>

> <a href="http://lists.openstack.org/pipermail/openstack-dev/2013-November/019941.html" target="_blank">http://lists.openstack.org/pipermail/openstack-dev/2013-November/019941.html</a><br>

> ), and kicked everything out of the merge queue and put our possible bug<br>

> fixes at the front. Even with these drastic measures it still took 26 hours<br>

> to finally unwedge the gate. In 26 hours we got the check queue failure rate<br>

> (always higher than the gate failure rate) down from around 87% failure to<br>

> below 10% failure. And we still have many more bugs to track down and fix in<br>

> order to improve gate stability.<br>

><br>

><br>

> 8 Major bug fixes later, we have the gate back to a reasonable failure rate.<br>

> But how did things get so bad? I'm glad you asked; here is a blow-by-blow<br>

> account.<br>

><br>

> The gate has not been completely stable for a very long time, and it only<br>

> took two new bugs to wedge the gate. Starting with the list of bugs we<br>

> identified via elastic-recheck, we fixed 4 bugs that had been in the gate<br>

> for a few weeks already.<br>

><br>

><br>

>  <a href="https://bugs.launchpad.net/bugs/1224001" target="_blank">https://bugs.launchpad.net/bugs/1224001</a> "test_network_basic_ops fails<br>

> waiting for network to become available"<br>

><br>

> <a href="https://review.openstack.org/57290" target="_blank">https://review.openstack.org/57290</a> was the fix which depended on<br>

> <a href="https://review.openstack.org/53188" target="_blank">https://review.openstack.org/53188</a> and <a href="https://review.openstack.org/57475" target="_blank">https://review.openstack.org/57475</a>.<br>

><br>

> This fixed a race condition where the IP address from DHCP was not received<br>

> by the VM at the right time. Minimize polling on the agent is now defaulted<br>

> to True, which should reduce the time needed for configuring an interface on<br>

> br-int consistently.<br>

><br>

> <a href="https://bugs.launchpad.net/bugs/1252514" target="_blank">https://bugs.launchpad.net/bugs/1252514</a> "Swift returning errors when setup<br>

> using devstack"<br>

><br>

> Fix <a href="https://review.openstack.org/#/c/57373/" target="_blank">https://review.openstack.org/#/c/57373/</a><br>

><br>

> There were a few swift related problems that were sorted out as well. Most<br>

> had to do with tuning swift properly for its use as a glance backend in the<br>

> gate, ensuring that timeout values were appropriate for the devstack test<br>

> slaves (in resource constrained environments, the swift default timeouts could be<br>

> tripped frequently (logs showed the request would have finished successfully<br>

> given enough time)). Swift also had a race-condition in how it constructed<br>

> its sqlite3 files for containers and accounts, where it was not retrying operations when<br>

> the database was locked.<br>

><br>
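The missing behaviour described above (retrying an operation when the database is locked) can be sketched generically in Python. This is not Swift's actual code: <code>run_with_retry</code>, its parameters and the backoff values are hypothetical, and only <code>sqlite3.OperationalError</code> is a real standard-library name.<br>
<pre>
```python
import sqlite3
import time


def run_with_retry(operation, attempts=5, delay=0.05):
    """Call `operation`, retrying only when SQLite reports a locked database.

    `operation` is any zero-argument callable that performs the database work.
    The delay doubles after each failed attempt (simple exponential backoff).
    """
    for attempt in range(attempts):
        try:
            return operation()
        except sqlite3.OperationalError as exc:
            # Re-raise errors that are not lock errors, and the final failure.
            if "locked" not in str(exc) or attempt == attempts - 1:
                raise
            time.sleep(delay * (2 ** attempt))


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE container (name TEXT)")

    def insert_row():
        with connection:
            connection.execute("INSERT INTO container VALUES ('example')")
        return connection.execute("SELECT COUNT(*) FROM container").fetchone()[0]

    print(run_with_retry(insert_row))
```
</pre>
Matching on the error message is a deliberate shortcut for this sketch; a real service would probably also cap the total wait so that a request cannot stall indefinitely.<br>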

> <a href="https://bugs.launchpad.net/swift/+bug/1243973" target="_blank">https://bugs.launchpad.net/swift/+bug/1243973</a> "Simultaneous PUT requests for<br>

> the same account..."<br>

><br>

> Fix <a href="https://review.openstack.org/#/c/57019/" target="_blank">https://review.openstack.org/#/c/57019/</a><br>

><br>

> This was not on our original list of bugs, but while in bug fix mode, we got<br>

> this one fixed as well.<br>

><br>

> <a href="https://bugs.launchpad.net/bugs/1251784" target="_blank">https://bugs.launchpad.net/bugs/1251784</a> "nova+neutron scheduling error:<br>

> Connection to neutron failed: Maximum attempts reached"<br>

><br>

> Fix <a href="https://review.openstack.org/#/c/57509/" target="_blank">https://review.openstack.org/#/c/57509/</a><br>

><br>

> Uncovered on mailing list<br>

> (<a href="http://lists.openstack.org/pipermail/openstack-dev/2013-November/019906.html" target="_blank">http://lists.openstack.org/pipermail/openstack-dev/2013-November/019906.html</a>)<br>

><br>

> Nova had a very old version of oslo's local.py which is used for managing<br>

> references to local variables in coroutines. The old version had a pretty<br>

> significant bug that basically meant non-weak references to variables were<br>

> not managed properly. This fix has made the nova neutron interactions much<br>

> more reliable.<br>

><br>

> This fixed the number 2 bug on our list of top gate bugs<br>

> (<a href="http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html" target="_blank">http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html</a><br>

> )!<br>

><br>

><br>

> In addition to fixing 4 old bugs, we fixed two new bugs that were introduced<br>

> / exposed this week.<br>

><br>

> <a href="https://bugs.launchpad.net/bugs/1251920" target="_blank">https://bugs.launchpad.net/bugs/1251920</a> "Tempest failures due to failure to<br>

> return console logs from an instance Project"<br>

><br>

> Bug: <a href="https://review.openstack.org/#/c/54363/" target="_blank">https://review.openstack.org/#/c/54363/</a> [Tempest]<br>

><br>

> Fix(work around): <a href="https://review.openstack.org/#/c/57193/" target="_blank">https://review.openstack.org/#/c/57193/</a><br>

><br>

> After many false starts and banging our head against the wall, we identified<br>

> a change to tempest, <a href="https://review.openstack.org/54363" target="_blank">https://review.openstack.org/54363</a> , that added a new<br>

> test around the same time as bug 1251920 became a problem. Forcing tempest<br>

> to skip this test had a very high incidence of success without any 1251920<br>

> related failures. As a result we are working around this bug by skipping that<br>

> test, until it can be run without major impact to the gate.<br>

><br>

> The change that introduced this problematic test had to go through the gate<br>

> four times before it would merge, though only one of the 3 failed attempts<br>

> appears to have triggered 1251920.  Or as Jeremy Stanley (fungi) said<br>

> "nondeterministic failures breed more nondeterministic failures, because<br>

> people are so used to having to reverify their patches to get them to merge<br>

> that they are doing so even when it's their patch which is introducing a<br>

> nondeterministic bug."<br>

><br>

> <a href="https://bugs.launchpad.net/bugs/1252170" target="_blank">https://bugs.launchpad.net/bugs/1252170</a> "tempest.scenario<br>

> test_resize_server_confirm failed in grenade"<br>

><br>

> Fix <a href="https://review.openstack.org/#/c/57357/" target="_blank">https://review.openstack.org/#/c/57357/</a><br>

><br>

> Fix <a href="https://review.openstack.org/#/c/57572/" target="_blank">https://review.openstack.org/#/c/57572/</a><br>

><br>

> First we started running post Grenade upgrade tests in parallel (to fix<br>

> another bug) which would normally be fine, but Grenade wasn't configuring<br>

> the small flavors typically used by tempest so it was possible for the<br>

> devstack Jenkins slaves to run out of memory when starting many larger VMs<br>

> in parallel. To fix this devstack lib/tempest has been updated to create the<br>

> flavors only if they don't exist and Grenade is allowing tempest to use its<br>

> default instance flavors.<br>

><br>

><br>

><br>

> Now that we have the gate back into working order, we are working on the<br>

> next steps to prevent this from happening again.  The two most immediate<br>

> changes are:<br>

><br>

> Doing a better job of triaging gate bugs<br>

> (<a href="http://lists.openstack.org/pipermail/openstack-dev/2013-November/020048.html" target="_blank">http://lists.openstack.org/pipermail/openstack-dev/2013-November/020048.html</a><br>

> ).<br>

><br>

> In the next few days we will remove 'reverify no bug' (although you will<br>

> still be able to run 'reverify bug x').<br>

><br>

><br>

> Best,<br>

> Joe Gordon<br>

> Clark Boylan<br>

><br>

</div></div>> _______________________________________________<br>

> OpenStack-dev mailing list<br>

> <a href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a><br>

> <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>

><br>

<span class=""><font color="#888888"><br>

<br>

<br>

--<br>

Robert Collins <<a href="mailto:rbtcollins@hp.com">rbtcollins@hp.com</a>><br>

Distinguished Technologist<br>

HP Converged Cloud<br>

<br>


</font></span></blockquote></div><br></div></div>