[openstack-dev] Unwedging the gate

Gary Kotton gkotton at vmware.com
Mon Nov 25 07:11:17 UTC 2013


Hi,
Thanks for writing this up. This is very positive in a number of respects:

 1.  It is always good to do a postmortem and try to learn from our mistakes
 2.  Visibility of the issues and their resolution

Thanks again
Gary

From: Joe Gordon <joe.gordon0 at gmail.com>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
Date: Monday, November 25, 2013 7:00 AM
To: OpenStack Development Mailing List <openstack-dev at lists.openstack.org>
Subject: [openstack-dev] Unwedging the gate

Hi All,

TL;DR: Last week the gate got wedged on nondeterministic failures. Unwedging it required drastic action to get bug fixes merged.

Starting on November 15th, gate jobs became progressively less stable, without enough attention being given to fixing the issues, until the gate was almost fully wedged.  No single bug caused this; it was a collection of bugs that got us here. The gate protects us from code that fails 100% of the time, but a patch that fails 10% of the time can slip through.  Add a few of these bugs together and the gate ends up fully wedged, and fixing it without circumventing the gate (something we never want to do) is very hard.  It took just two new nondeterministic bugs to take us from a gate that mostly worked to a gate that was almost fully wedged.  Last week we found out Jeremy Stanley (fungi) was right when he said, "nondeterministic failures breed more nondeterministic failures, because people are so used to having to reverify their patches to get them to merge that they are doing so even when it's their patch which is introducing a nondeterministic bug."

Side note: This is not the first time we have wedged the gate; the first time was around September 26th, right when we were cutting Havana release candidates.  In response we wrote elastic-recheck (http://status.openstack.org/elastic-recheck/) to better track which bugs we were seeing.

Gate stability according to Graphite: http://paste.openstack.org/show/53765/ (the Graphite links are huge because they encode entire queries, so they are included as a pastebin).

After sending out an email to ask for help fixing the top known gate bugs (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html), we had a few possible fixes. But with the gate wedged, the merge queue was 145 patches long and could take days to be processed; in the worst case, with none of the patches merging, it would take about one hour per patch, or roughly six days for the whole queue. So on November 20th we asked for a freeze on any non-critical bug fixes (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019941.html), kicked everything out of the merge queue, and put our possible bug fixes at the front. Even with these drastic measures it still took 26 hours to finally unwedge the gate. In those 26 hours we brought the check queue failure rate (always higher than the gate failure rate) down from around 87% to below 10%. And we still have many more bugs to track down and fix in order to improve gate stability.


Eight major bug fixes later, we have the gate back to a reasonable failure rate. But how did things get so bad? I'm glad you asked; here is a blow-by-blow account.

The gate has not been completely stable for a very long time, and it only took two new bugs to wedge it. Starting with the list of bugs we identified via elastic-recheck, we fixed four bugs that had already been in the gate for a few weeks.



 *   https://bugs.launchpad.net/bugs/1224001 "test_network_basic_ops fails waiting for network to become available"

 *   https://review.openstack.org/57290 was the fix which depended on https://review.openstack.org/53188 and https://review.openstack.org/57475.

 *   This fixed a race condition where the VM did not receive its IP address from DHCP at the right time. Minimize polling on the agent now defaults to True, which should consistently reduce the time needed to configure an interface on br-int.

 *   https://bugs.launchpad.net/bugs/1252514 "Swift returning errors when setup using devstack"

 *   Fix https://review.openstack.org/#/c/57373/

 *   There were a few Swift-related problems that were sorted out as well. Most had to do with tuning Swift properly for its use as a Glance backend in the gate, ensuring that timeout values were appropriate for the resource-constrained devstack test slaves, where the default Swift timeouts could be tripped frequently (logs showed the requests would have finished successfully given enough time). Swift also had a race condition in how it constructed its sqlite3 files for containers and accounts: it was not retrying operations when the database was locked.

 *   https://bugs.launchpad.net/swift/+bug/1243973 "Simultaneous PUT requests for the same account..."

 *   Fix https://review.openstack.org/#/c/57019/

 *   This next one was not on our original list of bugs, but while in bug-fixing mode we got it fixed as well:

 *   https://bugs.launchpad.net/bugs/1251784 "nova+neutron scheduling error: Connection to neutron failed: Maximum attempts reached"

 *   Fix https://review.openstack.org/#/c/57509/

 *   Uncovered on the mailing list (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019906.html)

 *   Nova had a very old copy of oslo's local.py, which is used for managing references to local variables in coroutines. The old version had a pretty significant bug that basically meant non-weak references to variables were not managed properly. This fix has made the nova/neutron interactions much more reliable (a small illustrative sketch follows this list).

 *   This fixed the number 2 bug on our list of top gate bugs (http://lists.openstack.org/pipermail/openstack-dev/2013-November/019826.html)!
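
To make the local.py issue a little more concrete, here is a minimal, self-contained sketch. It is not the oslo or nova code, just an illustration of the failure mode: a store that holds nothing but weak references silently "loses" an object, such as a request context, as soon as the last strong reference to it elsewhere is dropped.

    import threading
    import weakref

    class WeakLocal(threading.local):
        """A thread-local store that keeps only weak references to its values."""

        def __getattribute__(self, attr):
            ref = super(WeakLocal, self).__getattribute__(attr)
            # Dereference the weakref; this returns None once the object is gone.
            return ref and ref()

        def __setattr__(self, attr, value):
            super(WeakLocal, self).__setattr__(attr, weakref.ref(value))

    class Context(object):
        """Stand-in for something like a request context."""
        def __init__(self, request_id):
            self.request_id = request_id

    # A weak store can lose values; a plain threading.local keeps them.
    weak_store = WeakLocal()
    strong_store = threading.local()

    ctx_a, ctx_b = Context('req-a'), Context('req-b')
    weak_store.context = ctx_a
    strong_store.context = ctx_b
    del ctx_a, ctx_b  # e.g. the calling frame goes away

    print(weak_store.context)               # None -- the context silently vanished
    print(strong_store.context.request_id)  # 'req-b' -- still available

If the store that is supposed to hold strong references accidentally behaves like the weak one, context saved early in a request can vanish before a later call needs it, which surfaces as exactly the kind of intermittent nova/neutron failure described above.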

In addition to fixing four old bugs, we fixed two new bugs that were introduced or exposed this week.


 *   https://bugs.launchpad.net/bugs/1251920 "Tempest failures due to failure to return console logs from an instance"

 *   Introduced by: https://review.openstack.org/#/c/54363/ (a tempest change)

 *   Fix (workaround): https://review.openstack.org/#/c/57193/

 *   After many false starts and banging our heads against the wall, we identified a change to tempest, https://review.openstack.org/54363, that added a new test around the same time as bug 1251920 became a problem. Forcing tempest to skip this test had a very high incidence of success without any 1251920-related failures. As a result we are working around this bug by skipping that test until it can be run without major impact to the gate.

 *   The change that introduced this problematic test had to go through the gate four times before it would merge, though only one of the three failed attempts appears to have triggered 1251920.  Or, as Jeremy Stanley (fungi) said, "nondeterministic failures breed more nondeterministic failures, because people are so used to having to reverify their patches to get them to merge that they are doing so even when it's their patch which is introducing a nondeterministic bug."

 *   https://bugs.launchpad.net/bugs/1252170 "tempest.scenario test_resize_server_confirm failed in grenade"

 *   Fix https://review.openstack.org/#/c/57357/

 *   Fix https://review.openstack.org/#/c/57572/

 *   First, we started running the post-Grenade upgrade tests in parallel (to fix another bug). That would normally be fine, but Grenade wasn't configuring the small flavors typically used by tempest, so the devstack Jenkins slaves could run out of memory when starting many larger VMs in parallel. To fix this, devstack's lib/tempest has been updated to create the flavors only if they don't already exist, and Grenade now allows tempest to use its default instance flavors (a rough sketch of the idempotent flavor creation follows below).
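
For illustration only (the real fix lives in devstack's lib/tempest shell code), here is a sketch of the same "create it only if it is missing" pattern using the openstacksdk Python client. The cloud name and flavor sizes below are assumptions, not the gate's actual values.

    # Illustrative sketch, not the devstack change itself: create small test
    # flavors only when they are missing, so repeated setup runs stay
    # idempotent instead of erroring out or duplicating flavors.
    import openstack


    def ensure_flavor(conn, name, ram_mb, vcpus, disk_gb):
        """Create a flavor only if one with this name does not already exist."""
        if conn.compute.find_flavor(name, ignore_missing=True):
            return  # already present; nothing to do
        conn.compute.create_flavor(name=name, ram=ram_mb, vcpus=vcpus, disk=disk_gb)


    if __name__ == "__main__":
        # "devstack" is a hypothetical clouds.yaml entry for the target cloud.
        conn = openstack.connect(cloud="devstack")
        # Small flavors in the spirit of the m1.nano / m1.micro tempest defaults.
        ensure_flavor(conn, "m1.nano", ram_mb=64, vcpus=1, disk_gb=0)
        ensure_flavor(conn, "m1.micro", ram_mb=128, vcpus=1, disk_gb=0)

Making setup idempotent like this means re-running it against an already configured cloud neither fails nor creates duplicate flavors.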


Now that we have the gate back in working order, we are working on the next steps to prevent this from happening again.  The two most immediate changes are:

 *   Doing a better job of triaging gate bugs (http://lists.openstack.org/pipermail/openstack-dev/2013-November/020048.html).

 *   In the next few days we will remove 'reverify no bug' (although you will still be able to run 'reverify bug x').

Best,
Joe Gordon
Clark Boylan