<div dir="ltr">Hi All,<div><br></div><div>TL;DR: Failure rate for gate jobs in graphite <a href="http://tinyurl.com/mqju53r">http://tinyurl.com/mqju53r </a><br><div><br></div><div>I am sure many of you are wondering why you keep having to type 'recheck bug x' all the time (I know I am), so I will try to answer that question here.<br>


</div><div><br></div><div>Just before releasing Havana we started elastic-recheck to get a better grasp on what transient issues the gate is having. This has helped us classify the types of bugs we have and how often they occur[1] but it doesn't completely explain why the gate appears to fail so often.</div>


<div><br></div><div>Assuming all test runs are independent, the chance that you will need a recheck is roughly the failure rate summed over every job run, and each patch commonly has several revisions, so even a fairly low failure rate can quickly force you to recheck.</div>


<div><br></div><div>Or in a simple equation:</div><div><br></div><div>percent_need_a_recheck_per_review  = failure_rate * tempest_jobs * patch_revisions</div><div> </div><div><br></div><div>It turns out we have a graphite server, and after spending too much time on it, below[2] is the percent failure rate for:</div>


<div><font color="#000000">* <span style="font-family:sans-serif;font-size:13px">gate-tempest-devstack-vm-full</span></font></div><div><div><font color="#000000">* <span style="font-family:sans-serif;font-size:13px">gate-tempest-devstack-vm-neutron</span></font></div>


</div><div><div><div><font color="#000000">* <span style="font-family:sans-serif;font-size:13px">check-tempest-devstack-vm-neutron</span></font></div></div></div><div><div><font color="#000000">* <span style="font-family:sans-serif;font-size:13px">check-tempest-devstack-vm-full</span></font></div>


<div><font color="#000000"><span style="font-family:sans-serif;font-size:13px"><br></span></font></div><div><font color="#000000"><span style="font-family:sans-serif;font-size:13px">So each job fails between 5% and 10% of the time.</span></font></div>


<div><br></div><div><font color="#000000" face="sans-serif">Now to estimate </font>percent_need_a_recheck_per_review.</div><div><br></div><div>lower bound</div><div>=========</div><div>assumptions:</div><div>  - 2 revisions + 1 gate run, </div>


<div>  -  only count big tempest runs: full, neutron, postgres-full</div><div>  - failure_rate of 5%</div><div>percent_need_a_recheck_per_review = 0.05 * 3 * 3 = 45%<font color="#000000"><span style="font-family:sans-serif;font-size:13px"><br>


</span></font></div><div><br></div><div>So on a good day you may only have to run a recheck on just under half of your reviews.</div><div><br></div><div><div>upper bound</div><div>=========</div><div>assumptions:</div><div>


  - 5 revisions + 1 gate run, </div><div>  - count all gating tests that run tempest: full, neutron, postgres-full, large-ops, grenade</div><div>  - failure_rate of 10%</div><div><div>percent_need_a_recheck_per_review = 0.10 * 5 * 6 = 300%</div>


</div></div></div><div><br></div><div>But on a bad day you may need 3 rechecks to get your patch merged!</div><div><br></div><div><br></div><div>In short, even tiny bugs in the gate have a major impact on the stability of the gate! And as we grow the number of integrated projects and increase the number of tests, this pattern will only get worse.</div>
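<div><br></div><div>A note on the arithmetic: the simple equation is a linear estimate. Strictly speaking it gives the expected number of failed job runs per review, which is why it can exceed 100%. Here is a minimal Python sketch, assuming every job run fails independently with the same rate, that compares the linear estimate with the probability that at least one run fails. The function names are hypothetical and only for illustration; the inputs are the lower and upper bound assumptions from this mail.</div>

```python
def expected_failures(failure_rate, jobs, runs):
    """Linear estimate from the equation: failure_rate * jobs * runs."""
    return failure_rate * jobs * runs


def prob_at_least_one_failure(failure_rate, jobs, runs):
    """Chance that at least one of jobs * runs independent job runs fails."""
    return 1 - (1 - failure_rate) ** (jobs * runs)


# (failure_rate, tempest jobs, runs per review) taken from the two bounds
scenarios = {
    "lower bound": (0.05, 3, 3),  # 2 revisions + 1 gate run
    "upper bound": (0.10, 5, 6),  # 5 revisions + 1 gate run
}

for name, (rate, jobs, runs) in scenarios.items():
    linear = expected_failures(rate, jobs, runs)
    at_least_one = prob_at_least_one_failure(rate, jobs, runs)
    print(f"{name}: expected failures = {linear:.2f}, "
          f"P(at least one failure) = {at_least_one:.0%}")
```

<div>Under these assumptions the chance of needing at least one recheck comes out at about 37% for the lower bound and about 96% for the upper bound, so the conclusion holds either way: a small per-job failure rate compounds quickly.</div>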


<div><br></div><div><br></div><div>[1] <a href="http://status.openstack.org/elastic-recheck/">http://status.openstack.org/elastic-recheck/</a></div><div>[2] <a href="http://tinyurl.com/mqju53r">http://tinyurl.com/mqju53r </a></div>


</div><div><br></div><div><br></div><div>Best,</div><div>Joe Gordon</div></div>