<div dir="ltr">Hi All,<div><br></div><div>TL;DR: Failure rate for gate jobs in graphite <a href="http://tinyurl.com/mqju53r">http://tinyurl.com/mqju53r </a><br><div><br></div><div>I am sure many of you are wondering why you keep having to type 'recheck bug x' all the time (I know I am), so I will try to answer that question here.<br>
</div><div><br></div><div>Just before releasing Havana we started elastic-recheck to get a better grasp on what transient issues the gate is having. This has helped us classify the types of bugs we have and how often they occur[1] but it doesn't completely explain why the gate appears to fail so often.</div>
<div><br></div><div>Assuming all test runs fail independently, the chance that you will need a recheck is, to a first approximation, the per-job failure rate summed over every job run on the review. Each patch commonly goes through several revisions, so even a fairly low failure rate quickly adds up to a recheck.</div>
<div><br></div><div>Or in a simple equation:</div><div><br></div><div>percent_need_a_recheck_per_review = failure_rate * tempest_jobs * patch_revisions</div><div>(Here patch_revisions counts every run of the jobs: each revision plus the final gate run.)</div><div> </div><div><br></div><div>It turns out we have a graphite server, and after spending too much time on it, the graph linked below[2] shows the percent failure rate for:</div>
<div><font color="#000000">* <span style="font-family:sans-serif;font-size:13px">gate-tempest-devstack-vm-full</span></font></div><div><div><font color="#000000">* <span style="font-family:sans-serif;font-size:13px">gate-tempest-devstack-vm-neutron</span></font></div>
</div><div><div><div><font color="#000000">* <span style="font-family:sans-serif;font-size:13px">check-tempest-devstack-vm-neutron</span></font></div></div></div><div><div><font color="#000000">* <span style="font-family:sans-serif;font-size:13px">check-tempest-devstack-vm-full</span></font></div>
<div><font color="#000000"><span style="font-family:sans-serif;font-size:13px"><br></span></font></div><div><font color="#000000"><span style="font-family:sans-serif;font-size:13px">So each job fails between 5% and 10% of the time.</span></font></div>
<div><br></div><div><font color="#000000" face="sans-serif">Now to estimate </font>percent_need_a_recheck_per_review.</div><div><br></div><div>lower bound</div><div>=========</div><div>assumptions:</div><div> - 2 revisions + 1 gate run</div>
<div> - only count big tempest runs: full, neutron, postgres-full</div><div> - failure_rate of 5%</div><div>percent_need_a_recheck_per_review = 0.05 * 3 * 3 = 45%<font color="#000000"><span style="font-family:sans-serif;font-size:13px"><br>
</span></font></div><div><br></div><div>So on a good day you may only have to run a recheck on just under half of your reviews.</div><div><br></div><div><div>upper bound</div><div>=========</div><div>assumptions:</div><div>
- 5 revisions + 1 gate run</div><div> - count gating jobs that run tempest: full, neutron, postgres-full, large-ops, grenade</div><div> - failure_rate of 10%</div><div><div>percent_need_a_recheck_per_review = 0.10 * 5 * 6 = 300%</div>
</div></div></div><div><br></div><div>But on a bad day the estimate exceeds 100%, which is better read as an expected count of failed job runs than as a probability: you may need around 3 rechecks to get your patch merged!</div><div><br></div><div><br></div><div>In short, even tiny bugs in the gate have a major impact on the stability of the gate! And as we grow the number of integrated projects and increase the number of tests, this pattern will only get worse.</div>
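<div><br></div><div>As a rough cross-check of the two bounds above, here is a minimal Python sketch that compares the summed estimate with the exact probability of at least one failure, 1 - (1 - failure_rate)^(jobs * runs), under the same independence assumption. The function names are made up for illustration, and the scenario numbers are the ones from the lower and upper bounds.</div>

```python
def linear_estimate(failure_rate, jobs, runs):
    """Failure rate summed over every job run.

    By linearity of expectation this is also the expected number of
    failed job runs per review, which is why it can exceed 1.0.
    """
    return failure_rate * jobs * runs


def exact_probability(failure_rate, jobs, runs):
    """Probability that at least one of jobs * runs independent runs fails."""
    return 1 - (1 - failure_rate) ** (jobs * runs)


# (label, failure_rate, jobs, runs), taken from the two scenarios above.
scenarios = [
    ("lower bound", 0.05, 3, 3),  # 3 jobs, 2 revisions + 1 gate run
    ("upper bound", 0.10, 5, 6),  # 5 jobs, 5 revisions + 1 gate run
]

for label, rate, jobs, runs in scenarios:
    estimate = linear_estimate(rate, jobs, runs)
    exact = exact_probability(rate, jobs, runs)
    print(f"{label}: summed estimate={estimate:.2f}, at least one failure={exact:.2f}")
```

<div>Under these assumptions the exact chance of hitting at least one failure comes out at about 37% for the lower bound and about 96% for the upper bound, so the summed estimate overstates the probability somewhat but tells the same story.</div>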
<div><br></div><div><br></div><div>[1] <a href="http://status.openstack.org/elastic-recheck/">http://status.openstack.org/elastic-recheck/</a></div><div>[2] <a href="http://tinyurl.com/mqju53r">http://tinyurl.com/mqju53r </a></div>
</div><div><br></div><div><br></div><div>Best,</div><div>Joe Gordon</div></div>