[openstack-dev] Gate Math or why you keep typing 'recheck'

Joe Gordon joe.gordon0 at gmail.com
Thu Nov 14 03:15:44 UTC 2013


Hi All,

TL;DR: Failure rate for gate jobs in graphite
http://tinyurl.com/mqju53r

I am sure many of you are wondering why you keep having to type 'recheck
bug x' all the time (I know I am), so I will try to answer that question
here.

Just before releasing Havana we started elastic-recheck to get a better
grasp on what transient issues the gate is having. This has helped us
classify the types of bugs we have and how often they occur[1] but it
doesn't completely explain why the gate appears to fail so often.

Assuming all test failures are independent, the chance that you will need
to run a recheck is roughly the sum of the failure rates of every job that
runs on your patch. Each patch commonly has several revisions, so even a
fairly low failure rate can quickly force you to use a recheck.

Or in a simple equation:

percent_need_a_recheck_per_review = failure_rate * tempest_jobs * patch_revisions
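
The equation above adds up failure rates, so it is really a linear
estimate (the expected number of failed job runs). As a rough sketch,
assuming every job run fails independently with the same probability, the
Python below compares that sum with the exact chance of seeing at least one
failure. The function names are made up for illustration, and the 3 jobs
and 3 runs in the loop are placeholder values.

```python
def linear_estimate(failure_rate, tempest_jobs, patch_revisions):
    """The simple equation: expected number of failed job runs per review."""
    return failure_rate * tempest_jobs * patch_revisions


def chance_of_any_failure(failure_rate, tempest_jobs, patch_revisions):
    """Chance that at least one job run fails.

    Assumption (not from graphite): every run is independent and has the
    same failure rate.
    """
    total_runs = tempest_jobs * patch_revisions
    return 1 - (1 - failure_rate) ** total_runs


if __name__ == "__main__":
    # Placeholder inputs: 3 jobs, 3 runs per review, a few failure rates.
    for rate in (0.01, 0.05, 0.10):
        linear = linear_estimate(rate, 3, 3)
        exact = chance_of_any_failure(rate, 3, 3)
        print(f"failure_rate={rate:.2f} linear={linear:.2f} exact={exact:.2f}")
```

The two numbers stay close while the failure rate is small, and the linear
estimate overshoots as the number of runs grows, which is why it can exceed
100%.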


It turns out we have a graphite server, and after spending too much time on
it, below[2] is the percent failure rate for:
* gate-tempest-devstack-vm-full
* gate-tempest-devstack-vm-neutron
* check-tempest-devstack-vm-neutron
* check-tempest-devstack-vm-full

So each job fails between 5% and 10% of the time.

Now to estimate percent_need_a_recheck_per_review.

lower bound
=========
assumptions:
  - 2 revisions + 1 gate run
  - only count big tempest runs: full, neutron, postgres-full
  - failure_rate of 5%
percent_need_a_recheck_per_review = 0.05 * 3 * 3 = 45%

So on a good day you may only have to run a recheck on just under half of
your reviews.

upper bound
=========
assumptions:
  - 5 revisions + 1 gate run
  - count gating tests that run tempest: full, neutron, postgres-full,
    large-ops, grenade
  - failure_rate of 10%
percent_need_a_recheck_per_review = 0.10 * 5 * 6 = 300%

But on a bad day you may need 3 rechecks to get your patch merged!
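
A figure above 100% is easier to read as an expected count than as a
probability. The sketch below is a small simulation of both scenarios,
assuming independent job runs; simulate_reviews is a hypothetical helper,
not part of any gate tooling. Note that it counts failed job runs, which
overestimates rechecks whenever several jobs fail in the same run.

```python
import random


def simulate_reviews(failure_rate, jobs, runs, reviews=20000, seed=0):
    """Return (mean failed job runs per review, share of reviews with a failure).

    Assumption: each of the jobs * runs job runs fails independently.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    total_failures = 0
    reviews_with_failure = 0
    for _ in range(reviews):
        failures = sum(rng.random() < failure_rate for _ in range(jobs * runs))
        total_failures += failures
        if failures:
            reviews_with_failure += 1
    return total_failures / reviews, reviews_with_failure / reviews


if __name__ == "__main__":
    # Lower bound: 5% failure rate, 3 jobs, 3 runs.
    # Upper bound: 10% failure rate, 5 jobs, 6 runs.
    for label, rate, jobs, runs in (("lower", 0.05, 3, 3), ("upper", 0.10, 5, 6)):
        mean, share = simulate_reviews(rate, jobs, runs)
        print(f"{label}: mean failed runs={mean:.2f}, reviews hit={share:.0%}")
```

Under these assumptions the mean number of failed runs tracks the simple
equation, while the share of reviews that hit at least one failure stays
below 100%.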


In short, even tiny bugs in the gate have a major impact on its stability!
And as we grow the number of integrated projects and increase the number of
tests, this pattern will only get worse.


[1] http://status.openstack.org/elastic-recheck/
[2] http://tinyurl.com/mqju53r


Best,
Joe Gordon