[openstack-dev] Gate Math or why you keep typing 'recheck'
joe.gordon0 at gmail.com
Thu Nov 14 03:15:44 UTC 2013
TL;DR: Failure rate for gate jobs in graphite
I am sure many of you are wondering why you keep having to type 'recheck
bug x' all the time (I know I am), so I will try to answer that question.
Just before releasing Havana we started elastic-recheck to get a better
grasp on what transient issues the gate is having. This has helped us
classify the types of bugs we have and how often they occur, but it
doesn't completely explain why the gate appears to fail so often.
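Assuming all tests are independent, the chance that you will need to
run a recheck is roughly the sum of the failure rates of all the test runs,
and each patch commonly has several revisions, so a fairly low failure rate
can quickly cause you to use a recheck. (Strictly, that sum is the expected
number of failed runs; it only approximates a probability while it is small.)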
Or in a simple equation:
percent_need_a_recheck_per_review = failure_rate * tempest_jobs *
It turns out we have a graphite server, and after spending too much time on
it, below is the percent failure rate for:
So each job fails between 5% and 10% of the time.
Now to estimate percent_need_a_recheck_per_review.
- 2 revisions + 1 gate run,
- only count big tempest runs: full, neutron, postgres-full
- failure_rate of 5%
percent_need_a_recheck_per_review = 0.05 * 3 * 3 = 45%
So on a good day you may only have to run a recheck on just under half of
your patches.
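As a rough cross-check of the arithmetic above, here is a minimal Python sketch that computes the same linear estimate next to the exact probability of at least one failure, assuming every job run fails independently at the same rate. The function names and the `runs_per_review` parameter (revisions plus the gate run, as in the worked example) are my own labels for illustration, not names from the original equation.

```python
def linear_estimate(failure_rate, tempest_jobs, runs_per_review):
    """The simple product used in this post: rate * jobs * runs."""
    return failure_rate * tempest_jobs * runs_per_review


def prob_at_least_one_failure(failure_rate, tempest_jobs, runs_per_review):
    """Exact chance that at least one job run fails, assuming independence."""
    total_job_runs = tempest_jobs * runs_per_review
    return 1.0 - (1.0 - failure_rate) ** total_job_runs


# Good-day numbers from the list above: 5% failure rate,
# 3 big tempest jobs, 2 revisions + 1 gate run = 3 runs per review.
estimate = linear_estimate(0.05, 3, 3)
exact = prob_at_least_one_failure(0.05, 3, 3)
print("linear estimate: {:.0%}".format(estimate))
print("exact, if independent: {:.0%}".format(exact))
```

Under these assumptions the exact figure comes out at about 37%, somewhat below the 45% linear estimate, because the simple product counts a review with two failed runs twice.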
- 5 revisions + 1 gate run,
- count gating tests that run tempest: full, neutron, postgres-full,
- failure_rate of 10%
percent_need_a_recheck_per_review = 0.10 * 5 * 6 = 300%
But on a bad day you may need 3 rechecks to get your patch merged!
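One way to sanity-check the bad-day figure is a small Monte Carlo simulation. The sketch below is only an illustration under the same independence assumption: it treats a review as 5 jobs times 6 runs = 30 job runs, each failing 10% of the time. The function name, trial count, and seed are arbitrary choices of mine, and the model ignores that one recheck re-runs several jobs at once.

```python
import random


def simulate_reviews(failure_rate, tempest_jobs, runs_per_review,
                     trials=20000, seed=0):
    """Return (mean failed job runs per review, fraction of reviews hit)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    job_runs = tempest_jobs * runs_per_review
    total_failures = 0
    reviews_with_failure = 0
    for _trial in range(trials):
        # Count how many of this review's job runs fail.
        failures = sum(
            1 for _run in range(job_runs) if rng.random() < failure_rate
        )
        total_failures += failures
        if failures > 0:
            reviews_with_failure += 1
    return total_failures / trials, reviews_with_failure / trials


mean_failures, fraction_hit = simulate_reviews(0.10, 5, 6)
print("mean failed job runs per review: {:.2f}".format(mean_failures))
print("reviews with at least one failure: {:.1%}".format(fraction_hit))
```

The mean should land close to 3, matching the 300% figure read as an expected count of failed job runs, while the share of reviews that hit at least one failure should be in the mid-90s percent.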
In short, even tiny bugs in the gate have a major impact on the stability of
the gate! And as we grow the number of integrated projects and increase the
number of tests, this pattern will only get worse.
http://tinyurl.com/mqju53r