[openstack-dev] Gate Math or why you keep typing 'recheck'
Joe Gordon
joe.gordon0 at gmail.com
Thu Nov 14 03:15:44 UTC 2013
Hi All,
TL;DR: Failure rate for gate jobs in graphite:
http://tinyurl.com/mqju53r
I am sure many of you are wondering why you keep having to type 'recheck
bug x' all the time (I know I am), so I will try to answer that question
here.
Just before releasing Havana we started elastic-recheck to get a better
grasp on what transient issues the gate is having. This has helped us
classify the types of bugs we have and how often they occur[1] but it
doesn't completely explain why the gate appears to fail so often.
Assuming all test runs are independent, the chance that you will need to
run a recheck is roughly the sum of the failure rates of every tempest job
run on your patch. Each patch commonly has several revisions, and every
revision runs all of the jobs again, so even a fairly low failure rate can
quickly cause you to use a recheck.
Or in a simple equation:
percent_need_a_recheck_per_review = failure_rate * tempest_jobs *
patch_revisions
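A minimal sketch of that equation in Python, assuming independent job runs as stated above. The function names and the example inputs are mine, for illustration only, and are not part of any gate tooling. Strictly speaking the product is the expected number of failed job runs per review, which is why it can go above 100%; under the same independence assumption, the chance of hitting at least one failure is 1 - (1 - failure_rate)^(tempest_jobs * patch_revisions).

```python
def expected_failed_runs(failure_rate, tempest_jobs, patch_revisions):
    """The simple equation: expected number of failed job runs per review."""
    return failure_rate * tempest_jobs * patch_revisions


def prob_at_least_one_failure(failure_rate, tempest_jobs, patch_revisions):
    """Chance that at least one job run fails, assuming independent runs."""
    total_runs = tempest_jobs * patch_revisions
    return 1.0 - (1.0 - failure_rate) ** total_runs


if __name__ == "__main__":
    # Illustrative inputs: a 5% failure rate, 3 jobs, 3 runs of each job.
    rate, jobs, runs = 0.05, 3, 3
    print(f"simple equation: {expected_failed_runs(rate, jobs, runs):.0%}")
    print(f"at least one failure: {prob_at_least_one_failure(rate, jobs, runs):.0%}")
```

For small failure rates the two numbers stay close, so the simple product is a reasonable back-of-the-envelope figure; they drift apart as the product approaches or passes 100%.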
It turns out we have a graphite server, and after spending too much time on
it, below[2] is the percent failure rate for:
* gate-tempest-devstack-vm-full
* gate-tempest-devstack-vm-neutron
* check-tempest-devstack-vm-neutron
* check-tempest-devstack-vm-full
So each job fails between 5% and 10% of the time.
Now to estimate percent_need_a_recheck_per_review.
lower bound
=========
assumptions:
- 2 revisions + 1 gate run,
- only count big tempest runs: full, neutron, postgres-full
- failure_rate of 5%
percent_need_a_recheck_per_review = 0.05 * 3 * 3 = 45%
So on a good day you may only have to run a recheck on just under half of
your reviews.
upper bound
=========
assumptions:
- 5 revisions + 1 gate run,
- count gating jobs that run tempest: full, neutron, postgres-full,
large-ops, grenade
- failure_rate of 10%
percent_need_a_recheck_per_review = 0.10 * 5 * 6 = 300%
But on a bad day you may need 3 rechecks to get your patch merged!
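A quick Monte Carlo sketch to sanity-check both bounds, still assuming every job run fails independently at a fixed rate. The simulation, its function name, the trial count and the seed are my own illustration and are not gate data; only the failure rates, job counts and run counts come from the assumptions listed above.

```python
import random


def simulate_reviews(failure_rate, jobs, runs, trials=20000, seed=0):
    """Simulate many reviews; each review has jobs * runs independent job runs.

    Returns (mean failed job runs per review,
             fraction of reviews with at least one failed run).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    total_failures = 0
    reviews_hit = 0
    for _ in range(trials):
        failures = sum(
            1 for _ in range(jobs * runs) if rng.random() < failure_rate
        )
        total_failures += failures
        if failures:
            reviews_hit += 1
    return total_failures / trials, reviews_hit / trials


if __name__ == "__main__":
    scenarios = [
        ("lower bound", 0.05, 3, 3),  # 3 jobs, 2 revisions + 1 gate run
        ("upper bound", 0.10, 5, 6),  # 5 jobs, 5 revisions + 1 gate run
    ]
    for label, rate, jobs, runs in scenarios:
        mean_failures, hit_fraction = simulate_reviews(rate, jobs, runs)
        print(f"{label}: {mean_failures:.2f} failed job runs per review, "
              f"{hit_fraction:.0%} of reviews hit at least one failure")
```

The mean number of failed runs should land near 0.45 and 3.0, matching the two products above. The fraction of reviews that hit at least one failure comes out lower than the product, because several failures can land on the same review.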
In short, even tiny bugs in the gate have a major impact on the stability
of the gate! And as we grow the number of integrated projects and increase
the number of tests, this pattern will only get worse.
[1] http://status.openstack.org/elastic-recheck/
[2] http://tinyurl.com/mqju53r
Best,
Joe Gordon