[openstack-dev] Gate Math or why you keep typing 'recheck'

Robert Collins robertc at robertcollins.net
Thu Nov 14 03:30:31 UTC 2013

On 14 November 2013 16:15, Joe Gordon <joe.gordon0 at gmail.com> wrote:
> Hi All,
> TL;DR: Failure rate for gate jobs in graphite http://tinyurl.com/mqju53r
> In short, even tiny bugs in gate have a major impact on the stability of
> gate!  And as we grow the number of integrated projects and increase the
> number of tests this pattern will only get worse.

Thanks for the analysis!

I have two comments (yes, only two!)

Firstly, 5% isn't a tiny bug. It's a huge bug. We're doing thousands
of runs a day. A tiny bug, IMO, has a 0.01% occurrence rate or less.
Let's recalibrate our heads around failure rates:
a 0.01% failure in a 10K node cloud doing deploys once a day will
happen every day (on average :)).
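
To make the arithmetic concrete, here is a minimal sketch. It assumes
each node's deploy fails independently at the same rate, which the
figures above do not state; the function names are made up for
illustration.

```python
def expected_failures(rate, trials):
    """Expected number of failures across independent trials."""
    return rate * trials


def prob_at_least_one(rate, trials):
    """Chance that at least one of the independent trials fails."""
    return 1.0 - (1.0 - rate) ** trials


# 0.01% failure rate, 10K nodes, one deploy per node per day.
rate = 0.0001
nodes = 10_000

print(expected_failures(rate, nodes))   # about one failure per day
print(prob_at_least_one(rate, nodes))   # roughly 0.63 on any given day
```

So "every day (on average)" means one expected failure per day; under
the independence assumption, any particular day still has roughly a
one in three chance of being clean.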

Secondly, Google in their testing talks say they've basically given up
on the idea that they can eliminate all such issues in automated tests
- in their opinion it's an engineering tradeoff... I think we can do
better :) - I'd like to see us start running 5 or 10 duplicate
scenarios to set a lower bound on flakey tests that can enter the
system /at all/, and to look for and back out changes that introduce
more subtle flakey bugs.
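
A rough sketch of what 5 or 10 duplicate runs would buy us, again
assuming runs fail independently at a fixed rate (real flaky tests may
not behave that way). The function names are hypothetical and not part
of any gate tooling.

```python
import math


def detection_probability(failure_rate, runs):
    """Chance that at least one of `runs` duplicate runs fails."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate, confidence):
    """Smallest number of duplicate runs that catches a flaky test
    with the given failure rate at the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


# A 5% flaky test against 5 and 10 duplicate runs.
for runs in (5, 10):
    print(runs, round(detection_probability(0.05, runs), 3))

# Runs needed to catch a 5% flaky test 95% of the time.
print(runs_needed(0.05, 0.95))
```

Under those assumptions, 10 duplicates catch a 5% flaky test only
about 40% of the time, so duplicates mostly filter out the badly flaky
tests; the subtler ones still need the look-for-and-back-out step.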


Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud

