The fundamental downside to these sorts of defensive approaches is that they make it easier to avoid solving the underlying issues.
We certainly don't want to incentivize relying on aggregate throughput in place of actually making things faster and better. That's why I started this thread. However...
We've designed Zuul to perform most efficiently when it's running tests which are deterministic and mostly free of "false negative" failures. Making your tests and the software being tested efficient and predictable maximizes CI throughput under such an optimistic model.
This is a nice ideal and definitely what we should strive for, no doubt. But I think it's pretty clear that what we're doing here is hard, with potential failures at every layer above and below the thing you're working on at any given point. Striving to get there and expecting we ever will are very different things.

I remember when we moved from serialized tests to parallel ones, there was a lot of concern about whether we'd be able to reproduce a test failure that only occasionally happens due to ordering. The benefit of running in parallel greatly outweighs that cost, but still today it is incredibly time-consuming to reproduce, debug and fix issues that come from running in parallel. Our tests are more complicated (but better, of course) because of it, and just yesterday I -1'd a patch because I could spot some non-reentrant behavior it was proposing to add. In terms of aggregate performance, I'm sure we get far more done with parallelized tests and a somewhat higher spurious failure rate than we would with serialized tests and a very low failure rate.
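Just to make "non-reentrant" concrete, here's a made-up sketch (hypothetical names, nothing to do with the actual patch) of the kind of pattern that's fine serially but falls over in parallel:

    # Hypothetical sketch, not the actual patch: both tests share one
    # fixed, module-level path, so they pass when run one after another
    # but can clobber each other once a parallel runner puts them in
    # separate workers at the same time.
    import os
    import unittest

    SHARED_STATE_FILE = "/tmp/test-state.json"  # same path for every worker

    class TestWidgetCreate(unittest.TestCase):
        def test_create_writes_state(self):
            with open(SHARED_STATE_FILE, "w") as f:
                f.write('{"widgets": 1}')
            # Fails intermittently if another worker removes the file
            # between the write above and this check.
            self.assertTrue(os.path.exists(SHARED_STATE_FILE))

    class TestWidgetDelete(unittest.TestCase):
        def test_delete_clears_state(self):
            with open(SHARED_STATE_FILE, "w") as f:
                f.write('{"widgets": 0}')
            os.remove(SHARED_STATE_FILE)
            self.assertFalse(os.path.exists(SHARED_STATE_FILE))

The usual fix is per-test state (a tempfile or fixture) rather than anything module-level, which is exactly the sort of extra complexity our parallel suite forces on us.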
Sinking engineering effort into workarounds for unstable tests and buggy software consumes time that could have been invested in improving things instead, and to a great extent it also removes the incentive to bother.
Like everything, it's a tradeoff. If we didn't run in parallel, we'd waste a lot more gate resources running in serial, but we would almost certainly have to recheck less often, our tests could be a lot simpler, and we could spend our time (and be rewarded with faster test execution) making the actual servers faster instead of debugging failures. You might even argue that such an arrangement would benefit the users more than making our tests capable of running in parallel ;)
Sure, it could be seen as a pragmatic approach, accepting that in a large software ecosystem such seemingly pathological problems are actually inevitable, but that strikes me as a bit defeatist. There will of course always be temporary problems resulting from outages/incidents in donated resources or regressions in external dependencies outside our control, but if our background failure rate were significantly reduced, it would also be far easier to spot and mitigate an order-of-magnitude increase in failures quickly, rather than trying to find the cause of a sudden 25% uptick in failures.
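That part is fair, and to put some made-up numbers on it: if the background rate sits around 1%, a regression that pushes failures to 10% is obvious after a handful of runs, whereas a jump from 20% to 25% can hide in the noise for days.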
Looking back on the eight years I've been doing this, I really don't think that zero failures is realistic or even useful as a goal, unless it's your only goal. Thus, I expect we're always going to be ticking up or down over time. Debugging and fixing the non-trivial things that plague us is some of the harder work we do, harder in almost all cases than the work that introduced the problem in the first place. We definitely need to be constantly trying to increase stability, but let's be clear that it is likely the _most_ difficult thing a stacker can do with their time. --Dan