The fundamental downside to these sorts of defensive approaches is that they make it easier to avoid solving the underlying issues.
We certainly don't want to incentivize relying on aggregate throughput in place of actually making things faster and better. That's why I started this thread. However...
We've designed Zuul to perform most efficiently when it's running tests which are deterministic and mostly free of "false negative" failures. Making your tests and the software being tested efficient and predictable maximizes CI throughput under such an optimistic model.
This is a nice ideal and definitely what we should strive for, no doubt. But I think it's pretty clear that what we're doing here is hard, with potential failures at every layer above and below the thing you're working on at any given point. Striving to get there and expecting we ever will are very different things.

I remember when we moved from serialized tests to parallel ones, there was a lot of concern about whether we'd be able to reproduce a test failure that only occasionally happens due to ordering. The benefit of running in parallel greatly outweighs that cost, but still today it is incredibly time-consuming to reproduce, debug and fix issues that come from running in parallel. Our tests are more complicated (but better, of course) because of it, and just yesterday I -1'd a patch because I could spot some non-reentrant behavior it was proposing to add. In terms of aggregate performance, I'm sure we get far more done with parallelized tests and a somewhat higher spurious failure rate than we would with serialized tests and a very low failure rate.
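Just to make "non-reentrant" concrete, here's a made-up sketch (hypothetical names, nothing to do with the actual patch) of the kind of pattern that's fine serially but falls over in parallel:

    # Hypothetical sketch, not the actual patch: both tests share one
    # fixed, module-level path, so they pass when run one after another
    # but can clobber each other once a parallel runner puts them in
    # separate workers at the same time.
    import os
    import unittest

    SHARED_STATE_FILE = "/tmp/test-state.json"  # same path for every worker

    class TestWidgetCreate(unittest.TestCase):
        def test_create_writes_state(self):
            with open(SHARED_STATE_FILE, "w") as f:
                f.write('{"widgets": 1}')
            # Fails intermittently if another worker removes the file
            # between the write above and this check.
            self.assertTrue(os.path.exists(SHARED_STATE_FILE))

    class TestWidgetDelete(unittest.TestCase):
        def test_delete_clears_state(self):
            with open(SHARED_STATE_FILE, "w") as f:
                f.write('{"widgets": 0}')
            os.remove(SHARED_STATE_FILE)
            self.assertFalse(os.path.exists(SHARED_STATE_FILE))

The usual fix is per-test state (a tempfile or fixture) rather than anything module-level, which is exactly the sort of extra complexity our parallel suite forces on us.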
Sinking engineering effort into workarounds for unstable tests and buggy software consumes time that could have been invested in improving things instead, and to a great extent it also removes the incentive to bother.
Like everything, it's a tradeoff. If we didn't run in parallel, we'd waste a lot more gate resources running in serial, but we would almost certainly have to recheck less often, our tests could be a lot simpler, and we could spend our time (and be rewarded with faster test execution) making the actual servers faster instead of debugging failures. You might even argue that such an arrangement would benefit the users more than making our tests capable of running in parallel ;)
Sure, it could be seen as a pragmatic approach, accepting that in a large software ecosystem such seemingly pathological problems are actually inevitable, but that strikes me as a bit defeatist. There will of course always be temporary problems resulting from outages/incidents in donated resources or regressions in external dependencies outside our control, but if our background failure rate were significantly reduced, it would also be far easier to spot and mitigate an order-of-magnitude increase in failures quickly, rather than trying to find the cause of a sudden 25% uptick in failures.
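That part is fair, and to put some made-up numbers on it: if the background rate sits around 1%, a regression that pushes failures to 10% is obvious after a handful of runs, whereas a jump from 20% to 25% can hide in the noise for days.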
Looking back on the eight years I've been doing this, I really don't think that zero failures is realistic or even useful as a goal, unless it's your only goal. Thus, I expect we're always going to be ticking up or down over time. Debugging and fixing the non-trivial things that plague us is some of the harder work we do, harder in almost all cases than the work that introduced the problem in the first place. We definitely need to be constantly trying to increase stability, but let's be clear that it is likely the _most_ difficult thing a stacker can do with their time. --Dan