> On 2/25/19 2:36 PM, Eric Fried wrote:
>> -1 to serializing jobs with stop-on-first-failure. Human time (having to iterate fixes one failed job at a time) is more valuable than computer time. That's why we make computers. If you want quick feedback on fast-running jobs (that are running in parallel with slower-running jobs), zuul.o.o is available and easy to use.
> In general I agree with this sentiment. However, I do think there comes a point where we'd be penny-wise and pound-foolish. If we're talking about 5-minute unit test jobs, I'm not sure how much human time you're actually losing by serializing behind them, but you may be saving significant amounts of computer time. If we're talking about sufficient gains in gate throughput, it might be worth losing 5 minutes here or there to save a couple of hours in other cases by not waiting in a long queue behind jobs on patches that are unmergeable anyway. That said, I wouldn't push too hard in either direction until someone crunches the numbers and figures out how much time it would have saved to not run long tests on patch sets with failing unit tests. I feel like that's probably possible to figure out, and if so we should do it before making any big decisions on this.
If we wanted to get more efficient about our CI resources, there are other possibilities I would prefer to see tried first. For example, do we need a whole separate node to run each unit and functional job, or could we run them in parallel (or even serially, since all together they would probably still take less time than e.g. a tempest job) on a single node?
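To make the single-node idea concrete: a rough sketch of what the serial variant could look like in a job's run script, assuming tox-based test environments. The env names are illustrative, and the actual `tox -e` invocation is stubbed out with a no-op here so the sketch runs anywhere; the point is just that one node can walk through every env, continue past failures, and report them all at the end rather than burning a node per env.

```shell
#!/bin/sh
# Hypothetical sketch: run several test envs serially on one node instead
# of one node per job. Env names are examples only.

run_env() {
    # A real job would run: tox -e "$1"
    true    # stub standing in for the tox run, so this is runnable anywhere
}

failed=""
for env in pep8 py3 functional; do
    if ! run_env "$env"; then
        # Keep going so one node still exercises every env,
        # but remember what broke.
        failed="$failed $env"
    fi
done

if [ -n "$failed" ]; then
    echo "FAILED envs:$failed"
    exit 1
fi
echo "all test envs passed on this node"
```

With all envs stubbed to succeed, this prints "all test envs passed on this node" and exits 0; swapping the stub for the real tox call would make a single node's pass/fail reflect every env it ran.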
I would also support a commit message tag (or something) that tells zuul not to bother running CI right now. Or a way to go to zuul.o.o and yank a patch out.
Realizing, of course, that these suggestions come from someone who uses zuul in the most superficial way possible (like, I wouldn't know how to write a... job? playbook? with a gun to my head), so they're probably exponentially harder than using the thing Chris mentioned.
-efried