Acknowledging Kolla is in the top 5. Deployment projects certainly tend to consume resources. I'll raise this at our next meeting and see what we can come up with.
Thanks - at least knowing and acknowledging is a great first step :)
7. Improve the reliability of jobs, especially voting and gating ones. Rechecks increase resource usage and time to results/merge. I found that querying the Zuul API for failed jobs in the gate pipeline is a good way to find unexpected failures.
For sure, and thanks for pointing this out. As mentioned in the Neutron example, 70-some hours becomes 140-some hours if the patch needs even one recheck, and more with each additional one. Rechecks due to spurious job failures reduce capacity and increase latency for everyone.
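For anyone who wants to try the Zuul API query suggested above, here is a rough sketch. It assumes the OpenDev Zuul web URL and the "openstack" tenant, and it assumes the builds endpoint accepts `pipeline`, `result`, `project` and `limit` as query parameters and returns objects with a `job_name` field; adjust for your deployment. The helper names are made up for illustration.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL and tenant of the Zuul deployment being queried.
ZUUL_API = "https://zuul.opendev.org/api/tenant/openstack"


def failed_gate_builds_url(project=None, limit=100):
    """Build the URL for recent FAILURE builds in the gate pipeline."""
    params = {"pipeline": "gate", "result": "FAILURE", "limit": limit}
    if project:
        params["project"] = project
    return f"{ZUUL_API}/builds?{urlencode(params)}"


def count_failures_by_job(builds):
    """Return (job_name, failure_count) pairs, most frequent first."""
    return Counter(b["job_name"] for b in builds).most_common()


def fetch_failed_gate_builds(project=None, limit=100):
    """Fetch the builds over HTTP (needs network access; not called here)."""
    with urlopen(failed_gate_builds_url(project, limit)) as resp:
        return json.load(resp)
```

Running `count_failures_by_job(fetch_failed_gate_builds("openstack/neutron"))` would then list which jobs fail most often in the gate, which is a reasonable starting point for finding the unreliable ones.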
8. Reduce the node count in multi node jobs.
Yeah, I hope that people with three or more nodes in a job have good reasons for it, but this is an important point. A multi-node job holds N nodes for at least the full job runtime, and possibly longer. If only some of the nodes are available initially, I believe Zuul will spin those workers up and then wait for the rest, which means you are burning node time without doing any work. I'm sure job configuration and other Zuul details cause this to vary a lot (and I'm not an expert here), but it's worth noting that lower node counts reduce the likelihood of the problem. --Dan
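To put rough numbers on the multi-node point, here is a back-of-the-envelope sketch. It assumes the waiting behaviour described above (early nodes sit idle until the full nodeset is ready), which may not match every Zuul or nodepool configuration. The function name and the figures are illustrative only.

```python
def node_hours(node_count, runtime_hours, wait_hours_per_node=()):
    """Node time held by one job run.

    Every node is held for the job runtime; nodes that came up early
    also hold time while waiting for the rest of the nodeset.
    """
    busy = node_count * runtime_hours
    idle = sum(wait_hours_per_node)
    return busy + idle


# Hypothetical 3-node job running for 1 hour, where the first two
# nodes each waited 15 minutes for the third to become available.
three_node = node_hours(3, 1.0, wait_hours_per_node=(0.25, 0.25))

# The same job squeezed onto a single node has nothing to wait for.
single_node = node_hours(1, 1.0)
```

With these made-up figures the three-node run holds 3.5 node hours against 1.0 for the single-node run, so the idle wait adds cost on top of the N-times multiplier.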