Re: [all] Gate resources and performance

4 Feb 2021


      ...
Acknowledging Kolla is in the top 5. Deployment projects certainly
tend to consume resources. I'll raise this at our next meeting and see
what we can come up with.
Thanks - at least knowing and acknowledging is a great first step :)
...
7. Improve the reliability of jobs. Especially voting and gating
ones. Rechecks increase resource usage and time to results/merge.  I
found querying the zuul API for failed jobs in the gate pipeline is a
good way to find unexpected failures.
For sure, and thanks for pointing this out. As mentioned in the Neutron
example, 70-some hours becomes 140-some hours if the patch needs a couple
of rechecks. Rechecks due to spurious job failures reduce capacity and
increase latency for everyone.
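
As a rough illustration of the API query suggested above, here is a
minimal Python sketch. It assumes the public OpenDev Zuul at
zuul.opendev.org, the "openstack" tenant, and that each build record
carries "job_name" and "result" fields. The helper names are
hypothetical, not part of any Zuul client library.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public OpenDev Zuul instance and the "openstack" tenant.
ZUUL_API = "https://zuul.opendev.org/api/tenant/openstack"


def failed_gate_builds_url(limit=100):
    """Build the URL that lists recent failed builds in the gate pipeline."""
    query = urlencode({"pipeline": "gate", "result": "FAILURE", "limit": limit})
    return f"{ZUUL_API}/builds?{query}"


def fetch_builds(url):
    """Fetch and decode the JSON list of builds (needs network access)."""
    with urlopen(url) as resp:
        return json.load(resp)


def failures_by_job(builds):
    """Count failed builds per job name from a decoded builds response."""
    return Counter(
        b["job_name"] for b in builds if b.get("result") == "FAILURE"
    )
```

Something like failures_by_job(fetch_builds(failed_gate_builds_url()))
.most_common(10) would then list the jobs that fail most often in the
gate, which are the likely candidates for a reliability fix.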
...
8. Reduce the node count in multi node jobs.
Yeah, I hope that people with three or more nodes in a job are doing so
with lots of good reasoning, but this is an important point. Multi-node
jobs consume N nodes for the full job runtime, and possibly longer. If
only some of the nodes are initially available, I believe zuul will spin
those workers up and then wait for more, which means you are just
burning node time not doing anything. I'm sure job configuration and
other zuul details cause this to vary a lot (and I'm not an expert
here), but it's good to note that lower node counts will reduce the
likelihood of the problem.
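
To put a rough number on the idle time described above, here is a small
Python sketch. It is a simplified model only: it assumes each node is
held from the moment it becomes ready until the job finishes, and the
function name and sample figures are invented for illustration. Actual
zuul and nodepool behavior may differ.

```python
def node_hours(runtime_h, node_wait_h):
    """Node-hours held by one run of a multi-node job.

    runtime_h:   job runtime in hours once every node is available.
    node_wait_h: per node, the hours it sat ready but idle while the
                 remaining nodes were still being provisioned.
    """
    return sum(runtime_h + wait for wait in node_wait_h)


# Hypothetical three-node job with a two-hour runtime.
ideal = node_hours(2.0, [0.0, 0.0, 0.0])      # all nodes ready at once
staggered = node_hours(2.0, [0.5, 0.25, 0.0])  # two nodes waited on the last
wasted = staggered - ideal                     # node-hours spent idle
```

Under this model the staggered case holds 6.75 node-hours against an
ideal of 6.0, and every node removed from the job removes both its
runtime and its potential wait.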

--Dan

Dan Smith