Re: [all] Gate resources and performance

4 Feb 2021

On 2021-02-04 12:49:02 -0800 (-0800), Dan Smith wrote:
[...]
> If only some of the nodes are initially available, I believe zuul
> will spin those workers up and then wait for more, which means you
> are just burning node time not doing anything.
[...]
I can imagine some pathological situations where this might be the
case occasionally, but for the most part they come up around the
same time. At risk of diving into too much internal implementation
detail, here's the typical process at work:

1. The Zuul scheduler determines that it needs to schedule a build
   of your job, checks the definition to determine how many of which
   sorts of nodes that will require, and then puts a node request
   into Zookeeper with those details.

2. A Nodepool launcher checks for pending requests in Zookeeper,
   sees the one for your queued build, and evaluates whether it has
   a provider with the right labels and sufficient available quota
   to satisfy this request (and if not, skips it in the hope that
   another launcher can satisfy it instead).

3. If that launcher decides to attempt to fulfil the request, it
   issues parallel server create calls in the provider it chose,
   then waits for them to become available and reachable over the
   Internet.

4. Once the booted nodes are reachable, the launcher returns the
   request in Zookeeper and the node records are locked for use in
   the assigned build until it completes.
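
To make the four steps above a bit more concrete, here is a minimal
sketch in Python. It is not the actual Zuul or Nodepool code: the
NodeRequest and Provider classes, the launcher_handle function, the
label name and the quota figures are all hypothetical, and a plain
Python object stands in for the record kept in Zookeeper.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class NodeRequest:
    """Hypothetical stand-in for the request the scheduler stores (step 1)."""
    build: str
    labels: List[str]  # one entry per node the job definition asks for
    state: str = "requested"
    nodes: List[str] = field(default_factory=list)


@dataclass
class Provider:
    """Hypothetical provider with a fixed instance quota."""
    name: str
    labels: Set[str]
    quota: int
    in_use: int = 0

    def can_satisfy(self, request: NodeRequest) -> bool:
        has_labels = all(label in self.labels for label in request.labels)
        has_quota = self.quota - self.in_use >= len(request.labels)
        return has_labels and has_quota


def launcher_handle(request: NodeRequest, providers: List[Provider]) -> bool:
    """Steps 2 to 4 as seen by one launcher; False means it skipped the request."""
    # Step 2: look for a provider with the right labels and enough quota.
    provider = next((p for p in providers if p.can_satisfy(request)), None)
    if provider is None:
        return False  # leave the request for another launcher
    # Step 3: create every node for the request in the chosen provider
    # (the real calls are parallel; a loop is enough for this sketch).
    for index, label in enumerate(request.labels):
        provider.in_use += 1
        request.nodes.append(f"{provider.name}-{label}-{index}")
    # Step 4: hand the fulfilled request back for use by the build.
    request.state = "fulfilled"
    return True


# Step 1: the scheduler records what the build needs (two nodes here).
request = NodeRequest(build="example-multinode-job", labels=["ubuntu", "ubuntu"])
small = Provider("small", {"ubuntu"}, quota=1)
large = Provider("large", {"ubuntu"}, quota=24)
handled = launcher_handle(request, [small, large])
```

In this toy run the "small" provider lacks quota for two nodes, so
both end up in "large"; the request is never split across providers.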

Even our smallest providers have dozens of instances' worth of
capacity, and most multi-node jobs use only two or maybe three nodes
for a build (granted, I've seen some using five). With the constant
churn of builds completing and releasing spent nodes for deletion,
there shouldn't be a significant amount of time in which quota is
consumed by already active instances awaiting their compatriots for
the same node request to reach a ready state. If the provider has a
high incidence of boot failures, though, this becomes increasingly
likely because some server create calls will need to be reissued.
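
As a rough illustration of the idle time in question, the sketch
below adds up how long already booted nodes sit waiting for the
slowest node in the same request. The idle_node_seconds function and
the boot times are made up for the example, not measurements from
any of our providers.

```python
from typing import List


def idle_node_seconds(boot_seconds: List[int]) -> int:
    """Total time ready nodes spend waiting for the slowest sibling.

    The request is only fulfilled once every node is ready, so each
    node idles from its own ready time until the last one comes up.
    """
    ready_at = max(boot_seconds)
    return sum(ready_at - seconds for seconds in boot_seconds)


# Three nodes booting at roughly the same pace (illustrative numbers).
typical = idle_node_seconds([60, 65, 70])

# Same request, but the third create call fails and is reissued, so
# that node is only ready after 180 seconds (again illustrative).
with_retry = idle_node_seconds([60, 65, 180])
```

With these assumed numbers the typical case wastes 15 node-seconds
in total, while a single reissued create call pushes it to 235.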

Where this gets a little more complicated is with dependent jobs, as
Zuul requires they all be satisfied from the same provider. A large
set of interdependent multi-node jobs is certainly harder to choose
a provider for, and needs to wait longer for enough capacity to be
freed there.
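
A small sketch of why the same-provider constraint hurts: an
independent job can land on any provider with room, while a
dependent job can only use the provider its parent already ran in.
The candidate_providers function, the provider names and the free
capacity figures are hypothetical, not how Zuul or Nodepool
represent this internally.

```python
from typing import Dict, List, Optional


def candidate_providers(free_capacity: Dict[str, int],
                        nodes_needed: int,
                        pinned: Optional[str] = None) -> List[str]:
    """Providers able to host a job right now.

    If pinned is set, the job depends on one that already ran in
    that provider, so no other provider is considered.
    """
    names = [pinned] if pinned is not None else list(free_capacity)
    return [name for name in names if free_capacity[name] >= nodes_needed]


# Assumed free instance slots per provider at this moment.
free = {"provider-a": 1, "provider-b": 5}

independent = candidate_providers(free, nodes_needed=3)
dependent = candidate_providers(free, nodes_needed=3, pinned="provider-a")
```

Here the independent three-node job can go to provider-b
immediately, whereas the dependent one gets an empty list and has to
wait for provider-a to free up two more slots.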
-- 
Jeremy Stanley