On 2021-02-04 12:49:02 -0800 (-0800), Dan Smith wrote:
[...]
> If only some of the nodes are initially available, I believe zuul will spin those workers up and then wait for more, which means you are just burning node time not doing anything.
[...]

I can imagine some pathological situations where this might be the case occasionally, but for the most part they come up around the same time. At risk of diving into too much internal implementation detail, here's the typical process at work:

1. The Zuul scheduler determines that it needs to schedule a build of your job, checks the definition to determine how many of which sorts of nodes that will require, and then puts a node request into Zookeeper with those details.

2. A Nodepool launcher checks for pending requests in Zookeeper, sees the one for your queued build, and evaluates whether it has a provider with the right labels and sufficient available quota to satisfy this request (and if not, skips it in hopes another launcher can instead).

3. If that launcher decides to attempt to fulfil the request, it issues parallel server create calls in the provider it chose, then waits for them to become available and reachable over the Internet.

4. Once the booted nodes are reachable, the launcher returns the request in Zookeeper and the node records are locked for use in the assigned build until it completes.

Even our smallest providers have dozens of instances worth of capacity, and most multi-node jobs use only two or maybe three nodes for a build (granted I've seen some using five); so with the constant churn in builds completing and releasing spent nodes for deletion, there shouldn't be a significant amount of time spent where quota is consumed by some already active instances awaiting their compatriots for the same node request to also reach a ready state (though if the provider has a high incidence of boot failures, this becomes increasingly likely because some server create calls will need to be reissued).

Where this gets a little more complicated is with dependent jobs, as Zuul requires they all be satisfied from the same provider.
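
To make the four steps above a bit more concrete, here is a toy model in Python. It is only a sketch of the behaviour as described in this message, not code from Zuul or Nodepool: the class names, the launcher_pass helper, the label name, and the quota numbers are all made up for illustration, and the "boot" step is instantaneous rather than a set of real server create calls.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRequest:
    """Hypothetical stand-in for the request the scheduler puts into Zookeeper."""
    build: str
    labels: list                      # one entry per node the job needs
    state: str = "requested"
    nodes: list = field(default_factory=list)


@dataclass
class Provider:
    """Hypothetical provider with a fixed instance quota."""
    name: str
    labels: set
    quota: int
    in_use: int = 0

    def can_satisfy(self, request):
        # Step 2: right labels and enough free quota for the WHOLE request.
        free = self.quota - self.in_use
        return set(request.labels) <= self.labels and len(request.labels) <= free


def launcher_pass(provider, pending):
    """Steps 2-4 for one launcher: take only requests it can fully satisfy."""
    for request in pending:
        if request.state != "requested":
            continue
        if not provider.can_satisfy(request):
            continue  # skip it in hopes another launcher can take it
        # Step 3: a real launcher issues parallel server create calls and
        # waits for reachability; this toy model "boots" nodes instantly.
        provider.in_use += len(request.labels)
        request.nodes = [
            f"{provider.name}-{label}-{i}" for i, label in enumerate(request.labels)
        ]
        # Step 4: the request is returned as a unit, with all of its nodes.
        request.state = "fulfilled"


# Illustrative numbers only; they are not real provider quotas.
small = Provider("small", {"ubuntu-focal"}, quota=1)
large = Provider("large", {"ubuntu-focal"}, quota=24)

# Step 1: the scheduler queues a two-node request for a build.
request = NodeRequest("multinode-job", ["ubuntu-focal", "ubuntu-focal"])
pending = [request]

launcher_pass(small, pending)   # too little quota, request is skipped
launcher_pass(large, pending)   # enough quota, both nodes come from here
print(request.state, request.nodes)
```

The point of the sketch is that a launcher only picks up a request when its provider has room for every node in it, so partially booted node sets sitting idle should mostly arise from boot failures and retries rather than from the normal flow.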
Certainly a large set of interdependent multi-node jobs becomes harder to choose a provider for and needs to wait longer for enough capacity to be freed there.
-- 
Jeremy Stanley