[OpenStack-Infra] An idea to scale Zuul
James E. Blair
jeblair at openstack.org
Thu Jan 9 06:59:45 UTC 2014
Hi,
When Zuul gets very busy, it can end up launching hundreds of jobs
nearly simultaneously. Each of them has to perform several git fetch
operations to obtain the changes needed for testing. They fetch from
the git repos on the Zuul server because Zuul itself is creating those
commits by locally merging several changes together according to what's
in the queue.
Fetching and merging git patchsets (which is single-threaded) adds some
load to the server, but serving those git refs to 400 Jenkins nodes
nearly simultaneously is a particular burden. It was too much for our
previous server; we've moved Zuul to a faster server now, but it would
be nice to have a more scalable solution for the future.
I'd like to move the Zuul git merging component into a separate process
that can be located on a separate host (or hosts) and scaled out.
The current zuul-server would continue to manage the queue and launch
jobs, but as it processes the queue and decides which changes should be
composed and built into zuul git refs, it would package the info about
each ref and put it on the gearman queue as a work item. An instance of
the new component (zuul-merger) would pick up that job, fetch the
needed refs from Gerrit, and merge them. It would also serve the
resulting git repo in the same way that Zuul does now.
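
A rough sketch of what such a work item and merger loop might look like
follows. It is only an illustration: an in-process queue.Queue stands in
for the gearman queue, the field names (project, zuul_ref, changes) are
made up for the example, and fake_fetch_and_merge is a placeholder for
the real git fetch and merge against Gerrit.

```python
import json
import queue

# Stand-in for the gearman queue (assumption: a real deployment would
# talk to a gearman server instead of an in-process queue).
merge_queue = queue.Queue()


def enqueue_merge(project, zuul_ref, changes):
    """Package the info about one zuul ref and queue it as a work item."""
    item = {
        "project": project,
        "zuul_ref": zuul_ref,
        # Ordered list of Gerrit refs to merge, in queue order.
        "changes": list(changes),
    }
    merge_queue.put(json.dumps(item))


def fake_fetch_and_merge(project, changes):
    """Placeholder for fetching the refs from Gerrit and merging them."""
    return "merged:" + "+".join(changes)


def merger_work_once(merger_url, fetch_and_merge=fake_fetch_and_merge):
    """Take one work item, merge it, and report where the result is served."""
    item = json.loads(merge_queue.get())
    commit = fetch_and_merge(item["project"], item["changes"])
    return {
        "zuul_ref": item["zuul_ref"],
        "commit": commit,
        # The merger serves the resulting repo itself, so it reports its
        # own URL back with the result.
        "zuul_url": merger_url,
    }
```

The merge step is passed in as a callable here so that the queue
handling can be exercised without touching git.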
Zuul would not have to wait for a response before continuing to process
the queue, and since it would not be doing the merge work itself, it
would be able to move through the queue _much_ faster than currently.
Once Zuul _does_
receive a completion response from a zuul-merger, it can then launch the
jobs for that change. It will pass the URL for that particular
zuul-merger (as ZUUL_URL) to the jobs so that they know from which
merger to fetch the zuul ref. We can also use the cancel job
functionality in gearman if Zuul decides to reorder the queue.
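
The asynchronous flow described above could be sketched roughly as
follows. Again this is only an illustration under assumptions: threads
and queue.Queue stand in for zuul-merger processes and gearman, the
merger URLs are hypothetical, and job cancellation on a queue reorder is
left out. The one name taken from the proposal is ZUUL_URL.

```python
import queue
import threading


def merger(url, work, done):
    """One merger: pull work items until told to stop."""
    while True:
        change = work.get()
        if change is None:  # sentinel: no more work
            break
        # A real merger would fetch and merge here; the sketch only
        # reports which merger handled the change.
        done.put({"change": change, "zuul_url": url})


def process_queue(changes, merger_urls):
    """Submit every merge without waiting, then launch jobs as results arrive."""
    work, done = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=merger, args=(url, work, done))
        for url in merger_urls
    ]
    for t in threads:
        t.start()

    # Zuul walks the whole queue without blocking on any single merge.
    for change in changes:
        work.put(change)

    launched = []
    for _ in changes:
        result = done.get()
        # Jobs learn which merger holds their ref through ZUUL_URL.
        launched.append(
            {"change": result["change"], "ZUUL_URL": result["zuul_url"]}
        )

    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return launched
```

Because the mergers pull from a shared queue, which merger handles a
given change (and the order results come back in) is not fixed, which is
why each result has to carry its own URL.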
We can scale out the mergers horizontally and they can operate in
parallel, which should also improve the responsiveness of overall queue
processing.
The only downside I currently foresee is that if we scale out the
mergers too much, we will see a performance impact on Gerrit; therefore
we should anticipate having a reasonably small number of these (2-8,
perhaps).
Since this is already quite modular, I think the implementation should
be relatively simple.
How does that sound?
-Jim