[openstack-dev] [nova] A modest proposal to reduce reviewer load
Mark McLoughlin
markmc at redhat.com
Thu Jun 19 07:32:25 UTC 2014
Hi Armando,
On Tue, 2014-06-17 at 14:51 +0200, Armando M. wrote:
> I wonder what the turnaround of trivial patches actually is, I bet you
> it's very very small, and as Daniel said, the human burden is rather
> minimal (I would be more concerned about slowing them down in the
> gate, but I digress).
>
> I think that introducing a two-tier level for patch approval can only
> mitigate the problem, but I wonder if we'd need to go a lot further,
> and rather figure out a way to borrow concepts from queueing theory so
> that they can be applied in the context of Gerrit. For instance
> Little's law [1] says:
>
> "The long-term average number of customers (in this context reviews)
> in a stable system L is equal to the long-term average effective
> arrival rate, λ, multiplied by the average time a customer spends in
> the system, W; or expressed algebraically: L = λW."
>
> L can be used to determine the number of core reviewers that a project
> will need at any given time, in order to meet a certain arrival rate
> and average time spent in the queue. If the number of core reviewers
> is a lot less than L then that core team is understaffed and will need
> to increase.
>
> If we figured out how to model and measure Gerrit as a queuing system,
> then we could improve its performance a lot more effectively; for
> instance, this idea of privileging trivial patches over longer patches
> has roots in a popular scheduling policy [2] for M/G/1 queues [3], but
> that does not really help with the aging of 'longer service time'
> patches and has no built-in preemption mechanism to avoid starvation.
>
> Just a crazy opinion...
> Armando
>
> [1] - http://en.wikipedia.org/wiki/Little's_law
> [2] - http://en.wikipedia.org/wiki/Shortest_job_first
> [3] - http://en.wikipedia.org/wiki/M/G/1_queue
This isn't crazy at all. We do have a problem that surely could be
studied and solved/improved by applying queueing theory or lessons from
fields like lean manufacturing. Right now, we're simply applying our
intuition, and the little I've read about these sorts of problems
suggests that intuition can easily take you down the wrong path.
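Just to make the Little's law arithmetic concrete - with completely
made-up numbers, not anything measured from our Gerrit:

    # Invented numbers, purely to illustrate L = lambda * W
    arrival_rate = 70.0       # patch revisions arriving per day (assumed)
    avg_days_in_system = 6.0  # average days from upload to merge/abandon (assumed)

    patches_in_flight = arrival_rate * avg_days_in_system
    print(patches_in_flight)  # L = 420 patches sitting in review

    # Rearranged: to get the average wait down to 2 days at that
    # arrival rate, we can't sustain more than 70 * 2 = 140 in flight.

The rearrangement is the interesting bit - it tells you what has to
give if you want to pin down one of the three quantities.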
There's a bunch of things that occur to me just glancing through those
articles:
- Do we have an unstable system? Would it be useful to have arrival
  and exit rate metrics to help highlight this? Over what time period
  would those rates need to be averaged to be useful? Daily, weekly,
  monthly, an entire release cycle? (There's a rough sketch of
  computing these rates after this list.)
- What are we trying to optimize for? The length of time in the
queue? The number of patches waiting in the queue? The response
time to a new patch revision?
- We have a single queue, with a bunch of service nodes with a wide
variance between their service rates, very little in the way of
scheduling policy, a huge rate of service nodes sending jobs back
for rework, a cost associated with maintaining a job while it sits
in the queue, the tendency for some jobs to disrupt many other jobs
with merge conflicts ... not simple.
- Is there any sort of natural limit in our queue size that makes the
system stable - e.g. do people naturally just stop submitting
patches at some point?
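On that first question, here's the sort of thing I have in mind - just
a sketch, assuming we've already pulled created and closed (merged or
abandoned) timestamps out of Gerrit somehow; the data below is
obviously invented:

    from datetime import datetime, timedelta

    def rates(created, closed, window_days, end):
        """Average arrival and exit rates (patches/day) over the
        window_days before 'end'.  created/closed are lists of
        datetimes pulled from Gerrit (closed = merged or abandoned)."""
        start = end - timedelta(days=window_days)
        arrivals = sum(1 for t in created if start <= t < end)
        exits = sum(1 for t in closed if start <= t < end)
        return arrivals / float(window_days), exits / float(window_days)

    # Toy data standing in for timestamps we'd query out of Gerrit.
    now = datetime(2014, 6, 19)
    created = [now - timedelta(hours=6 * i) for i in range(400)]
    closed = [now - timedelta(hours=8 * i) for i in range(400)]

    # If arrivals consistently outstrip exits over e.g. a monthly
    # window, the backlog grows without bound - an unstable system.
    for days in (1, 7, 30, 180):
        arrive, leave = rates(created, closed, days, now)
        print("%3d day window: %.1f arriving/day, %.1f leaving/day"
              % (days, arrive, leave))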
My intuition on all of this lately is that we need some way to model and
experiment with this queue, and I think we could make some interesting
progress if we could turn it into a queueing network rather than a
single, extremely complex queue.
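For example, even a toy discrete-event simulation - every number below
is invented, and it ignores most of the messiness above - would let us
compare scheduling policies and see the effect on time spent in the
system:

    import heapq
    import itertools
    import random

    def simulate(n_reviewers, policy, arrival_rate=70.0, rework_prob=0.4,
                 days=200, seed=42):
        """Crude model of the review queue - every number is invented.
        'policy' picks the next patch from the backlog; a finished
        pass is sent back for rework with probability rework_prob."""
        rnd, seq = random.Random(seed), itertools.count()
        events, backlog, done, free, t = [], [], [], n_reviewers, 0.0
        while t < days:                        # pre-generate arrivals
            t += rnd.expovariate(arrival_rate)
            patch = {'born': t, 'size': rnd.lognormvariate(-3.0, 1.0)}
            heapq.heappush(events, (t, next(seq), 'arrive', patch))
        while events:
            now, _, kind, patch = heapq.heappop(events)
            if kind == 'arrive':
                backlog.append(patch)
            else:                              # a reviewer finished a pass
                free += 1
                if rnd.random() < rework_prob:
                    patch['size'] *= 0.5       # next revision is smaller
                    heapq.heappush(events,
                                   (now + 1.0, next(seq), 'arrive', patch))
                else:
                    done.append(now - patch['born'])
            while free and backlog:            # hand work to idle reviewers
                nxt = policy(backlog)
                backlog.remove(nxt)
                free -= 1
                heapq.heappush(events,
                               (now + nxt['size'], next(seq), 'finish', nxt))
        return sum(done) / len(done)

    def fifo(backlog):
        return backlog[0]

    def shortest_first(backlog):
        return min(backlog, key=lambda p: p['size'])

    for name, pol in [('fifo', fifo), ('shortest-first', shortest_first)]:
        print("%14s: %.1f days in the system on average"
              % (name, simulate(8, pol)))

Nothing like the real thing, obviously, but a model at even this level
would let us ask "what if" questions without experimenting on the
actual review workflow.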
Say we had a front-end for Gerrit which tracked which queue a patch is
in; we could experiment with things like:
- a triage queue, with non-cores signed up as triagers looking for
obvious mistakes and choosing the next queue for a patch to enter
into
- queues having a small number of cores signed up as owners - e.g.
high priority bugfix, API, scheduler, object conversion, libvirt
driver, vmware driver, etc.
- we'd allow for a large number of queues so that cores could aim for
an "inbox zero" approach on individual queues, something that would
probably help keep cores motivated.
- we could apply different scheduling policies to each of the
  different queues - i.e. explicit guidance for cores about which
  patches they should pick off the queue next (there's a rough sketch
  of this after the list).
- we could track metrics on individual queues as well as the whole
network, identifying bottlenecks and properly recognizing which
reviewers are doing a small number of difficult reviews versus
those doing a high number of trivial reviews.
- we could require some queues to feed into a final approval queue
where some people are responsible for giving an approved patch a
final sanity check - i.e. there would be a class of reviewer with
good instincts who quickly churn through already-reviewed patches
  looking for the kind of mistakes people tend to make when they're
  down in the weeds.
- explicit queues for large, cross-cutting changes like coding style
changes. Perhaps we could stop servicing these queues at certain
points in the cycles, or reduce the rate at which they are
serviced.
- we could include specs and client patches in the same network so
  that they're prioritized in the same way.
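To make the front-end idea slightly more concrete, here's a sketch of
the sort of thing it might track - the queue names, owners and
policies are all placeholders, and the 'aging' term is just one way of
tackling the starvation problem Armando mentioned:

    import time

    class ReviewQueue(object):
        """One queue in the network - e.g. 'triage', 'libvirt', 'api'
        or 'final-approval' - with the cores signed up for it as owners."""

        def __init__(self, name, owners, policy):
            self.name, self.owners, self.policy = name, owners, policy
            self.patches = []            # each patch is a small dict

        def add(self, patch):
            patch['enqueued'] = time.time()
            self.patches.append(patch)

        def next_patch(self):
            """What should a core working this queue review next?"""
            return self.policy(self.patches) if self.patches else None

        def stats(self):
            """Per-queue metrics: backlog depth and longest wait, in days."""
            waits = [(time.time() - p['enqueued']) / 86400.0
                     for p in self.patches]
            return {'queue': self.name, 'depth': len(waits),
                    'oldest_days': max(waits) if waits else 0.0}

    def fifo(patches):
        return min(patches, key=lambda p: p['enqueued'])

    def shortest_first_with_aging(patches, age_weight=20.0):
        """Favour small patches, but every day a patch waits shaves a
        bit off its effective size, so big patches can't starve forever."""
        def effective_size(p):
            age_days = (time.time() - p['enqueued']) / 86400.0
            return p['size'] - age_weight * age_days
        return min(patches, key=effective_size)

    # A purely hypothetical slice of the network.
    network = {
        'triage': ReviewQueue('triage', owners=[], policy=fifo),
        'libvirt': ReviewQueue('libvirt', owners=['core-a', 'core-b'],
                               policy=shortest_first_with_aging),
        'final-approval': ReviewQueue('final-approval', owners=[],
                                      policy=fifo),
    }

    q = network['libvirt']
    q.add({'change': 'Iaaa111', 'size': 400})   # size: lines changed, say
    q.add({'change': 'Ibbb222', 'size': 12})
    print(q.next_patch()['change'])             # the 12-line patch, today
    print(q.stats())

The point isn't this particular shape - it's that queues, owners,
policies and per-queue metrics become explicit things we can measure
and tune, rather than argue about in the abstract.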
Lots of ideas, none of it is trivial ... but perhaps it'll spark
someone's interest :)
Mark.