[openstack-dev] [nova] A modest proposal to reduce reviewer load
Mark McLoughlin
markmc at redhat.com
Thu Jun 19 07:32:25 UTC 2014
Hi Armando,
On Tue, 2014-06-17 at 14:51 +0200, Armando M. wrote:
> I wonder what the turnaround of trivial patches actually is, I bet you
> it's very very small, and as Daniel said, the human burden is rather
> minimal (I would be more concerned about slowing them down in the
> gate, but I digress).
>
> I think that introducing a two-tier level for patch approval can only
> mitigate the problem, but I wonder if we'd need to go a lot further,
> and rather figure out a way to borrow concepts from queueing theory so
> that they can be applied in the context of Gerrit. For instance
> Little's law [1] says:
>
> "The long-term average number of customers (in this context reviews)
> in a stable system L is equal to the long-term average effective
> arrival rate, λ, multiplied by the average time a customer spends in
> the system, W; or expressed algebraically: L = λW."
>
> L can be used to determine the number of core reviewers that a project
> will need at any given time, in order to meet a certain arrival rate
> and average time spent in the queue. If the number of core reviewers
> is a lot less than L then that core team is understaffed and will need
> to increase.
>
> If we figured out how to model and measure Gerrit as a queuing system,
> then we could improve its performance a lot more effectively; for
> instance, this idea of privileging trivial patches over longer patches
> has roots in a popular scheduling policy [2] for M/G/1 queues [3], but
> that does not really help with the aging of 'longer service time'
> patches and has no built-in preemption mechanism to avoid starvation.
>
> Just a crazy opinion...
> Armando
>
> [1] - http://en.wikipedia.org/wiki/Little's_law
> [2] - http://en.wikipedia.org/wiki/Shortest_job_first
> [3] - http://en.wikipedia.org/wiki/M/G/1_queue
This isn't crazy at all. We do have a problem that surely could be
studied and solved/improved by applying queueing theory or lessons from
fields like lean manufacturing. Right now, we're simply applying our
intuition, and the little I've read about these sorts of problems
suggests that intuition can easily take you down the wrong path.
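Just to make the Little's law arithmetic concrete - with completely
made-up numbers, not anything measured from our Gerrit:

    # Invented numbers, purely to illustrate L = lambda * W
    arrival_rate = 70.0       # patch revisions arriving per day (assumed)
    avg_days_in_system = 6.0  # average days from upload to merge/abandon (assumed)

    patches_in_flight = arrival_rate * avg_days_in_system
    print(patches_in_flight)  # L = 420 patches sitting in review

    # Rearranged: to get the average wait down to 2 days at that
    # arrival rate, we can't sustain more than 70 * 2 = 140 in flight.

The rearrangement is the interesting bit - it tells you what has to
give if you want to pin down one of the three quantities.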
There's a bunch of things that occur to me just glancing through those
articles:
- Do we have an unstable system? Would it be useful to have arrival
  and exit rate metrics to help highlight this? Over what time period
  would those rates need to be averaged to be useful? Daily, weekly,
  monthly, an entire release cycle? (There's a rough sketch of
  computing these rates after this list.)
- What are we trying to optimize for? The length of time in the
queue? The number of patches waiting in the queue? The response
time to a new patch revision?
- We have a single queue, with a bunch of service nodes with a wide
variance between their service rates, very little in the way of
scheduling policy, a huge rate of service nodes sending jobs back
for rework, a cost associated with maintaining a job while it sits
in the queue, the tendency for some jobs to disrupt many other jobs
with merge conflicts ... not simple.
- Is there any sort of natural limit in our queue size that makes the
system stable - e.g. do people naturally just stop submitting
patches at some point?
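On that first question, here's the sort of thing I have in mind - just
a sketch, assuming we've already pulled created and closed (merged or
abandoned) timestamps out of Gerrit somehow; the data below is
obviously invented:

    from datetime import datetime, timedelta

    def rates(created, closed, window_days, end):
        """Average arrival and exit rates (patches/day) over the
        window_days before 'end'.  created/closed are lists of
        datetimes pulled from Gerrit (closed = merged or abandoned)."""
        start = end - timedelta(days=window_days)
        arrivals = sum(1 for t in created if start <= t < end)
        exits = sum(1 for t in closed if start <= t < end)
        return arrivals / float(window_days), exits / float(window_days)

    # Toy data standing in for timestamps we'd query out of Gerrit.
    now = datetime(2014, 6, 19)
    created = [now - timedelta(hours=6 * i) for i in range(400)]
    closed = [now - timedelta(hours=8 * i) for i in range(400)]

    # If arrivals consistently outstrip exits over e.g. a monthly
    # window, the backlog grows without bound - an unstable system.
    for days in (1, 7, 30, 180):
        arrive, leave = rates(created, closed, days, now)
        print("%3d day window: %.1f arriving/day, %.1f leaving/day"
              % (days, arrive, leave))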
My intuition on all of this lately is that we need some way to model and
experiment with this queue, and I think we could make some interesting
progress if we could turn it into a queueing network rather than a
single, extremely complex queue.
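For example, even a toy discrete-event simulation - every number below
is invented, and it ignores most of the messiness above - would let us
compare scheduling policies and see the effect on time spent in the
system:

    import heapq
    import itertools
    import random

    def simulate(n_reviewers, policy, arrival_rate=70.0, rework_prob=0.4,
                 days=200, seed=42):
        """Crude model of the review queue - every number is invented.
        'policy' picks the next patch from the backlog; a finished
        pass is sent back for rework with probability rework_prob."""
        rnd, seq = random.Random(seed), itertools.count()
        events, backlog, done, free, t = [], [], [], n_reviewers, 0.0
        while t < days:                        # pre-generate arrivals
            t += rnd.expovariate(arrival_rate)
            patch = {'born': t, 'size': rnd.lognormvariate(-3.0, 1.0)}
            heapq.heappush(events, (t, next(seq), 'arrive', patch))
        while events:
            now, _, kind, patch = heapq.heappop(events)
            if kind == 'arrive':
                backlog.append(patch)
            else:                              # a reviewer finished a pass
                free += 1
                if rnd.random() < rework_prob:
                    patch['size'] *= 0.5       # next revision is smaller
                    heapq.heappush(events,
                                   (now + 1.0, next(seq), 'arrive', patch))
                else:
                    done.append(now - patch['born'])
            while free and backlog:            # hand work to idle reviewers
                nxt = policy(backlog)
                backlog.remove(nxt)
                free -= 1
                heapq.heappush(events,
                               (now + nxt['size'], next(seq), 'finish', nxt))
        return sum(done) / len(done)

    def fifo(backlog):
        return backlog[0]

    def shortest_first(backlog):
        return min(backlog, key=lambda p: p['size'])

    for name, pol in [('fifo', fifo), ('shortest-first', shortest_first)]:
        print("%14s: %.1f days in the system on average"
              % (name, simulate(8, pol)))

Nothing like the real thing, obviously, but a model at even this level
would let us ask "what if" questions without experimenting on the
actual review workflow.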
Say we had a front-end for Gerrit which tracked which queue a patch is
in; we could experiment with things like:
- a triage queue, with non-cores signed up as triagers looking for
obvious mistakes and choosing the next queue for a patch to enter
into
- queues having a small number of cores signed up as owners - e.g.
high priority bugfix, API, scheduler, object conversion, libvirt
driver, vmware driver, etc.
- we'd allow for a large number of queues so that cores could aim for
an "inbox zero" approach on individual queues, something that would
probably help keep cores motivated.
- we could apply different scheduling policies to each of the
  different queues - i.e. explicit guidance for cores about which
  patches they should pick off the queue next (there's a rough sketch
  of this after the list).
- we could track metrics on individual queues as well as the whole
network, identifying bottlenecks and properly recognizing which
reviewers are doing a small number of difficult reviews versus
those doing a high number of trivial reviews.
- we could require some queues to feed into a final approval queue
where some people are responsible for giving an approved patch a
final sanity check - i.e. there would be a class of reviewer with
good instincts who quickly churn through already-reviewed patches
  looking for the kind of mistakes people tend to make when they're
  down in the weeds.
- explicit queues for large, cross-cutting changes like coding style
changes. Perhaps we could stop servicing these queues at certain
points in the cycles, or reduce the rate at which they are
serviced.
- we could include specs and client patches in the same network so
  that they're prioritized in the same way.
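To make the front-end idea slightly more concrete, here's a sketch of
the sort of thing it might track - the queue names, owners and
policies are all placeholders, and the 'aging' term is just one way of
tackling the starvation problem Armando mentioned:

    import time

    class ReviewQueue(object):
        """One queue in the network - e.g. 'triage', 'libvirt', 'api'
        or 'final-approval' - with the cores signed up for it as owners."""

        def __init__(self, name, owners, policy):
            self.name, self.owners, self.policy = name, owners, policy
            self.patches = []            # each patch is a small dict

        def add(self, patch):
            patch['enqueued'] = time.time()
            self.patches.append(patch)

        def next_patch(self):
            """What should a core working this queue review next?"""
            return self.policy(self.patches) if self.patches else None

        def stats(self):
            """Per-queue metrics: backlog depth and longest wait, in days."""
            waits = [(time.time() - p['enqueued']) / 86400.0
                     for p in self.patches]
            return {'queue': self.name, 'depth': len(waits),
                    'oldest_days': max(waits) if waits else 0.0}

    def fifo(patches):
        return min(patches, key=lambda p: p['enqueued'])

    def shortest_first_with_aging(patches, age_weight=20.0):
        """Favour small patches, but every day a patch waits shaves a
        bit off its effective size, so big patches can't starve forever."""
        def effective_size(p):
            age_days = (time.time() - p['enqueued']) / 86400.0
            return p['size'] - age_weight * age_days
        return min(patches, key=effective_size)

    # A purely hypothetical slice of the network.
    network = {
        'triage': ReviewQueue('triage', owners=[], policy=fifo),
        'libvirt': ReviewQueue('libvirt', owners=['core-a', 'core-b'],
                               policy=shortest_first_with_aging),
        'final-approval': ReviewQueue('final-approval', owners=[],
                                      policy=fifo),
    }

    q = network['libvirt']
    q.add({'change': 'Iaaa111', 'size': 400})   # size: lines changed, say
    q.add({'change': 'Ibbb222', 'size': 12})
    print(q.next_patch()['change'])             # the 12-line patch, today
    print(q.stats())

The point isn't this particular shape - it's that queues, owners,
policies and per-queue metrics become explicit things we can measure
and tune, rather than argue about in the abstract.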
Lots of ideas, none of it is trivial ... but perhaps it'll spark
someone's interest :)
Mark.