[openstack-dev] [nova] A modest proposal to reduce reviewer load
mbooth at redhat.com
Thu Jun 19 12:42:02 UTC 2014
On 19/06/14 13:22, Mark McLoughlin wrote:
> On Thu, 2014-06-19 at 09:34 +0100, Matthew Booth wrote:
>> On 19/06/14 08:32, Mark McLoughlin wrote:
>>> Hi Armando,
>>> On Tue, 2014-06-17 at 14:51 +0200, Armando M. wrote:
>>>> I wonder what the turnaround of trivial patches actually is, I bet you
>>>> it's very very small, and as Daniel said, the human burden is rather
>>>> minimal (I would be more concerned about slowing them down in the
>>>> gate, but I digress).
>>>> I think that introducing a two-tier level for patch approval can only
>>>> mitigate the problem, but I wonder if we'd need to go a lot further,
>>>> and rather figure out a way to borrow concepts from queueing theory so
>>>> that they can be applied in the context of Gerrit. For instance
>>>> Little's law says:
>>>> "The long-term average number of customers (in this context reviews)
>>>> in a stable system L is equal to the long-term average effective
>>>> arrival rate, λ, multiplied by the average time a customer spends in
>>>> the system, W; or expressed algebraically: L = λW."
>>>> L can be used to determine the number of core reviewers that a project
>>>> will need at any given time, in order to meet a certain arrival rate
>>>> and average time spent in the queue. If the number of core reviewers
>>>> is a lot less than L then that core team is understaffed and will need
>>>> to increase.
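As a rough illustration of the arithmetic in Little's law quoted above, the sketch below computes L from an arrival rate and an average time in the system, and rearranges the same relation to get W. The function names and the figures (50 new reviews per day, 6 days in review) are made-up assumptions for the example, not measurements taken from Gerrit.

```python
def average_in_system(arrival_rate, avg_time_in_system):
    """Little's law: L = lambda * W."""
    if arrival_rate < 0 or avg_time_in_system < 0:
        raise ValueError("rate and time must be non-negative")
    return arrival_rate * avg_time_in_system


def average_time_in_system(avg_in_system, arrival_rate):
    """Little's law rearranged: W = L / lambda."""
    if arrival_rate <= 0:
        raise ValueError("arrival rate must be positive")
    return avg_in_system / arrival_rate


# Assumed, illustrative numbers only (not real Gerrit data).
reviews_per_day = 50.0   # lambda: average arrival rate
days_in_review = 6.0     # W: average time a review spends in the system

open_reviews = average_in_system(reviews_per_day, days_in_review)
print(open_reviews)  # 300.0
```

The relation only holds for long-term averages in a stable system, so it says nothing useful if reviews arrive faster than they leave.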
>>>> If we figured out how to model and measure Gerrit as a queuing system,
>>>> then we could improve its performance a lot more effectively; for
>>>> instance, this idea of privileging trivial patches over longer patches
>>>> has roots in a popular scheduling policy for M/G/1 queues, but
>>>> that does not really help aging of 'longer service time' patches and
>>>> does not have a preemption mechanism built-in to avoid starvation.
>>>> Just a crazy opinion...
>>>>  - http://en.wikipedia.org/wiki/Little's_law
>>>>  - http://en.wikipedia.org/wiki/Shortest_job_first
>>>>  - http://en.wikipedia.org/wiki/M/G/1_queue
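The starvation concern raised above can be seen in a toy single-server simulation. This is only a sketch under strong assumptions: one reviewer, no preemption, a hypothetical workload of one large patch plus a steady stream of small ones, and service times known in advance. None of the names or numbers come from Gerrit.

```python
import heapq


def simulate(jobs, policy):
    """Single-server, non-preemptive queue simulation.

    jobs:   list of (arrival_time, service_time, name)
    policy: "fifo" (earliest arrival first) or "sjf" (shortest job first)
    Returns a dict mapping job name to waiting time before service starts.
    """
    pending = sorted(jobs)   # ordered by arrival time
    ready = []               # heap of (key, arrival, service, name)
    waits = {}
    clock = 0.0
    i = 0
    while i < len(pending) or ready:
        # Admit every job that has arrived by the current time.
        while i < len(pending) and pending[i][0] <= clock:
            arrival, service, name = pending[i]
            key = arrival if policy == "fifo" else service
            heapq.heappush(ready, (key, arrival, service, name))
            i += 1
        if not ready:
            clock = pending[i][0]   # server idles until the next arrival
            continue
        _, arrival, service, name = heapq.heappop(ready)
        waits[name] = clock - arrival
        clock += service
    return waits


# Hypothetical workload: one 8-unit patch at t=0, plus a 1-unit patch
# arriving at every time step from t=0 to t=9.
jobs = [(0, 8, "big")] + [(t, 1, "s%d" % t) for t in range(10)]

fifo = simulate(jobs, "fifo")
sjf = simulate(jobs, "sjf")

mean_fifo = sum(fifo.values()) / len(fifo)
mean_sjf = sum(sjf.values()) / len(sjf)

print(fifo["big"], sjf["big"])   # 1.0 10.0
print(mean_sjf < mean_fifo)      # True
```

Shortest-job-first lowers the mean wait here, but the large patch is not served until the stream of small ones stops; with an unbounded stream it would never be served.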
>>> This isn't crazy at all. We do have a problem that surely could be
>>> studied and solved/improved by applying queueing theory or lessons from
>>> fields like lean manufacturing. Right now, we're simply applying our
>>> intuition, and the little I've read about these sorts of problems suggests
>>> that intuition can easily take you down the wrong path.
>>> There's a bunch of things that occur just glancing through those articles:
>>> - Do we have an unstable system? Would it be useful to have arrival
>>> and exit rate metrics to help highlight this? Over what time period
>>> would those rates need to be averaged to be useful? Daily, weekly,
>>> monthly, an entire release cycle?
>>> - What are we trying to optimize for? The length of time in the
>>> queue? The number of patches waiting in the queue? The response
>>> time to a new patch revision?
>>> - We have a single queue, with a bunch of service nodes with a wide
>>> variance between their service rates, very little in the way of
>>> scheduling policy, a huge rate of service nodes sending jobs back
>>> for rework, a cost associated with maintaining a job while it sits
>>> in the queue, the tendency for some jobs to disrupt many other jobs
>>> with merge conflicts ... not simple.
>>> - Is there any sort of natural limit in our queue size that makes the
>>> system stable - e.g. do people naturally just stop submitting
>>> patches at some point?
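The arrival and exit rate question above could be explored with something like the following sketch. It assumes we already have, for each patch, the day it was opened and the day it was merged or abandoned, expressed as day numbers from the start of some period. The function name and the sample data are hypothetical; nothing here queries Gerrit.

```python
from collections import Counter


def windowed_counts(opened_days, closed_days, window):
    """Count patches opened and closed in each window of `window` days.

    Returns a list of (window_index, opened_count, closed_count).
    """
    if not opened_days and not closed_days:
        return []
    opened = Counter(day // window for day in opened_days)
    closed = Counter(day // window for day in closed_days)
    last = max(list(opened) + list(closed))
    return [(w, opened[w], closed[w]) for w in range(last + 1)]


# Made-up sample: day numbers on which patches entered and left the queue.
opened_days = [0, 1, 1, 2, 3, 5, 6, 7, 8, 8, 9, 12, 13]
closed_days = [2, 4, 6, 9, 10, 13]

weekly = windowed_counts(opened_days, closed_days, window=7)
for index, n_opened, n_closed in weekly:
    arrival_rate = n_opened / 7   # patches per day entering the queue
    exit_rate = n_closed / 7      # patches per day leaving the queue
    growth = n_opened - n_closed  # positive: the backlog grew this week
    print(index, round(arrival_rate, 2), round(exit_rate, 2), growth)
```

Running the same function with different `window` values (1, 7, 30) is one cheap way to see which averaging period gives a usable signal rather than noise.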
>>> My intuition on all of this lately is that we need some way to model and
>>> experiment with this queue, and I think we could make some interesting
>>> progress if we could turn it into a queueing network rather than a
>>> single, extremely complex queue.
>>> Say we had a front-end for gerrit which tracked which queue a patch is
>>> in, we could experiment with things like:
>>> - a triage queue, with non-cores signed up as triagers looking for
>>> obvious mistakes and choosing the next queue for a patch to enter
>>> - queues having a small number of cores signed up as owners - e.g.
>>> high priority bugfix, API, scheduler, object conversion, libvirt
>>> driver, vmware driver, etc.
>>> - we'd allow for a large number of queues so that cores could aim for
>>> an "inbox zero" approach on individual queues, something that would
>>> probably help keep cores motivated.
>>> - we could apply different scheduling policies to each of the
>>> different queues - i.e. explicit guidance for cores about which
>>> patches they should pick off the queue next.
>>> - we could track metrics on individual queues as well as the whole
>>> network, identifying bottlenecks and properly recognizing which
>>> reviewers are doing a small number of difficult reviews versus
>>> those doing a high number of trivial reviews.
>>> - we could require some queues to feed into a final approval queue
>>> where some people are responsible for giving an approved patch a
>>> final sanity check - i.e. there would be a class of reviewer with
>>> good instincts who quickly churn through already-reviewed patches
>>> looking for the kind of mistakes people tend to make when
>>> they're down in the weeds.
>>> - explicit queues for large, cross-cutting changes like coding style
>>> changes. Perhaps we could stop servicing these queues at certain
>>> points in the cycles, or reduce the rate at which they are serviced.
>>> - we could include specs and client patches in the same network so
>>> that they are prioritized in the same way.
>>> Lots of ideas, none of it is trivial ... but perhaps it'll spark
>>> someone's interest :)
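To make the queueing-network idea above a little more concrete, here is a minimal sketch of a triage step feeding several queues, each with its own scheduling policy and a simple depth metric. Every name in it (`Patch`, `ReviewQueue`, `triage`, the queue names, the change IDs) is hypothetical, and routing by topic is just a stand-in for a human triager's decision. It is not a design for the proposed Gerrit front-end.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Patch:
    change_id: str
    topic: str
    lines_changed: int
    age_days: int


@dataclass
class ReviewQueue:
    name: str
    # Scheduling policy: the patch with the smallest key is reviewed next.
    priority_key: Callable[[Patch], int]
    patches: List[Patch] = field(default_factory=list)

    def add(self, patch: Patch) -> None:
        self.patches.append(patch)

    def next_patch(self) -> Optional[Patch]:
        if not self.patches:
            return None
        chosen = min(self.patches, key=self.priority_key)
        self.patches.remove(chosen)
        return chosen


def oldest_first(patch: Patch) -> int:
    return -patch.age_days


def smallest_first(patch: Patch) -> int:
    return patch.lines_changed


def triage(patch, queues, default="general"):
    """Route a patch to the queue matching its topic, else the default."""
    target = queues.get(patch.topic, queues[default])
    target.add(patch)
    return target.name


queues = {
    "libvirt": ReviewQueue("libvirt", oldest_first),
    "api": ReviewQueue("api", smallest_first),
    "general": ReviewQueue("general", oldest_first),
}

for p in [
    Patch("I1", "libvirt", 400, 3),
    Patch("I2", "libvirt", 10, 12),
    Patch("I3", "api", 250, 1),
    Patch("I4", "api", 5, 2),
    Patch("I5", "docs", 30, 7),
]:
    triage(p, queues)

# Per-queue metric: how many patches are waiting in each queue.
depths = {name: len(q.patches) for name, q in queues.items()}
next_libvirt = queues["libvirt"].next_patch()   # oldest patch in that queue
next_api = queues["api"].next_patch()           # smallest patch in that queue
print(depths, next_libvirt.change_id, next_api.change_id)
```

Keeping the policy as a plain key function per queue means different queues can be given different guidance without changing the queue code itself.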
>> This is all good stuff, but by the sounds of it experimenting in gerrit
>> isn't likely to be simple.
> Right, which is why I said "a front-end for gerrit" ... this would sit
> in front of gerrit, at least to begin with.
>> Remember, though, that the relevant metric is code quality, not review
>> rate.
> Not sure what conclusion you're driving towards there - obviously that's
> true, but ... ignore review rate? measure code quality how?
I mean that there are other ways to improve code quality than review. If
we find ourselves putting a lot of effort into this, we may be better
served by looking for low hanging fruit elsewhere.
Red Hat Engineering, Virtualisation Team
Phone: +442070094448 (UK)
GPG ID: D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490