[openstack-dev] [TripleO] Review metrics - what do we want to measure?
Jesus M. Gonzalez-Barahona
jgb at bitergia.com
Wed Sep 3 06:51:25 UTC 2014
On Wed, 2014-09-03 at 12:58 +1200, Robert Collins wrote:
> On 14 August 2014 11:03, James Polley <jp at jamezpolley.com> wrote:
> > In recent history, we've been looking each week at stats from
> > http://russellbryant.net/openstack-stats/tripleo-openreviews.html to get a
> > gauge on how our review pipeline is tracking.
> > The main stats we've been tracking have been the "since the last revision
> > without -1 or -2". I've included some history at , but the summary is
> > that our 3rd quartile has slipped from 13 days to 16 days over the last 4
> > weeks or so. Our 1st quartile is fairly steady lately, around 1 day (down
> > from 4 a month ago) and median is unchanged around 7 days.
> > There was lots of discussion in our last meeting about what could be causing
> > this. However, the thing we wanted to bring to the list for the
> > discussion is:
> > Are we tracking the right metric? Should we be looking to something else to
> > tell us how well our pipeline is performing?
> > The meeting logs have quite a few suggestions about ways we could tweak the
> > existing metrics, but if we're measuring the wrong thing that's not going to
> > help.
> > I think that what we are looking for is a metric that lets us know whether
> > the majority of patches are getting feedback quickly. Maybe there's some
> > other metric that would give us a good indication?
> If we review all patches quickly and land none, thats bad too :).
> For the reviewers specifically i think we need a metric(s) that:
> - doesn't go bad when submitters go awol, don't respond etc
> - including when they come back - our stats shouldn't jump hugely
> because an old review was resurrected
> - when good means submitters will be getting feedback
> - flag inventory- things we'd be happy to have landed that haven't
> - including things with a -1 from non-core reviewers (*)
> (*) I often see -1's on things core wouldn't -1 due to the learning
> curve involved in becoming core
> So, as Ben says, I think we need to address the its-not-a-vote issue
> as a priority, that has tripped us up in lots of ways
> I think we need to discount -workflow patches where that was set by
> the submitter, which AFAICT we don't do today.
> Looking at current stats:
> Longest waiting reviews (based on oldest rev without -1 or -2):
> 54 days, 2 hours, 41 minutes https://review.openstack.org/106167
> (Keystone/LDAP integration)
> That patch had a -1 on Aug 16 1:23 AM: but was quickyl turned to +2.
> So this patch had a -1 then after discussion it became a +2. And its
> evolved multiple times.
> What should we be saying here? Clearly its had little review input
> over its life, so I think its sadly accurate.
> I wonder if a big chunk of our sliding quartile is just use not
> reviewing the oldest reviews.
I've been researching review process in OpenStack and other projects for
a while, and my impression is that at least three timing metrics are
(1) Total time from submitting a patch to final closing of the review
process (landing that, or a subsequent patch, or finally abandoning).
This gives an idea of how the whole process is working.
(2) Time from submitting a patch to that patch being approved (+2 in
OpenStack, I guess) or declined (and a new patch is requested). This
gives an idea of how quick reviewers are providing definite feedback to
patch submitters, and is a metric for each patch cycle.
(3) Time from a patch being reviewed, with a new patch being requested,
to a new patch being submitted. This gives an idea of the "reaction
time" of patch submitter.
Usually, you want to keep (1) low, while (2) and (3) give you an idea of
what is happening if (1) gets high.
There is another relevant metric in some cases, which is
(4) The number of patch cycles per review cycle (that is, how many
patches are needed per patch landing in master). In some cases, that may
help to explain how (2) and (3) contribute to (1).
And a fifth metric gives you a "throughput" metric:
(5) BMI (backlog management index), number of new review processes by
number of closed review process for a certain period. It gives an idea
of whether the backlog is going up (>1) or down (<1), and is usually
very interesting when seen over time.
(1) alone is not enough to assess on how well the review process is,
because it could low, but (5) showing an increasing backlog because
simply new review requests come too quickly (eg, in periods when
developers are submitting a lot of patch proposals after a freeze). (1)
could also be high, but (5) show a decrease in the backlog, because for
example reviewers or submitters are overworked or slowly scheduled, but
still the project copes with the backlog. Depending on the relationship
of (1) and (5), maybe you need more reviewers, or reviewers scheduling
their reviews with more priority wrt other actions, or something else.
Note for example that in a project with low BMI (<1) for a long period,
but with a high total delay in reviews (1), usually putting more
reviewers doesn't reduce delay: you have enough of them, what you need
is them to schedule reviews earlier (or maybe the problem is in
submitters being late in addressing requests for new patches, you look
at (2), (3) and (4) for that).
I guess usually you need most of these metrics, if not all, to
understand what is happening in a review process.
/me at Twitter: https://twitter.com/jgbarah
More information about the OpenStack-dev