[openstack-dev] [nova] minimum review period for functional changes that break backwards compatibility

Sean Dague sean at dague.net
Thu Jan 2 00:16:58 UTC 2014


On 01/01/2014 04:30 PM, Tim Bell wrote:
>
>> - Changes in default behaviour:   Always likely to affect existing systems in some way.   Maybe we should have an additional type of review vote that comes from people who are recognised as representing large production deployments?
>
> This is my biggest worry...  there are changes which may be technically valid but have a significant disruptive impact on those people who are running clouds. Asking the people who are running production OpenStack clouds to review every patch to understand the risks and assess the migration impact is asking a lot.
>
> What processes are in place to ensure that a technically correct patch does not have major impact on those of us who are running large scale OpenStack clouds in production ? DocImpact tagging is not sufficient as this implies the change is the right thing to do but needs procedures defined.

So I think this comes down to Humans vs. Machines, which is why I keep 
arguing that I don't think policy changes will help. This is my 
synthesis of a few things, plus a lot of time spent observing our 
process here.

Humans are limited in how big a code chunk they can effectively review 
(basically once a patch is over 200 lines, error detection rates start 
dropping), and in how much time they can spend on code review in a day 
(eventually you go code blind, and sometimes you don't realize it, 
which is the super dangerous point) [1]. When humans screw up on code 
review (and figure out they screwed up), their instinct is to do less 
review, because no one wants to be doing a crappy job here (no footnote, 
just something I've observed).

Adding policy complexity also reduces that review bandwidth, as more 
cognitive load, of any type, reduces your decision-making ability [2]. 
This is actually the core rationale behind the hacking project: it 
reduces cognitive load by taking a whole class of things off the table 
and making them machine decisions. The fact that I needed to point out 
to folks on the list, and on IRC, multiple times, that approving 
changes with month-old test results isn't a good idea (and has gate 
impacts) demonstrates how loaded down reviewers already are just 
realizing the impact of a patch, so I don't think we can win there.
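
To make that concrete, here's a toy version of the kind of check 
hacking turns into a machine decision. This is only a sketch: the 
function name and error code are invented, and the real checks are 
wired in as flake8 extensions rather than run by hand like this.

    import re

    def check_no_bare_except(logical_line):
        # Toy hacking-style check: flag "except:" with no exception
        # class. The name and the X999 code are made up for this
        # example; real checks plug into flake8 with their own numbers.
        if re.match(r"\s*except\s*:", logical_line):
            yield 0, "X999: bare except clause, name the exception you want"

    # Run the check by hand over a couple of sample lines.
    for line in ["try:", "except:", "except ValueError:"]:
        for offset, message in check_no_bare_except(line):
            print("%s -> %s" % (line, message))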

Machines... we're very good at scalability there, running all our tests 
in clouds and the like.

Let's take an example pain point: code that was working on Grizzly 
doesn't start when you go to Havana, because there was an incompatible 
config file change. That's terrible. So we built a system called 
Grenade. We use the Grizzly config files with Havana code levels, and it 
has to pass Tempest smoke tests. Does it catch everything? Not at all. 
But I've watched it legitimately block a dozen changes so far in 
Icehouse that were not getting caught in reviews, because it wasn't 
until people came looking for me that the authors realized they were 
actually breaking an upgrade path. So it is a progressive improvement.
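
For a feel of the narrow config-compatibility slice of that, here's a 
toy sketch. Grenade does this end-to-end by booting the services on the 
old config and running the Tempest smoke tests; the check below only 
illustrates the "option was removed" case, and the option names and 
file path are invented for the example.

    import ConfigParser  # Python 2, as current at the time

    # Options the (hypothetical) Havana-level code still understands in
    # [DEFAULT]. In reality this would come from the code's option
    # registrations, not a hand-maintained list.
    KNOWN_DEFAULT_OPTS = set(['sql_connection', 'my_ip', 'compute_driver'])

    def removed_options(old_conf_path):
        # Parse the Grizzly-era config and report any [DEFAULT] option
        # the new code no longer recognizes.
        parser = ConfigParser.SafeConfigParser()
        parser.read([old_conf_path])
        for opt in parser.defaults():
            if opt not in KNOWN_DEFAULT_OPTS:
                yield opt

    unknown = list(removed_options('/etc/nova/nova.conf.grizzly'))
    if unknown:
        raise SystemExit("old config would break on upgrade, unknown "
                         "options: %s" % ', '.join(unknown))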

So... coming back to your question, I think we need to figure out 
what the biggest pain points for large scale OpenStack clouds are, 
then encode them back into validation in our pipeline somehow. 
Michael Still recently added "turbo hipster" third-party testing, which 
ensures new code doesn't dramatically increase the time it takes to run 
the database migrations on large data sets. That's going after one of 
these issues, and is a really great addition to our toolset.
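
Here's a stripped-down sketch of what that kind of gate check amounts 
to. It isn't how turbo hipster actually works, and the schema, row 
count, and time budget below are all invented.

    import sqlite3
    import time

    ROWS = 500000        # stand-in for a large production data set
    TIME_BUDGET = 60.0   # seconds a migration is allowed to take

    # Build a throwaway database big enough to make slow migrations hurt.
    conn = sqlite3.connect(':memory:')
    conn.execute("CREATE TABLE instances (id INTEGER PRIMARY KEY, host TEXT)")
    conn.executemany("INSERT INTO instances (host) VALUES (?)",
                     [('compute-%d' % (i % 100),) for i in range(ROWS)])
    conn.commit()

    # Pretend this is the schema migration in the patch under review.
    start = time.time()
    conn.execute("ALTER TABLE instances ADD COLUMN node TEXT")
    conn.execute("UPDATE instances SET node = host")
    conn.commit()
    elapsed = time.time() - start

    if elapsed > TIME_BUDGET:
        raise SystemExit("migration took %.1fs, over the %.1fs budget"
                         % (elapsed, TIME_BUDGET))
    print("migration ok in %.1fs" % elapsed)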

For the others, we should figure out the top one or two problems that 
we can turn into gate automation each cycle (my experience is that 
encoding the logic for each of these takes the better part of a cycle). 
The real advantage here is that these are gifts that keep on giving.

Everyone here is trying to be cognizant of the impact on users. However, 
we are all just human.

[1] - 
http://www.ibm.com/developerworks/rational/library/11-proven-practices-for-peer-review/
[2] - 
http://www.jstor.org/discover/10.2307/2489029?uid=3739832&uid=2&uid=4&uid=3739256&sid=21103190092191

	-Sean

-- 
Sean Dague
Samsung Research America
sean at dague.net / sean.dague at samsung.com
http://dague.net


