[openstack-dev] [nova] minimum review period for functional changes that break backwards compatibility
sean at dague.net
Thu Jan 2 00:16:58 UTC 2014
On 01/01/2014 04:30 PM, Tim Bell wrote:
>> - Changes in default behaviour: Always likely to affect existing systems in some way. Maybe we should have an additional type of review vote that comes from people who are recognised as representing large production deployments?
> This is my biggest worry... there are changes which may be technically valid but have a significant disruptive impact on those people who are running clouds. Asking the people who are running production OpenStack clouds to review every patch to understand the risks and assess the migration impact is asking a lot.
> What processes are in place to ensure that a technically correct patch does not have major impact on those of us who are running large scale OpenStack clouds in production? DocImpact tagging is not sufficient, as this implies the change is the right thing to do but needs procedures defined.
So I think this comes down to Humans vs. Machines. Which is why I keep
arguing that I don't think policy changes will help. This is my
synthesis of a few things, and a lot of time observing our process here.
Humans are limited in how big a code chunk they can effectively review
(basically once a patch is over 200 lines, error detection rates start
dropping), and in how much time they can spend on code review in a day
(eventually you go code blind, and sometimes you don't realize it,
which is the super dangerous point). When humans screw up on code
review (and they figure out they screwed up), their instinct is to do
less review, because no one wants to be doing a crappy job here (no
footnote, just something I've observed).
Adding policy complexity also reduces that review bandwidth, as more
cognitive load, of any type, reduces your decision-making ability.
This is actually the core rationale behind the hacking project: it
reduces cognitive load by taking a whole class of things off the table
and making them a machine decision. The fact that I needed to
highlight to folks on the list, and on IRC, multiple times, that
approving changes with month-old test results isn't a good idea (and
has gate impacts) demonstrates how loaded down reviewers already are
just realizing the impact of a patch, so I don't think we can win
there.
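For what it's worth, hacking's checks are implemented as flake8
extensions, so "making it a machine decision" looks roughly like the
sketch below. The H999 code and the little driver loop at the bottom
are made up for illustration (I believe the real bare-except rule in
hacking is H201); this is just the shape of such a check, not
hacking's actual code.

    import re

    # Hypothetical check in the style of a hacking/flake8 plugin: flag
    # bare "except:" clauses so reviewers never argue about them again.
    BARE_EXCEPT = re.compile(r"^\s*except\s*:")


    def check_bare_except(logical_line):
        """H999 (made-up code): bare except clauses hide errors."""
        if BARE_EXCEPT.match(logical_line):
            yield 0, "H999: do not use a bare 'except:'"


    # Tiny stand-in for the flake8 driver, just to show the check firing.
    if __name__ == "__main__":
        sample = [
            "try:",
            "    do_something()",
            "except:",
            "    pass",
        ]
        for num, line in enumerate(sample, 1):
            for offset, message in check_bare_except(line):
                print("line %d, col %d: %s" % (num, offset + 1, message))

Once something like that is in the gate, no reviewer ever spends
another brain cycle on it.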
Machines... we're very good at scaling there, running all our tests
in clouds, and the like.
Let's take an example pain point: code that was working on Grizzly
doesn't start when you go to Havana, because there was an incompatible
config file change. That's terrible. So we built a system called
Grenade. We use the Grizzly config files with Havana code levels, and
it has to pass the Tempest smoke tests. Does it catch everything? Not
at all. But I've watched it legitimately block a dozen changes so far
in Icehouse that were not getting caught in reviews, because it wasn't
until people came looking for me that the authors realized they were
actually breaking an upgrade path. So it is a progressive improvement.
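To make that concrete, here's a rough Python sketch of the idea.
Grenade itself is a set of shell scripts built around devstack; every
path and command below is an illustrative placeholder, not what
Grenade actually runs.

    import shutil
    import subprocess

    OLD_RELEASE_CONF = "/opt/stack/old/etc/nova/nova.conf"  # e.g. Grizzly
    NEW_RELEASE_CONF = "/etc/nova/nova.conf"                # e.g. Havana


    def check_upgrade_path():
        # 1. Carry the previous release's config forward unchanged.
        shutil.copy(OLD_RELEASE_CONF, NEW_RELEASE_CONF)

        # 2. Start the services at the new release's code level.
        subprocess.check_call(["./restart-services.sh"])    # placeholder

        # 3. Gate on the Tempest smoke tests passing. An incompatible
        #    config change now fails in CI instead of failing in
        #    someone's production upgrade.
        subprocess.check_call(["./run-tempest-smoke.sh"])   # placeholder


    if __name__ == "__main__":
        check_upgrade_path()

The whole trick is step 1: never regenerate the config, carry the old
one forward and see if the new code still stands up.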
So... coming back to your question, I think we need to figure out what
the biggest pain points are for large-scale OpenStack clouds, then
encode them back into validation in our pipeline somehow.
Michael Still recently added "turbo hipster" third-party testing,
which ensures new code doesn't dramatically increase the time it takes
to run the database migrations on large data sets. That's going after
one of these issues, and it is a really great addition to our toolset.
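The assertion it exists to make is roughly the one below. This is not
turbo hipster's actual implementation; nova-manage db sync is the real
migration command, but the pre-loaded dataset and the time budget are
arbitrary examples.

    import subprocess
    import time

    MAX_MIGRATION_SECONDS = 600  # example budget for a large dataset


    def time_migrations():
        """Run the nova database migrations against a pre-loaded large
        dataset and fail the job if they take too long."""
        start = time.time()
        subprocess.check_call(["nova-manage", "db", "sync"])
        elapsed = time.time() - start
        if elapsed > MAX_MIGRATION_SECONDS:
            raise SystemExit(
                "migrations took %.0fs, budget is %ds; this change "
                "would make upgrades painful on large deployments"
                % (elapsed, MAX_MIGRATION_SECONDS))
        print("migrations completed in %.0fs" % elapsed)


    if __name__ == "__main__":
        time_migrations()

Again, the value is that the check runs on every patch, forever,
without consuming reviewer attention.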
For others, we should figure out the top one or two problems that we
can turn into gate automation each cycle (my experience is that
encoding the logic for each of these is going to take the better part
of a cycle). The real advantage here is that these are gifts that keep
on giving.
Everyone is trying to be cognizant here of the impact on users. However,
we all are just human.
Samsung Research America
sean at dague.net / sean.dague at samsung.com