[openstack-dev] [nova] Averting the Nova crisis by splitting out virt drivers

Daniel P. Berrange berrange at redhat.com
Thu Sep 4 10:24:29 UTC 2014

Position statement

Over the past year I've increasingly come to the conclusion that
Nova is heading for (or probably already at) a major crisis. If
steps are not taken to avert this, the project is likely to lose
a non-trivial amount of talent, both regular code contributors and
core team members. That includes myself. This is not good for
Nova's long term health and so should be of concern to anyone
involved in Nova and OpenStack.

For those who don't want to read the whole mail, the executive
summary is that the nova-core team is an unfixable bottleneck
in our development process with our current project structure.
The only way I see to remove the bottleneck is to split the virt
drivers out of tree and let them all have their own core teams
in their area of code, leaving current nova core to focus on
all the common code outside the virt driver impls. I nonetheless
urge people to read the whole mail.

Background information

I see many factors coming together to form the crisis

 - Burn out of core team members from over work 
 - Difficulty bringing new talent into the core team
 - Long delay in getting code reviewed & merged
 - Marginalization of code areas which aren't popular
 - Increasing size of nova code through new drivers
 - Exclusion of developers without corporate backing

Each item on its own may not seem too bad, but combined they
add up to a big problem.

Core team burn out

Having been involved in Nova for several dev cycles now, it is clear
that the backlog of code up for review never goes away. Even
intensive code review efforts at various points in the dev cycle
make only a small impact on the backlog. This has a pretty
significant impact on core team members, as their work is never
done. At best, the dial is sometimes set to 10, instead of 11.

Many people, myself included, have built tools to help deal with
the reviews in a more efficient manner than plain gerrit allows
for. These certainly help, but they can't ever solve the problem
on their own - just make it slightly more bearable. And this is
not even considering that core team members might have useful
contributions to make in ways beyond just code review. Ultimately
the workload is just too high to sustain the levels of review
required, so core team members will eventually burn out (as they
have done many times already).

Even if one person attempts to take the initiative to heavily
invest in review of certain features it is often to no avail.
Unless a second dedicated core reviewer can be found to 'tag
team' it is hard for one person to make a difference. The end
result is that a patch is +2d and then sits idle for weeks or
more until a merge conflict requires it to be reposted at which
point even that one +2 is lost. This is a pretty demotivating
outcome for both reviewers & the patch contributor.

New core team talent

It can't escape attention that the Nova core team does not grow
in size very often. When Nova was younger and its code base was
smaller, it was easier for contributors to get onto core because
the base level of knowledge required was that much smaller. To
get onto core today requires a major investment in learning Nova
over a year or more. Even people who potentially have the latent
skills may not have the time available to invest in learning the
whole of Nova.

With the number of reviews proposed to Nova, the core team should
probably be at least double its current size[1]. There is plenty of
expertise in the project as a whole but it is typically focused
into specific areas of the codebase. There is nowhere we can find
20 more people with broad knowledge of the codebase who could be
promoted even over the next year, let alone today. This is ignoring
that many existing members of core are relatively inactive due to
burnout and so need replacing. That means we really need another
25-30 people for core. That's not going to happen.

Code review delays

The obvious result of having too much work for too few reviewers
is that code contributors face major delays in getting their work
reviewed and merged. From personal experience, during Juno, I've
probably spent 1 week in aggregate on actual code development vs
8 weeks on waiting on code review. You have to constantly be on
alert for review comments because unless you can respond quickly
(and repost) while you still have the attention of the reviewer,
they may not look again for days/weeks.

The length of time to get work merged serves as a demotivator to
actually do work in the first place. I've personally avoided doing
a lot of code refactoring & cleanup work that would improve the
maintainability of the libvirt driver in the long term, because
I can't face the battle to get it reviewed & merged. Other people
have told me much the same. It is not uncommon to see changes that
have been pending for 2 dev cycles, not because the code was bad
but because they couldn't get people to review it. Contributors
will simply walk away from nova if that happens too often.

Even when fate is on your side and code is reviewed, the chances
of it getting a success result from the CI systems first time
around are slim due to false failures. This really compounds the
already poor experience of submitting code to Nova.

Marginalization of areas

Since the core team has far more work to do than it can manage, it
has to prioritize what it looks at. The core team figures out what
the overall project priorities are and will focus more effort in
to those areas. Individual members will also focus their attention
in areas where they have personal interest. Unfortunately the core
team is not representative of the entirety of the Nova codebase. The
inevitable result is that the HyperV and VMWare drivers can often
lose out in the battle for attention. In the past we've said that
it is the responsibility of people in those teams to invest in
learning the whole of Nova so that they have the knowledge required
to be promoted to core. I used to support that approach, but now
consider it to be flawed due to the increased difficulty of *anyone*
getting onto core. The time investment required is simply too great
to expect people to undertake it. The marginalized areas have no
freedom to self-organize to solve their own problems because they
are forever dependent on the core team bottleneck.

Increasing size

There is a long standing policy that the Nova virt driver API is
considered unstable and thus all virt driver implementations should
ultimately be part of the Nova codebase. In Juno it is likely that
the Ironic driver will be merged into Nova. In a future release we
may yet see the Docker driver return to the Nova tree.

The result of merging yet more drivers is that there will be yet
more work for nova reviewers to do. It is far from obvious that
merging new drivers will be accompanied by new members on the core
team. So it is likely that the workload is going to get worse over
future releases.

Splitting out the scheduler will be beneficial in reducing the
review backlog, but probably not enough to counter the growth from
virt drivers. Killing off nova-network is unlikely to help at all,
since that consumes little-to-no review time currently [2]. 

Exclusion of non-corporate devs

There is a strong push from nova core for everything that is merged
into Nova to be accompanied by CI testing. This certainly makes sense
from the POV of overall product quality and reducing the burden on
the core reviewers to catch all mistakes through code review. What
we don't take into account is that setting up and maintaining such
testing infrastructure requires a major investment in terms of both
hardware costs and manpower. It has already been seen that this is
too much to bear for some companies who contribute to Nova, eg with
the Docker driver [3]. Developers who are not affiliated with any
company do not stand any realistic chance of meeting the CI testing
needs unless they're lucky that their feature can be covered by an
existing running CI system. This looks like it could effectively
prevent support for a community submitted FreeBSD BHyve driver from
being merged, no matter how useful it might be to users who want it.
NB, now a FreeBSD BHyve driver would probably be done as part of the
libvirt driver, which complicates this particular point I'm trying
to make, since I don't suggest reducing testing of the libvirt driver
compared to what it has today.

I don't want to get into a detailed testing discussion here really,
since that's somewhat of a tangent to the question of our dev and
review process. I am, however, concerned when our testing policy
forces maintainers of some virt drivers into the position of being
treated as second class citizens within the project as a whole, with
a different development structure to the in-tree approved drivers.
That said, Docker probably benefits from being out of tree, since it
thus avoids the painful nova core bottleneck entirely.

Problem summary

The common thread through most of these problems is that the nova
core team is a massive bottleneck in the development process.
Processes adopted (or under discussion) by the core team are
fundamentally not helping to remove the bottleneck. Rather they are
introducing new layers of bureaucracy so that we can feel justified
in telling contributors that we are going to ignore or reject their
work. At best this is going to result in far less useful work taking
place in Nova. At worst this is further reducing the ability of
people to self organize to solve the problems, and will cause our
contributors to leave the community and possibly even force some virt
drivers to go out of tree to get their work done. Death by a thousand
cuts.

A sub-thread is around the idea that our current structure of one big
repo also has other negative consequences for drivers who may not be
able to meet the same high standards as the rest of the drivers. A
driver is either in or out of the club, and if it's out of the club
life is made comparatively harder for its developers & users. By all
means have rules around the requirements for a release to use the
openstack trademarks based on CI testing coverage, but don't let that
penalize the actual development process itself.

Overall Nova is being increasingly hostile to its community of
contributors. I don't mean this as a result of any sense of malice
or ill-will. What we're seeing is merely a symptom of a hard worked
team struggling to survive with a burden they can no longer be
reasonably expected to cope with. Nova core has done an amazing job
at surviving for so long as the project grew much larger & more
quickly than anyone probably expected. The time has come for some
radical changes to let nova adapt & evolve to the next level.

This is a crisis. A large crisis. In fact, if you've got a moment, it's
a twelve-storey crisis with a magnificent entrance hall, carpeting
throughout, 24-hour portage, and an enormous sign on the roof,
saying 'This Is a Large Crisis'. A large crisis requires a large
plan.

Proposal / solution

In the past Nova has spun out its volume layer to form the cinder
project. The Neutron project started as an attempt to solve the
networking space, and ultimately replace nova-network. It
is likely that the scheduler will be spun out to a separate project.

Now Neutron itself has grown so large and successful that it is
considering going one step further and spinning its actual drivers
out of tree into standalone add-on projects [4]. I've heard on the
grapevine that Ironic is considering similar steps for hardware
drivers.

The radical (?) solution to the nova core team bottleneck is thus to
follow this lead and split the nova virt drivers out into separate
projects and delegate their maintenance to new dedicated teams.

 - Nova becomes the home for the public APIs, RPC system, database
   persistence and the glue that ties all this together with the
   virt driver API.

 - Each virt driver project gets its own core team and is responsible
   for dealing with review, merge & release of their codebase.

Note, I really do mean *all* virt drivers should be separate. I do
not want to see some virt drivers split out and others remain in tree
because I feel that signifies that the out of tree ones are second
class citizens. It is important to set up our dev structure so that
every virt driver is treated equally & so has equal chance to achieve
success. As long as one driver remains in tree there will always be
pressure for others to join it, which is exactly what we're trying
to get away from here. By everyone being out of tree, drivers (like
Docker) can take a decision about whether it is the right time for
them to be investing in gating CI systems, without being penalized
in their dev process if they make a decision to not have gate tests
right now.

This has quite a few implications for the way development would
work:

 - The Nova core team at least, would be voluntarily giving up a big
   amount of responsibility over the evolution of virt drivers. Due
   to human nature, people are not good at giving up power, so this
   may be painful to swallow. Realistically current nova core are
   not experts in most of the virt drivers to start with, and more
   importantly we clearly do not have sufficient time to do a good job
   of review with everything submitted. Much of the current need
   for core review of virt drivers is to prevent the mis-use of a
   poorly defined virt driver API...which can be mitigated - See
   later point(s)

 - Nova core would/should not have automatic +2 over the virt driver
   repositories since it is unreasonable to assume they have the
   suitable domain knowledge for all virt drivers out there. People
   would of course be able to be members of multiple core teams. For
   example John G would naturally be nova-core and nova-xen-core. I
   would aim for nova-core and nova-libvirt-core, and so on. I do not
   want any +2 responsibility over VMWare/HyperV/Docker drivers since
   they're not my area of expertise - I only look at them today because
   they have no other nova-core representation.

 - Not sure if it implies the Nova PTL would be solely focused on
   Nova common. eg would there continue to be one PTL over all virt
   driver implementation projects, or would each project have its
   own PTL. Maybe this is irrelevant if a Czars approach is chosen
   by virt driver projects for their work. I'd be inclined to say
   that a single PTL should stay as a figurehead to represent all
   the virt driver projects, acting as a point of contact to ensure
   we keep communication / co-operation between the drivers in sync.

 - A fairly significant amount of nova code would need to be
   considered semi-stable API. Certainly everything under nova/virt
   and any object which is passed in/out of the virt driver API.
   Changes to such APIs would have to be done in a backwards
   compatible manner, since it is no longer possible to lock-step
   change all the virt driver impls. In some ways I think this would
   be a good thing as it will encourage people to put more thought
   into the long term maintainability of nova internal code instead
   of relying on being able to rip it apart later, at will.

 - The nova/virt/driver.py class would need to be much better
   specified. All parameters / return values which are opaque dicts
   must be replaced with objects + attributes. Completion of the
   objectification work is mandatory, so there is cleaner separation
   between virt driver impls & the rest of Nova.

 - If changes are required to common code, the virt driver developer
   would first have to get the necessary pieces merged into Nova
   common. Then the follow up virt driver specific changes could be
   proposed to their repo. This implies that some changes to virt
   drivers will still contend for resource in the common nova repo 
   and team. This contention should be lower than it is today though
   since the current nova core team should have less code to look 
   after per-person in aggregate.

 - Changes submitted to nova common code would trigger running of CI
   tests against the external virt drivers. Each virt driver core team
   would decide whether they want their driver to be tested upon Nova
   common changes. Expect that all would choose to be included to the
   same extent that they are today. So level of validation of nova code
   would remain at least at current level. I don't want to reduce the
   amount of code testing here since that's contrary to the direction
   we're taking wrt testing.

 - Changes submitted to virt drivers would trigger running CI tests
   that are applicable. eg changes to libvirt driver repo would not
   involve running database migration tests, since all database code
   is isolated in nova. libvirt changes would not trigger vmware,
   xenserver, ironic, etc CI systems. Virt driver changes should
   see fewer false positives in the tests as a result, and those
   that do occur should be more explicitly related to the code being
   proposed. eg a change to vmware is not going to trigger a tempest
   run that uses libvirt, so non-deterministic failures in libvirt
   will no longer plague vmware developers reviews. This would also
   make it possible for VMWare CI to be made gating for changes to
   the VMWare virt driver repository, without negatively impacting
   other virt drivers. So this change should increase testing quality
   for non-libvirt virt drivers and reduce pain of false failures
   for everyone.

 - Virt drivers shouldn't use oslo incubator code from nova, since
   that can be replaced any time and isn't upgrade safe. Ideally most
   of the incubator stuff virt drivers need should turn into stable
   oslo APIs. Failing that, virt drivers would need their own copy
   of the incubated code in their module namespace, to avoid clash
   or the need to lock-step upgrade code across separate git repos.
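
To make the points above about a better specified virt driver API
a little more concrete, here is a minimal sketch of the difference
between passing opaque dicts and passing objects with declared
attributes. It is illustrative only and is not Nova's actual object
model: the DiskInfo class, its fields, the attach_disk method and
FakeDriver are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DiskInfo:
    """Hypothetical typed replacement for an opaque dict parameter."""
    path: str
    size_gb: int
    bus: str = "virtio"
    # Imagine this field was added in a later release. Giving it a
    # default keeps existing callers and out-of-tree drivers working,
    # so no lock-step change across repos is needed.
    cache_mode: Optional[str] = None


class VirtDriver:
    """Hypothetical stand-in for the driver base class."""

    def attach_disk(self, instance_name: str, disk: DiskInfo) -> str:
        raise NotImplementedError


class FakeDriver(VirtDriver):
    """Toy driver that only formats what it was asked to attach."""

    def attach_disk(self, instance_name: str, disk: DiskInfo) -> str:
        cache = disk.cache_mode or "default"
        return (f"{instance_name}: {disk.path} "
                f"({disk.size_gb}G, {disk.bus}, cache={cache})")


if __name__ == "__main__":
    print(FakeDriver().attach_disk("vm1", DiskInfo("/dev/vdb", 10)))
```

With a dict, a misspelt key would pass silently into the driver; with
a declared object, a misspelt attribute fails when the object is built,
and new optional fields can be added in a backwards compatible way.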

Overall the outcome is that

 - Far larger pool of people able to approve changes for merge
   across nova core and the virt driver core teams.

 - Faster review & merge for virt driver patches that don't involve
   changes to common nova code, with less CI system testing pain.

 - Ability to set priority of work in virt drivers without a 3rd
   party being a bottleneck, where the work doesn't involve changes
   to common nova code.

 - Each virt driver team can accept as many features as they feel
   able to deal with, without it negatively impacting amount of
   features that other virt driver teams can accept.

 - Virt drivers have flexibility to set their own policies on testing
   without being penalized in the way they then develop their code.

The migration

Obviously a proposal such as this is a pretty major undertaking. It
should be clear that it could not be done in a short amount of time.
It is suggested that it be phased in over two dev cycles. In the Kilo
release the focus would be on prep work:

  - Formalizing the separation between the virt driver impls and the
    rest of the nova codebase. Figure out exactly which areas of 
    Nova internal code will need to be marked as 'semi-stable' for 
    use by virt drivers, and ensure their APIs are sufficiently
    future proof.

  - Discussions with the infrastructure, docs, release, etc teams to
    identify impacts on them and do any required prep work.

  - Identify the teams which will lead the new virt driver projects.
    eg core reviewers, PTL or Czars for each job if applicable

  - Probably more things I can't think of right now

Then at the start of the Lxxxx release, the virt drivers would
actually be split out into separate git repos and start their dev
process for the future. So for bulk of Lxxxx the drivers would be
on their own. The two Lxxxx rc milestones would allow us to ensure
our release processes were working well with the split drivers
before the Lxxxx final release.

Final thought

Overall consider this a vote of no confidence in nova continuing to
operate as it does today. As mentioned above this is not intended to
be disrespectful to the effort every nova core member has put in, just
a reflection on the changed environment we find ourselves in. Fiddling
with our processes for the prioritization of work cannot fix the
fundamental fact that nova core today is a massive single point of
failure & bottleneck, increasingly crippling the project. The only way
to address this is by a radical re-organization of our project to
remove the bottlenecks by modularization of the project & leaders.
Keeping a single team and adding more/changing process is simply akin
to shifting deckchairs on the Titanic and not a viable option to continue
with long term.

Now, I'm realistic. Even with every driver separated out, I expect
that each of them will individually still have more work proposed
than their respective core teams have time to review. The new structure
will, however, make it easier for the individual core teams to grow &
adapt in ways that suit their specific needs. For self-contained virt
driver changes it will mean that acceptance of work by one team will
not take away capacity from another team. Further the burden of
knowledge required to make it onto a virt driver core team would be
greatly reduced due to the narrower focus of each core team, so we'll
be able to promote good talent onto virt driver core teams more quickly.

Thanks for reading so far. Now let's make some real change to prepare
us for future sustainability & even growth.


[1] http://lists.openstack.org/pipermail/openstack-dev/2014-August/044459.html
[2] There was a ban on changes to nova-network for much of the past two
    cycles. It was relaxed primarily to allow full conversion of nova
    codebase to use objects, not for major new feature development.
[3] http://lists.openstack.org/pipermail/openstack-dev/2014-July/040443.html
[4] http://lists.openstack.org/pipermail/openstack-dev/2014-August/043036.html

|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|
