[openstack-dev] [nova] Averting the Nova crisis by splitting out virt drivers
Vishvananda Ishaya
vishvananda at gmail.com
Wed Sep 10 19:14:24 UTC 2014
On Sep 4, 2014, at 3:24 AM, Daniel P. Berrange <berrange at redhat.com> wrote:
> Position statement
> ==================
>
> Over the past year I've increasingly come to the conclusion that
> Nova is heading for (or probably already at) a major crisis. If
> steps are not taken to avert this, the project is likely to lose
> a non-trivial amount of talent, both regular code contributors and
> core team members. That includes myself. This is not good for
> Nova's long term health and so should be of concern to anyone
> involved in Nova and OpenStack.
>
> For those who don't want to read the whole mail, the executive
> summary is that the nova-core team is an unfixable bottleneck
> in our development process with our current project structure.
> The only way I see to remove the bottleneck is to split the virt
> drivers out of tree and let them all have their own core teams
> in their area of code, leaving current nova core to focus on
> all the common code outside the virt driver impls. I nonetheless
> urge people to read the whole mail.
I am highly in favor of this approach (and have been for at
least a year). Every time we have brought this up in the past
there has been concern about the shared code, but we have to
make a change. We have tried various other approaches and none
of them have made a dent.
+1000
Vish
>
>
> Background information
> ======================
>
> I see many factors coming together to form the crisis
>
> - Burn out of core team members from over work
> - Difficulty bringing new talent into the core team
> - Long delay in getting code reviewed & merged
> - Marginalization of code areas which aren't popular
> - Increasing size of nova code through new drivers
> - Exclusion of developers without corporate backing
>
> Each item on its own may not seem too bad, but combined they
> add up to a big problem.
>
> Core team burn out
> ------------------
>
> Having been involved in Nova for several dev cycles now, it is clear
> that the backlog of code up for review never goes away. Even
> intensive code review efforts at various points in the dev cycle
> make only a small impact on the backlog. This has a pretty
> significant impact on core team members, as their work is never
> done. At best, the dial is sometimes set to 10, instead of 11.
>
> Many people, myself included, have built tools to help deal with
> the reviews in a more efficient manner than plain gerrit allows
> for. These certainly help, but they can't ever solve the problem
> on their own - just make it slightly more bearable. And this is
> not even considering that core team members might have useful
> contributions to make in ways beyond just code review. Ultimately
> the workload is just too high to sustain the levels of review
> required, so core team members will eventually burn out (as they
> have done many times already).
>
> Even if one person attempts to take the initiative to heavily
> invest in review of certain features it is often to no avail.
> Unless a second dedicated core reviewer can be found to 'tag
> team' it is hard for one person to make a difference. The end
> result is that a patch is +2d and then sits idle for weeks or
> more until a merge conflict requires it to be reposted at which
> point even that one +2 is lost. This is a pretty demotivating
> outcome for both reviewers & the patch contributor.
>
>
> New core team talent
> --------------------
>
> It can't escape attention that the Nova core team does not grow
> in size very often. When Nova was younger and its code base was
> smaller, it was easier for contributors to get onto core because
> the base level of knowledge required was that much smaller. To
> get onto core today requires a major investment in learning Nova
> over a year or more. Even people who potentially have the latent
> skills may not have the time available to invest in learning the
> entirety of Nova.
>
> With the number of reviews proposed to Nova, the core team should
> probably be at least double its current size[1]. There is plenty of
> expertise in the project as a whole but it is typically focused
> into specific areas of the codebase. There is nowhere we can find
> 20 more people with broad knowledge of the codebase who could be
> promoted even over the next year, let alone today. This is ignoring
> that many existing members of core are relatively inactive due to
> burnout and so need replacing. That means we really need another
> 25-30 people for core. That's not going to happen.
>
>
> Code review delays
> ------------------
>
> The obvious result of having too much work for too few reviewers
> is that code contributors face major delays in getting their work
> reviewed and merged. From personal experience, during Juno, I've
> probably spent 1 week in aggregate on actual code development vs
> 8 weeks on waiting on code review. You have to constantly be on
> alert for review comments because unless you can respond quickly
> (and repost) while you still have the attention of the reviewer,
> they may not look again for days/weeks.
>
> The length of time to get work merged serves as a demotivator to
> actually do work in the first place. I've personally avoided doing
> a lot of code refactoring & cleanup work that would improve the
> maintainability of the libvirt driver in the long term, because
> I can't face the battle to get it reviewed & merged. Other people
> have told me much the same. It is not uncommon to see changes that
> have been pending for 2 dev cycles, not because the code was bad
> but because they couldn't get people to review it. Contributors
> will simply walk away from nova if that happens too often.
>
> Even when fate is on your side and code is reviewed, the chances
> of it getting a success result from the CI systems first time
> around are slim due to false failures. This really compounds the
> already poor experience of submitting code to Nova.
>
>
> Marginalization of areas
> ------------------------
>
> Since the core team has far more work to do than it can manage, it
> has to prioritize what it looks at. The core team figures out what
> the overall project priorities are and will focus more effort in
> to those areas. Individual members will also focus their attention
> in areas where they have personal interest. Unfortunately the core
> team is not representative of the entire Nova codebase. The
> inevitable result is that the HyperV and VMWare drivers can often
> lose out in the battle for attention. In the past we've said that
> it is the responsibility of people in those teams to invest in
> learning the entirety of Nova so that they have the knowledge required
> to be promoted to core. I used to support that approach, but now
> consider it to be flawed due to the increased difficulty of *anyone*
> getting onto core. The time investment required is simply too great
> to expect people to undertake it. The marginalized areas have no
> freedom to self-organize to solve their own problems because they
> are forever dependent on the core team bottleneck.
>
>
> Increasing size
> ---------------
>
> There is a long standing policy that the Nova virt driver API is
> considered unstable and thus all virt driver implementations should
> ultimately be part of the Nova codebase. In Juno it is likely that
> the Ironic driver will be merged into Nova. In a future release we
> may yet see the Docker driver return to the Nova tree.
>
> The result of merging yet more drivers is that there will be yet
> more work for nova reviewers to do. It is far from obvious that
> merging new drivers will be accompanied by new members on the core
> team. So it is likely that the workload is going to get worse over
> future releases.
>
> Splitting out the scheduler will be beneficial in reducing the
> review backlog, but probably not enough to counter the growth from
> virt drivers. Killing off nova-network is unlikely to help at all,
> since that consumes little-to-no review time currently [2].
>
>
> Exclusion of non-corporate devs
> -------------------------------
>
> There is a strong push from nova core for everything that is merged
> into Nova to be accompanied by CI testing. This certainly makes sense
> from the POV of overall product quality and reducing the burden on
> the core reviewers to catch all mistakes through code review. What
> we don't take into account is that setting up and maintaining such
> testing infrastructure requires a major investment in terms of both
> hardware costs and man power. It has already been seen that this is
> too much to bear for some companies who contribute to Nova, eg with
> the Docker driver [3]. Developers who are not affiliated with any
> company do not stand any realistic chance of meeting the CI testing
> needs unless they're lucky that their feature can be covered by an
> existing running CI system. This looks like it could effectively
> prevent support for a community submitted FreeBSD BHyve driver from
> being merged, no matter how useful it might be to users who want it.
> NB, now a FreeBSD BHyve driver would probably be done as part of the
> libvirt driver, which complicates this particular point I'm trying
> to make, since I don't suggest reducing testing of the libvirt driver
> compared to what it has today.
>
> I don't want to get into a detailed testing discussion here really,
> since that's somewhat of a tangent to the question of our dev and
> review process. I am, however, concerned when our testing policy
> forces maintainers of some virt drivers into the position of being
> treated as second class citizens within the project as a whole, with
> a different development structure to the in-tree approved drivers.
> That said, Docker probably benefits from being out of tree, since it
> thus avoids the painful nova core bottleneck entirely.
>
>
> Problem summary
> ---------------
>
> The common thread through most of these problems is that the nova
> core team is a massive bottleneck in the development process.
> Processes adopted (or under discussion) by the core team are
> fundamentally not helping to remove the bottleneck. Rather they are
> introducing new layers of bureaucracy so that we can feel justified
> in telling contributors that we are going to ignore or reject their
> work. At best this is going to result in far less useful work taking
> place in Nova. At worst this is further reducing the ability of
> people to self organize to solve the problems, will cause our
> contributors to leave the community and possibly even force some virt
> drivers to go out of tree to get their work done. Death by a thousand
> cuts.
>
> A sub-thread is around the idea that our current structure of one big
> repo also has other negative consequences for drivers who may not be
> able to meet the same high standards as the rest of the drivers. A
> driver is either in or out of the club, and if it's out of the club
> life is made comparatively harder for its developers & users. By all
> means have rules around the requirements for a release to use the
> openstack trademarks based on CI testing coverage, but don't let that
> penalize the actual development process itself.
>
> Overall Nova is being increasingly hostile to its community of
> contributors. I don't mean this as a result of any sense of malice
> or ill-will. What we're seeing is merely a symptom of a hard worked
> team struggling to survive with a burden they can no longer be
> reasonably expected to cope with. Nova core has done an amazing job
> at surviving for so long as the project grew much larger & more
> quickly than anyone probably expected. The time has come for some
> radical changes to let nova adapt & evolve to the next level.
>
> This is a crisis. A large crisis. In fact, if you've got a moment, it's
> a twelve-storey crisis with a magnificent entrance hall, carpeting
> throughout, 24-hour portage, and an enormous sign on the roof,
> saying 'This Is a Large Crisis'. A large crisis requires a large
> plan.
>
>
> Proposal / solution
> ===================
>
> In the past Nova has spun out its volume layer to form the cinder
> project. The Neutron project started as an attempt to solve the
> networking space, and ultimately replace the nova-network. It
> is likely that the scheduler will be spun out to a separate project.
>
> Now Neutron itself has grown so large and successful that it is
> considering going one step further and spinning its actual drivers
> out of tree into standalone add-on projects [4]. I've heard on the
> grapevine that Ironic is considering similar steps for hardware
> drivers.
>
> The radical (?) solution to the nova core team bottleneck is thus to
> follow this lead and split the nova virt drivers out into separate
> projects and delegate their maintenance to new dedicated teams.
>
> - Nova becomes the home for the public APIs, RPC system, database
> persistence and the glue that ties all this together with the
> virt driver API.
>
> - Each virt driver project gets its own core team and is responsible
> for dealing with review, merge & release of their codebase.
>
> Note, I really do mean *all* virt drivers should be separate. I do
> not want to see some virt drivers split out and others remain in tree
> because I feel that signifies that the out of tree ones are second
> class citizens. It is important to set up our dev structure so that
> every virt driver is treated equally & so has equal chance to achieve
> success. As long as one driver remains in tree there will always be
> pressure for others to join it, which is exactly what we're trying
> to get away from here. By everyone being out of tree, drivers (like
> Docker) can take a decision about whether it is the right time for
> them to be investing in gating CI systems, without being penalized
> in their dev process if they make a decision to not have gate tests
> right now.
>
> This has quite a few implications for the way development would
> operate.
>
> - The Nova core team at least, would be voluntarily giving up a big
> amount of responsibility over the evolution of virt drivers. Due
> to human nature, people are not good at giving up power, so this
> may be painful to swallow. Realistically current nova core are
> not experts in most of the virt drivers to start with, and more
> importantly we clearly do not have sufficient time to do a good job
> of review with everything submitted. Much of the current need
> for core review of virt drivers is to prevent the mis-use of a
> poorly defined virt driver API...which can be mitigated - See
> later point(s)
>
> - Nova core would/should not have automatic +2 over the virt driver
> repositories since it is unreasonable to assume they have the
> suitable domain knowledge for all virt drivers out there. People
> would of course be able to be members of multiple core teams. For
> example John G would naturally be nova-core and nova-xen-core. I
> would aim for nova-core and nova-libvirt-core, and so on. I do not
> want any +2 responsibility over VMWare/HyperV/Docker drivers since
> they're not my area of expertise - I only look at them today because
> they have no other nova-core representation.
>
> - Not sure if it implies the Nova PTL would be solely focused on
> Nova common. eg would there continue to be one PTL over all virt
> driver implementation projects, or would each project have its
> own PTL. Maybe this is irrelevant if a Czars approach is chosen
> by virt driver projects for their work. I'd be inclined to say
> that a single PTL should stay as a figurehead to represent all
> the virt driver projects, acting as a point of contact to ensure
> we keep communication / co-operation between the drivers in sync.
>
> - A fairly significant amount of nova code would need to be
> considered semi-stable API. Certainly everything under nova/virt
> and any object which is passed in/out of the virt driver API.
> Changes to such APIs would have to be done in a backwards
> compatible manner, since it is no longer possible to lock-step
> change all the virt driver impls. In some ways I think this would
> be a good thing as it will encourage people to put more thought
> into the long term maintainability of nova internal code instead
> of relying on being able to rip it apart later, at will.
>
> - The nova/virt/driver.py class would need to be much better
> specified. All parameters / return values which are opaque dicts
> must be replaced with objects + attributes. Completion of the
> objectification work is mandatory, so there is cleaner separation
> between virt driver impls & the rest of Nova.
>
> - If changes are required to common code, the virt driver developer
> would first have to get the necessary pieces merged into Nova
> common. Then the follow up virt driver specific changes could be
> proposed to their repo. This implies that some changes to virt
> drivers will still contend for resource in the common nova repo
> and team. This contention should be lower than it is today though
> since the current nova core team should have less code to look
> after per-person on aggregate.
>
> - Changes submitted to nova common code would trigger running of CI
> tests against the external virt drivers. Each virt driver core team
> would decide whether they want their driver to be tested upon Nova
> common changes. Expect that all would choose to be included to the
> same extent that they are today. So level of validation of nova code
> would remain at least at current level. I don't want to reduce the
> amount of code testing here since that's contrary to the direction
> we're taking wrt testing.
>
> - Changes submitted to virt drivers would trigger running CI tests
> that are applicable. eg changes to libvirt driver repo would not
> involve running database migration tests, since all database code
> is isolated in nova. libvirt changes would not trigger vmware,
> xenserver, ironic, etc CI systems. Virt driver changes should
> see fewer false positives in the tests as a result, and those
> that do occur should be more explicitly related to the code being
> proposed. eg a change to vmware is not going to trigger a tempest
> run that uses libvirt, so non-deterministic failures in libvirt
> will no longer plague vmware developers reviews. This would also
> make it possible for VMWare CI to be made gating for changes to
> the VMWare virt driver repository, without negatively impacting
> other virt drivers. So this change should increase testing quality
> for non-libvirt virt drivers and reduce pain of false failures
> for everyone.
>
> - Virt drivers shouldn't use oslo incubator code from nova, since
> that can be replaced any time and isn't upgrade safe. Ideally most
> of the incubator stuff virt drivers need should turn into stable
> oslo APIs. Failing that, virt drivers would need their own copy
> of the incubated code in their module namespace, to avoid clash
> or the need to lock-step upgrade code across separate git repos.
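
As an illustration of the "objects + attributes" and "backwards compatible" points in the list above, here is a minimal sketch. It is not taken from Nova: HostStats, BaseVirtDriver, get_host_stats and numa_nodes are hypothetical names chosen for the example, and a plain dataclass stands in for whatever object model Nova would actually use. The idea shown is that a typed object makes the driver contract explicit, and that a new field added with a default does not break out-of-tree drivers written against the older definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HostStats:
    """Hypothetical typed replacement for an opaque stats dict."""
    vcpus: int
    memory_mb: int
    # Field added in a later revision. The default means drivers written
    # against the older definition keep working unchanged.
    numa_nodes: Optional[int] = None


class BaseVirtDriver:
    """Hypothetical semi-stable driver interface (not Nova's real class)."""

    def get_host_stats(self) -> HostStats:
        raise NotImplementedError


class OldDriver(BaseVirtDriver):
    """Out-of-tree driver that predates the numa_nodes field."""

    def get_host_stats(self) -> HostStats:
        return HostStats(vcpus=8, memory_mb=16384)


class NewDriver(BaseVirtDriver):
    """Driver updated to report the newer field."""

    def get_host_stats(self) -> HostStats:
        return HostStats(vcpus=8, memory_mb=16384, numa_nodes=2)


def describe(stats: HostStats) -> str:
    """Common-code consumer that tolerates both old and new drivers."""
    numa = "unknown" if stats.numa_nodes is None else str(stats.numa_nodes)
    return f"{stats.vcpus} vcpus, {stats.memory_mb} MB, numa={numa}"
```

With a dict, a misspelled key in a driver would only surface at runtime somewhere far from the driver (or silently, if the consumer uses .get()). With the object, a misspelled attribute fails immediately with AttributeError, and the class definition itself documents what common code may rely on.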
>
> Overall the outcome is that
>
> - Far larger pool of people able to approve changes for merge
> across nova core and the virt driver core teams.
>
> - Faster review & merge for virt driver patches that don't involve
> changes to common nova code, with less CI system testing pain.
>
> - Ability to set priority of work in virt drivers without a 3rd
> party being a bottleneck, where the work doesn't involve changes
> to common nova code.
>
> - Each virt driver team can accept as many features as they feel
> able to deal with, without it negatively impacting amount of
> features that other virt driver teams can accept.
>
> - Virt drivers have flexibility to set their own policies on testing
> without being penalized in the way they then develop their code.
>
>
> The migration
> -------------
>
> Obviously a proposal such as this is a pretty major undertaking. It
> should be clear that it could not be done in a short amount of time.
> It is suggested that it be phased in over two dev cycles. In the Kilo
> release the focus would be on prep work:
>
> - Formalizing the separation between the virt driver impls and the
> rest of the nova codebase. Figure out exactly which areas of
> Nova internal code will need to be marked as 'semi-stable' for
> use by virt drivers, and ensure their APIs are sufficiently
> future proof.
>
> - Discussions with the infrastructure, docs, release, etc teams to
> identify impacts on them and do any required prep work.
>
> - Identify the teams which will lead the new virt driver projects.
> eg core reviewers, PTL or Czars for each job if applicable
>
> - Probably more things I can't think of right now
>
> Then at the start of the Lxxxx release, the virt drivers would
> actually be split out into separate git repos and start their dev
> process for the future. So for bulk of Lxxxx the drivers would be
> on their own. The two Lxxxx rc milestones would allow us to ensure
> our release processes were working well with the split drivers
> before the Lxxxx final release.
>
>
> Final thought
> -------------
>
> Overall consider this a vote of no confidence in nova continuing to
> operate as it does today. As mentioned above this is not intended to
> be disrespectful to the effort every nova core member has put in, just
> a reflection on the changed environment we find ourselves in. Fiddling
> with our processes for the prioritization of work cannot fix the
> fundamental fact that nova core today is a massive single point of
> failure & bottleneck, increasingly crippling the project. The only way
> to address this is by a radical re-organization of our project to
> remove the bottlenecks by modularization of the project & leaders.
> Keeping a single team and adding more/changing process is simply akin
> to shifting deckchairs on the Titanic and not a viable option to continue
> with long term.
>
> Now, I'm realistic. Even with every driver separated out, I expect
> that each of them will individually still have more work proposed
> than their respective core teams have time to review. The new structure
> will, however, make it easier for the individual core teams to grow &
> adapt in ways that suit their specific needs. For self-contained virt
> driver changes it will mean that acceptance of work by one team will
> not take away capacity from another team. Further the burden of
> knowledge required to make it onto a virt driver core team would be
> greatly reduced due to the narrower focus of each core team, so we'll
> be able to promote good talent onto virt driver core teams more quickly.
>
> Thanks for reading so far. Now let's make some real change to prepare
> us for future sustainability & even growth.
>
> Regards,
> Daniel
>
> [1] http://lists.openstack.org/pipermail/openstack-dev/2014-August/044459.html
> [2] There was a ban on changes to nova-network for much of the past two
> cycles. It was relaxed primarily to allow full conversion of nova
> codebase to use objects, not for major new feature development.
> [3] http://lists.openstack.org/pipermail/openstack-dev/2014-July/040443.html
> [4] http://lists.openstack.org/pipermail/openstack-dev/2014-August/043036.html
>
> --
> |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
> |: http://libvirt.org -o- http://virt-manager.org :|
> |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
> |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|