[openstack-dev] [nova] Averting the Nova crisis by splitting out virt drivers

Daniel P. Berrange berrange at redhat.com
Fri Sep 5 12:10:26 UTC 2014


On Fri, Sep 05, 2014 at 07:49:04AM -0400, Sean Dague wrote:
> On 09/05/2014 07:26 AM, Daniel P. Berrange wrote:
> > On Fri, Sep 05, 2014 at 07:00:44AM -0400, Sean Dague wrote:
> >> On 09/05/2014 06:22 AM, Daniel P. Berrange wrote:
> >>> On Fri, Sep 05, 2014 at 07:31:50PM +0930, Christopher Yeoh wrote:
> >>>> On Thu, 4 Sep 2014 11:24:29 +0100
> >>>> "Daniel P. Berrange" <berrange at redhat.com> wrote:
> >>>>>
> >>>>>  - A fairly significant amount of nova code would need to be
> >>>>>    considered semi-stable API. Certainly everything under nova/virt
> >>>>>    and any object which is passed in/out of the virt driver API.
> >>>>>    Changes to such APIs would have to be done in a backwards
> >>>>>    compatible manner, since it is no longer possible to lock-step
> >>>>>    change all the virt driver impls. In some ways I think this would
> >>>>>    be a good thing as it will encourage people to put more thought
> >>>>>    into the long term maintainability of nova internal code instead
> >>>>>    of relying on being able to rip it apart later, at will.
> >>>>>
> >>>>>  - The nova/virt/driver.py class would need to be much better
> >>>>>    specified. All parameters / return values which are opaque dicts
> >>>>>    must be replaced with objects + attributes. Completion of the
> >>>>>    objectification work is mandatory, so there is cleaner separation
> >>>>>    between virt driver impls & the rest of Nova.
> >>>>
> >>>> I think for this to work well with multiple repositories and drivers
> >>>> having different priorities over implementing changes in the API it
> >>>> would not just need to be semi-stable, but stable with versioning built
> >>>> in from the start to allow for backwards incompatible changes. And
> >>>> the interface would have to be very well documented including things
> >>>> such as what exceptions are allowed to be raised through the API.
> >>>> Hopefully this would be enforced through code as well. But as long as
> >>>> driver maintainers are willing to commit to this extra overhead I can
> >>>> see it working. 
> >>>
> >>> With our primary REST or RPC APIs we're under quite strict rules about
> >>> what we can & can't change - almost impossible to remove an existing
> >>> API from the REST API for example. With the internal virt driver API
> >>> we would probably have a little more freedom. For example, I think
> >>> if we found an existing virt driver API that was insufficient for a
> >>> new bit of work, we could add a new API in parallel with it, give the
> >>> virt drivers 1 dev cycle to convert, and then permanently delete the
> >>> original virt driver API. So a combination of that kind of API
> >>> replacement,  versioning for some data structures/objects, and use of
> >>> the capabilities flags would probably be sufficient. That's what I mean
> >>> by semi-stable here - no need to maintain existing virt driver APIs
> >>> indefinitely - we can remove & replace them in reasonably short time
> >>> scales as long as we avoid any lock-step updates.
> >>
> >> I have spent a lot of time over the last year working on things that
> >> require coordinated code lands between projects.... it's much more
> >> friction than you give it credit for.
> >>
> >> Every added git tree adds a non linear cost to mental overhead, and a
> >> non linear integration cost. Realistically the reason the gate is in the
> >> state it is has a ton to do with the fact that it's integrating 40 git
> >> trees. Because virt drivers run in the process space of Nova Compute,
> >> they can pretty much do whatever, and the impacts are going to be
> >> somewhat hard to figure out.
> >>
> >> Also, if spinning these out seems like the right idea, I think nova-core
> >> needs to retain core rights over the drivers as well. Because there does
> >> need to be veto authority on some of the worst craziness.
> > 
> > If they want to do crazy stuff, let them live or die with the
> > consequences.
> > 
> >> If the VMWare team stopped trying to build a distributed lock manager
> >> inside their compute driver, or the Hyperv team didn't wait until J2 to
> >> start pushing patches, I think there would be more trust in some of
> >> these teams. But, I am seriously concerned in both those cases, and the
> >> slow review there is a function of a historic lack of trust in judgment.
> >> I also personally went on a moratorium a year ago in reviewing either
> >> driver because entities at both places were complaining to my
> >> management chain through back channels that I was -1ing their code...
> > 
> > I venture to suggest that the reason we care so much about those kind
> > of things is precisely because of our policy of pulling them in the
> > tree. Having them in tree means their quality (or not) reflects directly
> > on the project as a whole. Separate them from Nova as a whole and give
> > them control of their own destiny and they can deal with the consequences
> > of their actions and people can judge the results for themselves.
> > 
> > We don't have the time or resources to continue baby-sitting them
> > ourselves - attempting to do so has just resulted in a scenario where
> > they end up getting largely ignored as you admit here. This ultimately
> > makes their quality even worse, because the lack of reviewer availability
> > means they stand little chance of pushing through the work to fix what
> > problems they have. We've seen this first hand with the major refactoring
> > that the vmware driver team has been trying to do. Our current setup where we
> > retain veto and try to control what other people do has directly resulted in
> > the vmware driver suffering poor quality for an even longer time. If vmware
> > had been out of tree the major refactoring they've been trying to merge
> > would have been done 6 months ago, to everyone's benefit. The same is
> > true for the libvirt driver - there's plenty of work I'd like to do to
> > improve it, but cannot even contemplate because there's little to no
> > chance of ever getting it past our fundamental core reviewer bottleneck.
> 
> So here's the thing: Nova without any virt drivers is useless. It does
> matter if there is some working and good implementation, otherwise Nova
> is pointless.
> 
> All the libvirt efforts of late seemed to be around adding NFV features
> which honestly isn't interesting to me. If there was a debt reduction
> push there, I'd definitely sign up to review. But the focus has seemed
> to be on a ton of new features instead, so I've not really gone anywhere
> near it.

The fact that core reviewers like yourself are explicitly admitting
that you're completely ignoring virt driver reviews is precisely
why nova is doomed as it is. The drivers are beholden to core reviewers,
yet are being explicitly ignored and not given the option to
self-organize. That's just an insane way to run a community.

Characterizing all the work as being around NFV features is also
very misleading and disrespectful to the many contributors we have
had wanting to do a wide variety of work. For a start, the NUMA
feature(s) are not solely motivated by NFV needs. Intelligent NUMA
placement is something libvirt has needed for years, so that it can
stop such awful wastage of compute resource on NUMA hardware which
is omnipresent these days. It happens to be important to NFV too
which is great for those people who care about that. Similarly the
large pages and CPU pinning proposals are all about allowing for
improved control over resource usage in libvirt, which is again
broadly useful to anyone deploying Nova. Finally, if we didn't do
this work on NUMA, vendors have clearly stated that they would fork
nova and go and maintain a telco-only version of nova, which would
be an even bigger disaster for the community as a whole. We cannot
afford to simply ignore all features while working on technical
debt; there has to be a balance.

If you want to talk technical debt in libvirt specifically, here's
some actual information on stuff I've tried to do for Juno:

 - libvirt-driver-class-refactor.rst - refactoring the libvirt
   driver to make its maintenance more bearable. Up for review
   since Jul 5 with little to no feedback. Review welcome.

 - libvirt-domain-listing-speedup.rst - improving performance of
   the libvirt driver and removing race conditions in current
   code by making it use the correct libvirt APIs for listing
   guests. By a miracle this one actually merged.

 - Not a blueprint, but I've had another series waiting for review
   to fix inconsistent data reporting for supported instances across
   the virt drivers.

    https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:virt-constants,n,z

   If those patches ever get merged I've got a lot more cleanup work
   ready to follow in the get_available_resources() method. So again,
   review welcome.
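
As a rough illustration of the domain-listing item above: the sketch
below contrasts listing guests with several separate calls against
listing them with one call. It assumes the "correct" API in question is
libvirt's listAllDomains(); the mail itself does not name it. FakeConn
is a hypothetical stand-in so the example runs without a hypervisor,
and the two list_guests_* helpers are illustrative names, not nova code.

```python
class FakeConn:
    """Hypothetical stand-in for a libvirt connection (illustration only)."""

    def __init__(self, running, defined):
        self._running = dict(running)   # domain id -> guest name
        self._defined = list(defined)   # names of inactive guests

    def listDomainsID(self):
        return list(self._running)

    def lookupByID(self, dom_id):
        # Simulate guest 2 shutting down between the list and the lookup.
        # A real connection would raise libvirt.libvirtError here.
        if dom_id == 2:
            raise LookupError("domain %d vanished" % dom_id)
        return self._running[dom_id]

    def listDefinedDomains(self):
        return list(self._defined)

    def lookupByName(self, name):
        return name

    def listAllDomains(self, flags=0):
        # One call returning running and inactive guests together.
        return list(self._running.values()) + self._defined


def list_guests_multi_call(conn):
    """List IDs, then look each one up: guests can change in between."""
    guests = []
    for dom_id in conn.listDomainsID():
        try:
            guests.append(conn.lookupByID(dom_id))
        except LookupError:
            continue  # guest disappeared after it was listed
    for name in conn.listDefinedDomains():
        guests.append(conn.lookupByName(name))
    return guests


def list_guests_single_call(conn):
    """Fetch every guest in one call, avoiding the list/lookup window."""
    return conn.listAllDomains(0)


conn = FakeConn(running={1: "web", 2: "db"}, defined=["backup"])
print(list_guests_multi_call(conn))
print(list_guests_single_call(conn))
```

The multi-call version also needs one round trip per guest, which is
where the performance cost comes from on hosts with many guests.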

And that's not even considering the effort that many other contributors
are trying to put into improving the code here.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|


