[openstack-dev] [nova] Averting the Nova crisis by splitting out virt drivers

Sean Dague sean at dague.net
Fri Sep 5 11:49:04 UTC 2014


On 09/05/2014 07:26 AM, Daniel P. Berrange wrote:
> On Fri, Sep 05, 2014 at 07:00:44AM -0400, Sean Dague wrote:
>> On 09/05/2014 06:22 AM, Daniel P. Berrange wrote:
>>> On Fri, Sep 05, 2014 at 07:31:50PM +0930, Christopher Yeoh wrote:
>>>> On Thu, 4 Sep 2014 11:24:29 +0100
>>>> "Daniel P. Berrange" <berrange at redhat.com> wrote:
>>>>>
>>>>>  - A fairly significant amount of nova code would need to be
>>>>>    considered semi-stable API. Certainly everything under nova/virt
>>>>>    and any object which is passed in/out of the virt driver API.
>>>>>    Changes to such APIs would have to be done in a backwards
>>>>>    compatible manner, since it is no longer possible to lock-step
>>>>>    change all the virt driver impls. In some ways I think this would
>>>>>    be a good thing as it will encourage people to put more thought
>>>>>    into the long term maintainability of nova internal code instead
>>>>>    of relying on being able to rip it apart later, at will.
>>>>>
>>>>>  - The nova/virt/driver.py class would need to be much better
>>>>>    specified. All parameters / return values which are opaque dicts
>>>>>    must be replaced with objects + attributes. Completion of the
>>>>>    objectification work is mandatory, so there is cleaner separation
>>>>>    between virt driver impls & the rest of Nova.
>>>>
>>>> I think for this to work well with multiple repositories and drivers
>>>> having different priorities over implementing changes in the API it
>>>> would not just need to be semi-stable, but stable with versioning built
>>>> in from the start to allow for backwards incompatible changes. And
>>>> the interface would have to be very well documented including things
>>>> such as what exceptions are allowed to be raised through the API.
>>>> Hopefully this would be enforced through code as well. But as long as
>>>> driver maintainers are willing to commit to this extra overhead I can
>>>> see it working. 
>>>
>>> With our primary REST or RPC APIs we're under quite strict rules about
>>> what we can & can't change - almost impossible to remove an existing
>>> API from the REST API for example. With the internal virt driver API
>>> we would probably have a little more freedom. For example, I think
>>> if we found an existing virt driver API that was insufficient for a
>>> new bit of work, we could add a new API in parallel with it, give the
>>> virt drivers 1 dev cycle to convert, and then permanently delete the
>>> original virt driver API. So a combination of that kind of API
>>> replacement,  versioning for some data structures/objects, and use of
>>> the capabilities flags would probably be sufficient. That's what I mean
>>> by semi-stable here - no need to maintain existing virt driver APIs
>>> indefinitely - we can remove & replace them in reasonably short time
>>> scales as long as we avoid any lock-step updates.
>>
>> I have spent a lot of time over the last year working on things that
>> require coordinated code lands between projects... it's much more
>> friction than you give it credit for.
>>
>> Every added git tree adds a non-linear cost to mental overhead, and a
>> non-linear integration cost. Realistically the reason the gate is in the
>> state it is has a ton to do with the fact that it's integrating 40 git
>> trees. Because virt drivers run in the process space of Nova Compute,
>> they can pretty much do whatever, and the impacts are going to be
>> somewhat hard to figure out.
>>
>> Also, if spinning these out seems like the right idea, I think nova-core
>> needs to retain core rights over the drivers as well, because there does
>> need to be veto authority on some of the worst craziness.
> 
> If they want to do crazy stuff, let them live or die with the
> consequences.
> 
>> If the VMware team stopped trying to build a distributed lock manager
>> inside their compute driver, or the Hyper-V team didn't wait until J2 to
>> start pushing patches, I think there would be more trust in some of
>> these teams. But, I am seriously concerned in both those cases, and the
>> slow review there is a function of a historic lack of trust in judgment.
>> I also personally went on a moratorium a year ago in reviewing either
>> driver because entities at both places were complaining to my
>> management chain through back channels that I was -1ing their code...
> 
> I venture to suggest that the reason we care so much about those kind
> of things is precisely because of our policy of pulling them in the
> tree. Having them in tree means their quality (or not) reflects directly
> on the project as a whole. Separate them from Nova as a whole and give
> them control of their own destiny and they can deal with the consequences
> of their actions and people can judge the results for themselves.
> 
> We don't have the time or resources to continue baby-sitting them
> ourselves - attempting to do so has just resulted in a scenario where
> they end up getting largely ignored as you admit here. This ultimately
> makes their quality even worse, because the lack of reviewer availability
> means they stand little chance of pushing through the work to fix what
> problems they have. We've seen this first hand with the major refactoring
> that the vmware driver team has been trying to do. Our current setup where we
> retain veto and try to control what other people do has directly resulted in
> the vmware driver suffering poor quality for an even longer time. If vmware
> had been out of tree the major refactoring they've been trying to merge
> would have been done 6 months ago, to everyone's benefit. The same is
> true for the libvirt driver - there's plenty of work I'd like to do to
> improve it, but cannot even contemplate because there's little to no
> chance of ever getting it past our fundamental core reviewer bottleneck.

So here's the thing: Nova without any virt drivers is useless. It does
matter if there is some working and good implementation, otherwise Nova
is pointless.

All the libvirt efforts of late seemed to be around adding NFV features
which honestly isn't interesting to me. If there was a debt reduction
push there, I'd definitely sign up to review. But the focus has seemed
to be on a ton of new features instead, so I've not really gone anywhere
near it.

Which I think is a piece of the current bottleneck. People are
effectively -1 voting on a ton of stuff by ignoring it because they
think it's a bad idea (i.e. foot dragging). Maybe that would be an
interesting piece of feedback in the specs process, a 'will probably not
review' comment by Nova core members. I've honestly always wanted a kill
thread in gerrit for this reason, because I think its output, showing how
many people are kill-threading certain reviews, would be spectacular
feedback.

	-Sean

-- 
Sean Dague
http://dague.net


