[openstack-dev] The PCI support blueprint
boris at pavlovic.me
Mon Jul 22 20:09:13 UTC 2013
Hi one more time.
I will refactor the DB layer tomorrow. As I said, I don't want to be a blocker.
On Mon, Jul 22, 2013 at 11:08 PM, Boris Pavlovic <boris at pavlovic.me> wrote:
> I don't like to write anything personal.
> But I have to state some facts:
> 1) I see tons of hands and only 2 solutions: mine and one more that is
> based on my code.
> 2) My code was published before the session (18 Apr 2013)
> 3) Blueprints from the summit were published (03 Mar 2013)
> 4) My blueprints were published (25 May 2013)
> 5) Patches based on my patch were published only on (5 Jul 2013)
> To All,
> I don't want to hide anything from the community, cores, or PTLs. I have
> only one goal, and it is to make OpenStack better.
> Recently I got a new task at my job: Scalability/Performance and Benchmarks.
> So my colleagues and I started investigating some code around the scheduler.
> (Jiang, sorry for the 2-week lag.)
> After making investigations and tests, we found that one of the reasons why
> the scheduler works slowly and has scalability problems is its work with the
> DB. JOINs are a pretty unscalable and slow thing, and if we add one more
> JOIN, as required by PCI passthrough, we will get a much worse situation.
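As a minimal sketch of the JOIN being discussed (the schema below is illustrative, not Nova's actual one), note how a per-device table makes the scheduler's result set grow with the number of devices rather than the number of hosts:

```python
import sqlite3

# Illustrative schema, not Nova's real one: one row per host,
# one row per PCI device, joined at scheduling time.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compute_nodes (id INTEGER PRIMARY KEY, host TEXT);
CREATE TABLE pci_devices (
    id INTEGER PRIMARY KEY,
    compute_node_id INTEGER REFERENCES compute_nodes(id),
    vendor_id TEXT, product_id TEXT, status TEXT);
""")
conn.execute("INSERT INTO compute_nodes VALUES (1, 'node1')")
conn.executemany(
    "INSERT INTO pci_devices VALUES (?, 1, '8086', '10fb', ?)",
    [(i, "available") for i in range(256)])

# The extra JOIN: every scheduling pass drags along one row per
# free device, so a single host with 256 VFs yields 256 rows.
rows = conn.execute("""
    SELECT cn.host, d.vendor_id, d.product_id
    FROM compute_nodes cn
    JOIN pci_devices d ON d.compute_node_id = cn.id
    WHERE d.status = 'available'
""").fetchall()
print(len(rows))  # 256 rows for one host
```

With 10 such cards per host and hundreds of hosts, the itemised JOIN result quickly dwarfs the per-host data the scheduler actually needs.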
> We started investigating how to solve two competing things: scalability vs
> flexibility.
> About flexibility:
> I don't think that our current scheduler is handy: I had to add 800 lines
> of code just to be able to use a list of JSON objects in the scheduler as
> one more resource (with a complex structure). If we don't use a new table,
> we have to use some kind of key/value storage: if we use a lot of key/value
> entries we will get problems with performance and scalability, and if we
> store everything in one key/value entry we will get another problem with
> races and tons of dirty code. So we will hit the same problems in the
> future. Also, using resources from different providers (such as Cinder) is
> a pretty hard task.
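To illustrate the single-entry key/value problem (the store and key names here are made up for the example), any allocation becomes a read-modify-write of the whole blob, which is exactly the shape that races without extra locking:

```python
import json

# Illustrative: all PCI devices for a host crammed into one
# key/value entry as a JSON list.
store = {"node1:pci_devices": json.dumps(
    [{"addr": "0000:06:00.%d" % i, "status": "available"}
     for i in range(4)])}

def allocate(store, key):
    # Read-modify-write of the entire blob: two concurrent callers
    # doing this without a lock or compare-and-swap can both claim
    # the same device -- hence the races and "dirty code" above.
    devices = json.loads(store[key])
    for dev in devices:
        if dev["status"] == "available":
            dev["status"] = "allocated"
            store[key] = json.dumps(devices)
            return dev["addr"]
    return None

first = allocate(store, "node1:pci_devices")
print(first)  # 0000:06:00.0
```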
> So Alexey K., Alexei O., and I found a way to make our scheduler work
> without the DB, with pretty small changes to the current solution.
> The new approach allows us to have scalability and flexibility at the same
> time. What scalability means here: "We don't need to store anything about
> PCI devices in the DB." We just add some small extra code in the resource
> tracker.
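A rough sketch of that "no DB" idea, assuming (this is my reading, not Boris's actual patch) that the resource tracker on each compute node holds the itemised device list in memory and exposes only summary stats:

```python
# Hypothetical sketch: the resource tracker keeps PCI devices in
# memory on the compute node; nothing per-device is persisted.
class PciResourceTracker:
    def __init__(self, devices):
        # devices: list of dicts as discovered on this host
        self.free = list(devices)
        self.allocated = []

    def stats(self):
        # What the scheduler needs: counts per (vendor, product).
        counts = {}
        for dev in self.free:
            key = (dev["vendor_id"], dev["product_id"])
            counts[key] = counts.get(key, 0) + 1
        return counts

    def claim(self, vendor_id, product_id):
        # The compute node, not the scheduler, picks the concrete card.
        for dev in self.free:
            if (dev["vendor_id"], dev["product_id"]) == (vendor_id, product_id):
                self.free.remove(dev)
                self.allocated.append(dev)
                return dev
        raise ValueError("no matching free device")

tracker = PciResourceTracker(
    [{"vendor_id": "8086", "product_id": "10fb", "addr": "0000:06:00.%d" % i}
     for i in range(2)])
print(tracker.stats())   # {('8086', '10fb'): 2}
tracker.claim("8086", "10fb")
print(tracker.stats())   # {('8086', '10fb'): 1}
```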
> I understand that it is too late to implement such things in H-3 (I
> absolutely agree with Russell), even if they require just 100-150 lines of
> code.
> So if we implement the solution based on my current code, then after
> improving the scheduler we will have to:
> 1) completely remove the DB layer
> 2) 100% replace the compute.api layer
> 3) partially replace the scheduler layer
> 4) change compute.manager
> And only libvirt (which should be improved) and auto discovering will be
> untouched (but they are not reviewed enough at this moment).
> So I really don't understand why we should hurry. Why are we not able
> first to:
> 1) Prepare patches around improving the scheduler (before the summit)
> 2) Prepare all the things that will remain untouched (libvirt/auto
> discovering) (before the summit)
> 3) Speak about all this stuff one more time at the summit
> 4) Improve and implement all this work in I-1
> 5) Test and improve it during I-2 and I-3?
> I think that it will be much better for the OpenStack code overall.
> If Russell and other cores would like to implement the current PCI
> passthrough approach anyway, I won't block anything, and the DB layer will
> be finished tomorrow evening.
> Best regards,
> Boris Pavlovic
> Mirantis Inc.
> On Mon, Jul 22, 2013 at 8:49 PM, Ian Wells <ijw.ubuntu at cack.org.uk> wrote:
>> Per the last summit, there are many interested parties waiting on PCI
>> support. Boris (who unfortunately wasn't there) jumped in with an
>> implementation before the rest of us could get a blueprint up, but I
>> suspect he's been stretched rather thinly and progress has been much
>> slower than I was hoping it would be. There are many willing hands
>> happy to take this work on; perhaps it's time we did, so that we can
>> get something in before Havana.
>> I'm sure we could use a better scheduler. I don't think that actually
>> affects most of the implementation of passthrough and I don't think we
>> should tie the two together. "The perfect is the enemy of the good."
>> And as far as the quantity of data passed back - we've discussed
>> before that it would be nice (for visibility purposes) to be able to
>> see an itemised list of all of the allocated and unallocated PCI
>> resources in the database. There could be quite a lot per host (256
>> per card x say 10 cards depending on your hardware). But passing that
>> itemised list back is somewhat of a luxury - in practice, what you
>> need for scheduling is merely a list of categories of card (those
>> pools where any one of the PCI cards in the pool would do) and counts.
>> The compute node should be choosing a card from the pool in any case.
>> The scheduler need only find a machine with cards available.
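Ian's point about pools and counts can be sketched as follows (all names here are illustrative, not Nova's real API): the scheduler sees only per-host pool counts, and host selection is a filter over those counts:

```python
# Collapse an itemised device list into (vendor, product) -> count.
def pool_counts(devices):
    counts = {}
    for dev in devices:
        key = (dev["vendor_id"], dev["product_id"])
        counts[key] = counts.get(key, 0) + 1
    return counts

# Scheduler side: find hosts whose pool count covers the request;
# the compute node later picks the concrete card from its pool.
def hosts_with_device(pools_by_host, vendor_id, product_id, want=1):
    return sorted(h for h, pools in pools_by_host.items()
                  if pools.get((vendor_id, product_id), 0) >= want)

pools_by_host = {
    "node1": pool_counts([{"vendor_id": "8086", "product_id": "10fb"}] * 3),
    "node2": pool_counts([{"vendor_id": "8086", "product_id": "1520"}] * 2),
}
print(hosts_with_device(pools_by_host, "8086", "10fb"))  # ['node1']
```

However many physical cards a host has, what crosses the wire to the scheduler is one small dict per host.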
>> I'm not totally convinced that passing back the itemised list is
>> necessarily a problem, but in any case we can make the decision one
>> way or the other, take on the risk if we like, and get the code
>> written - if it turns out not to be scalable then we can fix *that* in
>> the next release, but at least we'll have something to play with in
>> the meantime. Delaying the whole thing to I is just silly.
>> On 22 July 2013 17:34, Jiang, Yunhong <yunhong.jiang at intel.com> wrote:
>> > As for the scalability issue, Boris, are you talking about the VF
>> > number issue, i.e. that a physical PCI device can have at most 256
>> > virtual functions?
>> > I think we have discussed this before. We should have the compute node
>> > manage the VFs, so that VFs belonging to the same PF will have only one
>> > entry in the DB, with a field indicating the number of free VFs. Thus
>> > there will be no scalability issue, because the number of PCI slots is
>> > limited.
>> > We didn't implement this mechanism in the current patch set because we
>> > agreed to make it an enhancement. If it's really a concern, please raise
>> > it and we will enhance our resource tracker for this. That's not a
>> > complex task.
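The scheme Yunhong describes could look roughly like this (class and field names are illustrative): all VFs of one PF share a single entry carrying a free-VF counter, so 256 VFs cost one row, and the row count is bounded by physical PCI slots rather than by VF count:

```python
# One DB entry per physical function (PF), not per virtual
# function (VF): the counter stands in for up to 256 VF rows.
class PfEntry:
    def __init__(self, pf_addr, total_vfs):
        self.pf_addr = pf_addr
        self.total_vfs = total_vfs
        self.free_vfs = total_vfs

    def claim_vf(self):
        if self.free_vfs == 0:
            raise RuntimeError("no free VFs on " + self.pf_addr)
        self.free_vfs -= 1

    def release_vf(self):
        if self.free_vfs < self.total_vfs:
            self.free_vfs += 1

entry = PfEntry("0000:06:00.0", total_vfs=256)
entry.claim_vf()
print(entry.free_vfs)  # 255
```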
>> > Thanks
>> > --jyh
>> >> -----Original Message-----
>> >> From: Russell Bryant [mailto:rbryant at redhat.com]
>> >> Sent: Monday, July 22, 2013 8:22 AM
>> >> To: Jiang, Yunhong
>> >> Cc: boris at pavlovic.me; openstack-dev at lists.openstack.org
>> >> Subject: Re: The PCI support blueprint
>> >> On 07/22/2013 11:17 AM, Jiang, Yunhong wrote:
>> >> > Hi, Boris
>> >> > I'm surprised that you want to postpone the PCI support
>> >> > (https://blueprints.launchpad.net/nova/+spec/pci-passthrough-base) to
>> >> > the I release. You and our team have been working on this for a long
>> >> > time, and the patches have been reviewed several rounds. And we have
>> >> > been waiting for your DB layer patch for two weeks without any update.
>> >> >
>> >> > Can you give more reason why it's pushed to the I release? If you are
>> >> > out of bandwidth, we are sure to take it and push it to the H release!
>> >> >
>> >> > Is it because you want to base your DB layer on your 'A simple way to
>> >> > improve nova scheduler'? That really does not make sense to me.
>> >> > Firstly, that proposal is still under early discussion and has already
>> >> > gotten several different voices; secondly, PCI support is far more
>> >> > than the DB layer: it includes the resource tracker, scheduler filter,
>> >> > libvirt support enhancement, etc. Even if we change the scheduler that
>> >> > way after the I release, we need only change the DB layer, and I don't
>> >> > think that's a big effort!
>> >> Boris mentioned scalability concerns with the current approach on IRC.
>> >> I'd like more detail.
>> >> In general, if we can see a reasonable path to upgrade what we have now
>> >> to make it better in the future, then we don't need to block it because
>> >> of that. If the current approach will result in a large upgrade impact
>> >> to users to be able to make it better, that would be a reason to hold
>> >> off. It also depends on how serious the scalability concerns are.
>> >> --
>> >> Russell Bryant
>> > _______________________________________________
>> > OpenStack-dev mailing list
>> > OpenStack-dev at lists.openstack.org
>> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev