[openstack-dev] The PCI support blueprint
Boris Pavlovic
boris at pavlovic.me
Mon Jul 22 19:08:47 UTC 2013
Ian,
I don't want to make this personal.
But I have to state some facts:
1) I see tons of willing hands and only 2 solutions: mine and one more
that is based on my code.
2) My code was published before the summit session (18 Apr 2013)
3) Blueprints from the summit were published (3 Mar 2013)
4) My blueprints were published (25 May 2013)
5) Patches based on my patch were published only on 5 Jul 2013
To All,
I don't want to hide anything from the community, cores, or PTLs. I have
only one goal: to make OpenStack better.
Recently I got a new task at my job: scalability/performance and
benchmarks. So my colleagues and I started investigating the code around
the scheduler. (Jiang, sorry for the two-week lag.)
After these investigations and tests we found that one of the reasons the
scheduler is slow and has scalability problems is its work with the DB.
JOINs are slow and scale poorly, and if we add one more JOIN, which PCI
passthrough requires, the situation will get much worse.
We started investigating how to reconcile two competing goals:
scalability vs flexibility.
About flexibility:
I don't think our current scheduler is convenient: I had to add 800 lines
of code just to be able to use a list of JSON objects (with a complex
structure) as one more resource in the scheduler. If we don't use a new
table, we have to use some kind of key/value storage; with a lot of
key/value pairs we get performance and scalability problems, and if we
store everything in one key/value pair we get races and tons of dirty code
instead. So we will hit the same problems again in the future. Also, using
resources from different providers (such as Cinder) is a pretty hard task.
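To illustrate the trade-off, here is a rough sketch (the key and field
names below are made up just to show the shape of the data, not the real
Nova schema):

import json

# One PCI device described as a JSON object with a complex structure.
device = {
    "address": "0000:06:00.1",
    "vendor_id": "8086",
    "product_id": "1520",
    "extra_info": {"phys_function": "0000:06:00.0"},
}

# Variant A: flatten every attribute of every device into key/value rows.
# Many rows per host, and the scheduler has to reassemble them.
flat_rows = {
    "pci.0.address": device["address"],
    "pci.0.vendor_id": device["vendor_id"],
    "pci.0.product_id": device["product_id"],
    "pci.0.extra_info.phys_function": device["extra_info"]["phys_function"],
}

# Variant B: store the whole list in a single key/value pair. Now every
# update has to read-modify-write one big blob, which invites races.
blob_row = {"pci_devices": json.dumps([device])}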
So Alexey K., Alexei O., and I found a way to make our scheduler work
without the DB, with pretty small changes to the current solution.
The new approach gives us scalability and flexibility at the same time.
What scalability means here: "We don't need to store anything about PCI
devices in the DB." We just need to add a little extra code to the
resource tracker.
I understand that it is too late to implement such things in H-3 (I
absolutely agree with Russell). (Even if they require just 100-150 lines of
code.)
So if we implement a solution based on my current code, then after
improving the scheduler we would have to:
1) completely remove the DB layer
2) replace 100% of the compute.api layer
3) partially replace the scheduler layer
4) change compute.manager
Only libvirt (which should be improved) and auto discovery would remain
untouched (but they have not been reviewed enough at this moment).
So I really don't understand why we should hurry. Why can't we first:
1) Prepare patches for improving the scheduler (before the summit)
2) Prepare everything that will remain untouched (libvirt/auto discovery)
(before the summit)
3) Discuss all this stuff one more time at the summit
4) Improve and implement all of this work in I-1?
5) Test and improve it during I-2 and I-3.
I think this will be much better for the OpenStack code overall.
If Russell and other cores would like to implement the current PCI
passthrough approach anyway, I won't block anything, and the DB layer will
be finished tomorrow evening.
Best regards,
Boris Pavlovic
---
Mirantis Inc.
On Mon, Jul 22, 2013 at 8:49 PM, Ian Wells <ijw.ubuntu at cack.org.uk> wrote:
> Per the last summit, there are many interested parties waiting on PCI
> support. Boris (who unfortunately wasn't there) jumped in with an
> implementation before the rest of us could get a blueprint up, but I
> suspect he's been stretched rather thinly and progress has been much
> slower than I was hoping it would be. There are many willing hands
> happy to take this work on; perhaps it's time we did, so that we can
> get something in before Havana.
>
> I'm sure we could use a better scheduler. I don't think that actually
> affects most of the implementation of passthrough and I don't think we
> should tie the two together. "The perfect is the enemy of the good."
>
> And as far as the quantity of data passed back - we've discussed
> before that it would be nice (for visibility purposes) to be able to
> see an itemised list of all of the allocated and unallocated PCI
> resources in the database. There could be quite a lot per host (256
> per card x say 10 cards depending on your hardware). But passing that
> itemised list back is somewhat of a luxury - in practice, what you
> need for scheduling is merely a list of categories of card (those
> pools where any one of the PCI cards in the pool would do) and counts.
> The compute node should be choosing a card from the pool in any case.
> The scheduler need only find a machine with cards available.
>
> I'm not totally convinced that passing back the itemised list is
> necessarily a problem, but in any case we can make the decision one
> way or the other, take on the risk if we like, and get the code
> written - if it turns out not to be scalable then we can fix *that* in
> the next release, but at least we'll have something to play with in
> the meantime. Delaying the whole thing to I is just silly.
> --
> Ian.
>
> On 22 July 2013 17:34, Jiang, Yunhong <yunhong.jiang at intel.com> wrote:
> > As for the scalability issue, Boris, are you talking about the VF number
> > issue, i.e. a physical PCI device can have at most 256 virtual functions?
> >
> > I think we have discussed this before. We should let the compute node
> > manage the VFs, so that VFs belonging to the same PF will have only one
> > entry in the DB, with a field indicating the number of free VFs. Thus
> > there will be no scalability issue, because the number of PCI slots is
> > limited.
> >
> > We didn't implement this mechanism in the current patch set because we
> > agreed to make it an enhancement. If it's really a concern, please raise
> > it and we will enhance our resource tracker for this. That's not a
> > complex task.
> >
> > Thanks
> > --jyh
> >
> >> -----Original Message-----
> >> From: Russell Bryant [mailto:rbryant at redhat.com]
> >> Sent: Monday, July 22, 2013 8:22 AM
> >> To: Jiang, Yunhong
> >> Cc: boris at pavlovic.me; openstack-dev at lists.openstack.org
> >> Subject: Re: The PCI support blueprint
> >>
> >> On 07/22/2013 11:17 AM, Jiang, Yunhong wrote:
> >> > Hi, Boris
> >> > I'm surprised that you want to postpone the PCI support
> >> > (https://blueprints.launchpad.net/nova/+spec/pci-passthrough-base) to
> >> > the I release. You and our team have been working on this for a long
> >> > time, and the patches have been reviewed for several rounds. And we
> >> > have been waiting for your DB layer patch for two weeks without any
> >> > update.
> >> >
> >> > Can you give more reasons why it's pushed to the I release? If you
> >> > are out of bandwidth, we are happy to take it over and push it to the
> >> > H release!
> >> >
> >> > Is it because you want to base your DB layer on your 'A simple way to
> >> > improve nova scheduler' proposal? That really does not make sense to
> >> > me. Firstly, that proposal is still under early discussion and has
> >> > already drawn several different voices; secondly, PCI support is far
> >> > more than the DB layer: it includes the resource tracker, scheduler
> >> > filter, libvirt support enhancements, etc. Even if we change the
> >> > scheduler that way after the I release, we would only need to change
> >> > the DB layer, and I don't think that's a big effort!
> >>
> >> Boris mentioned scalability concerns with the current approach on IRC.
> >> I'd like more detail.
> >>
> >> In general, if we can see a reasonable path to upgrade what we have now
> >> to make it better in the future, then we don't need to block it because
> >> of that. If the current approach will result in a large upgrade impact
> >> to users to be able to make it better, that would be a reason to hold
> >> off. It also depends on how serious the scalability concerns are.
> >>
> >> --
> >> Russell Bryant
> >