<div dir="ltr"><div>Ian, </div><div><br></div><div>I don't like to write anything personally.</div><div>But I have to write some facts:</div><div><br></div><div>1) I see tons of hands and only 2 solutions my and one more that is based on code. </div>
<div>2) My code was published before session (18. Apr 2013)<br></div><div>3) Blueprints from summit were published (03. Mar 2013) </div><div>4) My Blueprints were published (25. May 2013)</div><div>5) Patches based on my patch were published only (5. Jul 2013)</div>
<div><br></div><div><br></div><div>2 All, </div><div><br></div><div>I don't won't to hide anything from community, cores or PTLs. I have only one goal and it is to make OpenStack better.</div><div><br></div><div>Recently I get new task on my job: Scalability/Performance and Benchmarks<br>
</div><div><br></div>
My colleagues and I therefore started investigating the code around the scheduler. (Jiang, sorry for the two-week lag.)

After this investigation and testing, we found that one of the reasons the scheduler is slow and scales poorly is the way it works with the DB.

JOINs are slow and scale badly, and if we add the extra JOIN that PCI passthrough requires, the situation gets much worse.
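
To illustrate the concern, here is a minimal sketch (made-up table names, not Nova's actual models): with a joined load of per-device rows, every host row in the scheduler's query is repeated once per PCI device, so a 1000-host deployment with ~100 devices per host pushes on the order of 100,000 rows through the DB layer for a single scheduling pass.

    # Minimal sketch, not Nova's real schema: one more JOIN multiplies the
    # scheduler's host query by the number of PCI devices per host.
    from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
    from sqlalchemy.orm import (declarative_base, joinedload, relationship,
                                sessionmaker)

    Base = declarative_base()

    class ComputeNode(Base):                      # hypothetical host table
        __tablename__ = 'compute_nodes'
        id = Column(Integer, primary_key=True)
        host = Column(String)
        pci_devices = relationship('PciDevice')

    class PciDevice(Base):                        # hypothetical per-device table
        __tablename__ = 'pci_devices'
        id = Column(Integer, primary_key=True)
        compute_node_id = Column(Integer, ForeignKey('compute_nodes.id'))
        address = Column(String)

    engine = create_engine('sqlite://')
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()

    # The "fetch all hosts" query the scheduler relies on now drags every
    # device row along: rows fetched from the DB ~= hosts * devices_per_host.
    hosts = (session.query(ComputeNode)
             .options(joinedload(ComputeNode.pci_devices))
             .all())
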
So we started looking at how to reconcile two competing goals: scalability and flexibility.

About flexibility: I don't think our current scheduler is convenient. I had to add 800 lines of code just to be able to use a list of JSON objects (with a complex structure) as one more resource in the scheduler. If we don't add a new table, we have to use some kind of key/value storage: with many key/value entries we run into performance and scalability problems, and with everything stored under a single key we get races and a lot of dirty code instead. Either way we end up with the same problems again in the future. Consuming resources from other providers (such as Cinder) is also quite hard.
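
As a toy illustration (plain Python, not Nova code) of the race in the single-key variant: two allocations read the same JSON blob, each writes back its own modified copy, and the second write silently discards the first.

    # Toy example of the lost-update race when all PCI state lives in one
    # key/value entry.  Both "requests" start from the same snapshot of the blob.
    import json

    store = {'pci_devices': json.dumps([{'address': '0000:00:01.0'},
                                        {'address': '0000:00:02.0'}])}

    def allocate_one(snapshot):
        devices = json.loads(snapshot)
        claimed = devices.pop(0)                     # take the first free device
        store['pci_devices'] = json.dumps(devices)   # read-modify-write, no locking
        return claimed

    snapshot = store['pci_devices']
    first = allocate_one(snapshot)
    second = allocate_one(snapshot)    # stale read: the same device is handed out twice

    print(first == second)                    # True - double allocation
    print(json.loads(store['pci_devices']))   # only one device was "consumed" despite two claims
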
So Alexey K., Alexei O. and I found a way to make our scheduler work without the DB, with pretty small changes to the current solution. The new approach gives us scalability and flexibility at the same time.
What scalability means here: "We don't need to store anything about PCI devices in the DB." We just need a small amount of extra code in the resource tracker.
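
To make that concrete, here is a rough sketch (all names made up; this is not the actual patch) of the idea: the compute node's resource tracker keeps the discovered devices in in-memory pools and reports only per-pool free counts to the scheduler, while the compute node itself picks the concrete device at claim time.

    # Rough sketch with assumed names -- not the actual patch.  The resource
    # tracker keeps PCI pools in memory; the scheduler only ever sees counts.
    from collections import defaultdict

    class PciPoolTracker(object):
        """Hypothetical helper living inside the compute node's resource tracker."""

        def __init__(self, discovered_devices):
            # discovered_devices would come from libvirt auto-discovery, e.g.
            # [{'address': '0000:06:00.1', 'vendor_id': '8086', 'product_id': '1520'}, ...]
            self.free = defaultdict(list)
            for dev in discovered_devices:
                self.free[(dev['vendor_id'], dev['product_id'])].append(dev)

        def stats(self):
            """Compact summary sent along with the periodic host state update."""
            return dict((pool, len(devs)) for pool, devs in self.free.items())

        def claim(self, vendor_id, product_id):
            """The compute node, not the scheduler, picks the concrete device."""
            pool = self.free[(vendor_id, product_id)]
            if not pool:
                raise RuntimeError('no free device in pool %s:%s'
                                   % (vendor_id, product_id))
            return pool.pop()

    tracker = PciPoolTracker([
        {'address': '0000:06:00.1', 'vendor_id': '8086', 'product_id': '1520'},
        {'address': '0000:06:00.2', 'vendor_id': '8086', 'product_id': '1520'},
    ])
    print(tracker.stats())    # {('8086', '1520'): 2} is all the scheduler needs
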
I understand that it is too late to implement such things in H-3 (I absolutely agree with Russell), even though they require only 100-150 lines of code.

So if we implement the solution based on my current code, then after the scheduler improvements land we will have to:
1) remove the DB layer completely
2) replace the compute.api layer entirely
3) partially replace the scheduler layer
4) change compute.manager
Only the libvirt support (which still needs improvement) and the auto-discovery code would stay untouched, and they have not been reviewed enough at this point.

So I really don't understand why we should hurry. Why can't we first:
1) prepare the patches around improving the scheduler (before the summit)
2) prepare all the pieces that will stay untouched (libvirt / auto-discovery) (before the summit)
3) discuss all of this one more time at the summit
4) improve and implement all of this work in I-1
5) test and improve it during I-2 and I-3?

I think that would be much better for the OpenStack code overall.

If Russell and the other cores would like to implement the current PCI passthrough approach anyway, I won't block anything, and the DB layer will be finished tomorrow evening.

Best regards,
Boris Pavlovic
---
Mirantis Inc.

On Mon, Jul 22, 2013 at 8:49 PM, Ian Wells <ijw.ubuntu@cack.org.uk> wrote:
> Per the last summit, there are many interested parties waiting on PCI
> support. Boris (who unfortunately wasn't there) jumped in with an
> implementation before the rest of us could get a blueprint up, but I
> suspect he's been stretched rather thinly and progress has been much
> slower than I was hoping it would be. There are many willing hands
> happy to take this work on; perhaps it's time we did, so that we can
> get something in before Havana.
>
> I'm sure we could use a better scheduler. I don't think that actually
> affects most of the implementation of passthrough and I don't think we
> should tie the two together. "The perfect is the enemy of the good."
>
> And as far as the quantity of data passed back - we've discussed
> before that it would be nice (for visibility purposes) to be able to
> see an itemised list of all of the allocated and unallocated PCI
> resources in the database. There could be quite a lot per host (256
> per card x say 10 cards depending on your hardware). But passing that
> itemised list back is somewhat of a luxury - in practice, what you
> need for scheduling is merely a list of categories of card (those
> pools where any one of the PCI cards in the pool would do) and counts.
> The compute node should be choosing a card from the pool in any case.
> The scheduler need only find a machine with cards available.
>
> I'm not totally convinced that passing back the itemised list is
> necessarily a problem, but in any case we can make the decision one
> way or the other, take on the risk if we like, and get the code
> written - if it turns out not to be scalable then we can fix *that* in
> the next release, but at least we'll have something to play with in
> the meantime. Delaying the whole thing to I is just silly.
> --
> Ian.
>
> On 22 July 2013 17:34, Jiang, Yunhong <yunhong.jiang@intel.com> wrote:
> > As for the scalability issue, Boris, are you talking about the VF number issue, i.e. that a physical PCI device can have at most 256 virtual functions?
> >
> > I think we have discussed this before. We should have the compute node manage the VFs, so that VFs belonging to the same PF have only one entry in the DB, with a field indicating the number of free VFs. Then there is no scalability issue, because the number of PCI slots is limited.
> >
> > We didn't implement this mechanism in the current patch set because we agreed to make it an enhancement. If it's really a concern, please raise it and we will enhance our resource tracker for this. That's not a complex task.
> >
> > Thanks
> > --jyh
> >
> >> -----Original Message-----
> >> From: Russell Bryant [mailto:rbryant@redhat.com]
> >> Sent: Monday, July 22, 2013 8:22 AM
> >> To: Jiang, Yunhong
> >> Cc: boris@pavlovic.me; openstack-dev@lists.openstack.org
> >> Subject: Re: The PCI support blueprint
> >>
> >> On 07/22/2013 11:17 AM, Jiang, Yunhong wrote:
> >> > Hi, Boris
> >> > I'm surprised that you want to postpone the PCI support
> >> (https://blueprints.launchpad.net/nova/+spec/pci-passthrough-base) to the I
> >> release. You and our team have been working on this for a long time, and
> >> the patches have been reviewed several rounds. And we have been waiting
> >> for your DB layer patch for two weeks without any update.
> >> >
> >> > Can you give more reasons why it's pushed to the I release? If you are out
> >> of bandwidth, we are sure to take it and push it to the H release!
> >> >
> >> > Is it because you want to base your DB layer on your 'A simple way to
> >> improve nova scheduler'? That really does not make sense to me. Firstly,
> >> that proposal is still under early discussion and has gotten several different voices
> >> already; secondly, PCI support is far more than the DB layer, it includes
> >> the resource tracker, scheduler filter, libvirt support enhancement etc. Even if
> >> we change the scheduler that way after the I release, we need only
> >> change the DB layer, and I don't think that's a big effort!
> >>
> >> Boris mentioned scalability concerns with the current approach on IRC.
> >> I'd like more detail.
> >>
> >> In general, if we can see a reasonable path to upgrade what we have now
> >> to make it better in the future, then we don't need to block it because
> >> of that. If the current approach will result in a large upgrade impact
> >> to users to be able to make it better, that would be a reason to hold
> >> off. It also depends on how serious the scalability concerns are.
> >>
> >> --
> >> Russell Bryant
> >