[openstack-dev] [nova] [neutron] Todays' meeting log: PCI pass-through network support

Irena Berezovsky irenab at mellanox.com
Mon Dec 23 13:34:39 UTC 2013


Hi,
Is there a ‘PCI pass-through network’ IRC meeting tomorrow?

BR,
Irena
From: Robert Li (baoli) [mailto:baoli at cisco.com]
Sent: Tuesday, December 17, 2013 5:32 PM
To: Sandhya Dasu (sadasu); OpenStack Development Mailing List (not for usage questions); Jiang, Yunhong; Irena Berezovsky; prashant.upadhyaya at aricent.com; chris.friesen at windriver.com; Itzik Brown; john at johngarbutt.com
Subject: Re: [openstack-dev] [nova] [neutron] Todays' meeting log: PCI pass-through network support

Sorry guys, I didn't #startmeeting before the meeting. But here is the log from today's meeting. Updated the subject a bit.

<irenab> baoli: hi
[08:57] <baoli> Hi Irena
[08:58] <irenab> baoli: unfortunately I cannot participate actively today, will try to follow the log and email later today
[08:59] <baoli> ok
[09:03] <baoli> Hi, is Yongli there?
[09:04] == yjiang51 [yjiang5 at nat/intel/x-uobnfwflcweybytj] has joined #openstack-meeting-alt
[09:04] <yjiang51> baoli: hi
[09:05] <baoli> yjiang: hi
[09:05] <yjiang51> baoli: do we have the meeting?
[09:05] <baoli> Yes, it's on. Hopefully, Yongli will join
[09:07] <yjiang51> baoli: got it and thanks
[09:07] == heyongli [~yhe at 221.216.132.130] has joined #openstack-meeting-alt
[09:07] <baoli> yhe, HI
[09:08] <heyongli> hello, every one
[09:08] <yjiang51> heyongli: hi
[09:08] <baoli> Hi everyone, let's start
[09:08] <baoli> Yongli has summarized his wiki with his email
[09:09] <heyongli> i just arrived home from the hospital, sorry i'm late
[09:10] <baoli> yhe, np. Hopefully, you are well
[09:10] <heyongli> my son.  so i think you might worry about the use case, right?
[09:10] <baoli> Can we start with the pci-flavor/pci-group definition? Do we agree that they are the same?
[09:11] <heyongli> in my brain, it's a filter with a name, but in the flat dict structure, no sub pci-filter
[09:12] <baoli> Well, we want to agree conceptually.
[09:13] <heyongli> cause for me it's just the whitelist with a name, so conceptually it's simple, can be described clearly in this way
[09:14] <baoli> Ok. So, they all define a group of devices with similar properties.
[09:15] <heyongli> agree
[09:15] <baoli> great
[09:16] <heyongli> any other concern for the flavor?
[09:16] <baoli> Now, it seems to me that pci-flavor can be defined by both nova API and by means of configuration
[09:16] <baoli> from your email
[09:16] <heyongli> config is going to fade out
[09:17] <heyongli> for config fade out, any concern?
[09:17] <baoli> in your email, what is "admin config sriov"?
[09:17] <heyongli> just mean this step is done by admin
[09:18] <heyongli> John wants the picture for user and for admin to be clearly defined
[09:19] <baoli> We have some concerns over phasing out the configuration
[09:19] <baoli> Did you check the log from last meeting?
[09:19] <heyongli> i did, but i don't see a strong reason
[09:20] <baoli> How, in your mind, is nova pci-flavor-update going to be used?
[09:20] <heyongli> it just sets the whole content for the filter
[09:21] <baoli> Well, I'd like to know who is going to invoke it and when
[09:21] <heyongli> totally replace or set the new definition for the flavor
[09:21] == ijw [~ijw at nat/cisco/x-urnealzfvlrtqrbx] has joined #openstack-meeting-alt
[09:21] <heyongli> define this, then the device passes the whitelist and gets grouped into a flavor
[09:22] <ijw> Sorry I'm late
[09:22] <baoli> ijw: np
[09:23] <heyongli> this is just the whitelist's DB version, via API
[09:24] <ijw> Apologies for jumping in, but did we do the API/no-API discussion yet?
[09:24] <heyongli> current topic
[09:25] <baoli> heyongli: let's assume a new compute node is added, what do you do to provision it?
[09:25] <heyongli> 2.1.1 admin check PCI devices present per host
[09:25] <ijw> I would ask, given that Openstack's design tenets are all about decentralising where possible, why would you centralise the entirety of the PCI information?
[09:26] <ijw> Have to admit I came a bit late to that document - because all the work was going on in the other document
[09:26] <ijw> Which didn't mention this at all
[09:26] <heyongli> this is not relevant to the tenet, it's the admin's work
[09:27] <ijw> It's actually not the problem.  It's not that it's not relevant to the tenant, it's why you have to actively do anything to add a compute node at all.  In every other respect a compute node joins the cluster with no activity
[09:28] <ijw> So, for instance, I boot a compute node, RAM goes up, disk goes up, CPUs go up, but I've not had to edit a central table to do that, the compute node reports in and it just happens.
[09:28] <ijw> I like this - it means when I provision a cluster I just have to get each node to provision correctly and the cluster is up.  Conversely when the node goes down the resources go away.
[09:29] <heyongli> cause pci-flavor is  global, you don't need to config it specifically,
[09:29] <ijw> So I would strongly argue that the nodes should decide what PCI passthrough devices they have, independently and without reference to central authority.
[09:30] <ijw> Yes, but that says that all my nodes are either identical or similar, and while that may be true it makes more sense to keep that configuration on and with the machine rather than in a central DB just in case it's not.
[09:30] <heyongli> ijw: suppose you had 500 servers brought in, all with the same configuration, like the same slot for the same pci device
[09:31] <ijw> Yup, then I would boot them up all with the same config file on each, same as I install the same software on each.  That's a devops problem and it's got plenty of solutions.
[09:31] <baoli> heyongli, a pci-flavor is a global name. But what's part of a pci-flavor is a matter of the compute host that supports that flavor
[09:31] <heyongli> then you've got this flow to easily bring them all up ready for pci: export the flavor in the aggregate
[09:31] <ijw> heyongli: If I were doing this with puppet, or chef, or ansible, or whatever, I would work out what type of host I had and put a config on it to suit.  This is solving a problem that doesn't exist.
[09:32] <ijw> And aggregates divide machines by location, generally, not type.
[09:33] <ijw> In summary, do not like.  I don't understand why it's a good idea to use APIs to describe basic hardware details.
[09:33] <baoli> heyongli: I think that you agreed the aggregate is a high level construct. It has nothing to do with how a compute node decides what devices belong to which pci-flavor/pci-group
[09:33] <heyongli> i might be wrong, but the aggregate bp says it's a subgroup of hosts with the same property; that's why the aggregate's metadata and the scheduler do their work
[09:34] <ijw> Aggregates are there for scheduling, though, not provisioning
[09:34] <baoli> heyongli: i have no problem with nova pci-flavor-create, but with nova pci-flavor-update
[09:34] <baoli> so, aggregate can still work
[09:34] <ijw> I have a problem with using APIs and the database to do this *at all*.
[09:35] <heyongli> what's that?
[09:35] <ijw> That we shouldn't be storing this information centrally.  This is exactly what per-host config files are for.
[09:36] <baoli> ijw: let's focus on the API versus configuration. Let's not diverge into use of the DB.
[09:36] <ijw> Also, this is not something that changes on a whim, it changes precisely and only when the hardware in your cluster changes, so it seems to me that using a config file will make that happen per the devops comments above, and using APIs is solving a problem that doesn't really exist.
[09:37] <heyongli> actually i argued that the aggregate is for provisioning, and failed
[09:37] <ijw> baoli: there's no distinction to speak of.  The APIs clearly change a data model that lives somewhere that is not on the individual compute hosts.
[09:38] <ijw> So, why do we need this to be changeable by API at all, and why should the information be stored centrally?  These are the two questions I want answers to for this proposal to make sense.
[09:38] <heyongli> hi, ijw, if you use per-host settings there still needs to be a central thing: the alias, but the alias is fading out also
[09:39] <ijw> No, you don't, you can work out aliases/groups/whatever by what compute hosts report.  Only the scheduler needs to know it and it can work it out on the fly.
[09:39] <heyongli> so the global flavor combines the whitelist and the flavor
[09:39] <heyongli> if there is no global thing, how do you know there is 'sth' ready for use?
[09:40] <ijw> That's what the scheduler does.  Practically speaking you never know if you can schedule a machine until you schedule a machine.
[09:40] <yjiang51> ijw: heyongli, I think we need to persuade John if we have anything different. Is it possible to get John on this meeting?
[09:40] <ijw> The only difference in what you're saying is that you couldn't validate a launch command against groups when it's placed, and that's certainly a weakness, but not a very big one.
[09:41] <heyongli> ijw: no, you must provide your request to the scheduler, so how do you want to tell the scheduler what you want?
[09:41] <ijw> Which John?
[09:41] <ijw> extra_specs in the flavor.
[09:41] <ijw> Listing PCI aliases and counts rather than PCI flavors.
[09:42] <ijw> This assumes that your aliases are named by string so that you can refer to them (which is an idea I largely stole from the way provider networks work, btw)
[09:43] <baoli> heyongli: I guess that we didn't do a good job in the google doc in describing how the pci-group works. Otherwise, it describes exactly why the alias is not needed, and pci-group should work
[09:43] <ijw> So, in my scheme: 1. you tell the compute host that PCI device x is usable by passthrough with flavor 'fred'.  You schedule a machine requesting one of 'fred' in its flavor, and the scheduler finds the host.  This is back to the simple mechanism we have now, I don't really think it needs complicating.
[09:44] <ijw> Sorry, s/flavor/group/ in the first location in that last comment.
[09:45] == heyongli [~yhe at 221.216.132.130] has quit [Ping timeout: 248 seconds]
[09:47] <ijw> Bad time for network trouble…
[09:47] <yjiang51> ijw: yes, seems he lost the connection
[09:50] <yjiang51> ijw: but I agree that needing to create a pci flavor each time to make a compute node's PCI information available seems not so straightforward.
[09:51] == heyongli [~yhe at 221.216.132.130] has joined #openstack-meeting-alt
[09:51] <ijw> Well, turning this around the other way, if you described the groups of PCI devices that a compute node was offering in the configuration of the compute node, what's the problem with that?
[09:52] <heyongli> ijw: np, but the alias was killed during the blueprint review
[09:52] <baoli> keep in mind, this is a provisioning task on the part of compute nodes
[09:52] <heyongli> btw: i lost my connection, so i don't know if you saw this; i'll just paste it again:
[09:53] <heyongli> <heyongli> yeah, what's in the extra_spec?
[09:53] <heyongli> <heyongli> currently in the extra spec is the alias,  what would you save in there?
[09:53] <heyongli> <heyongli> no matter what you save there, it will be a global thing, or something like the alias as currently implemented.
[09:53] <heyongli> <heyongli> you cannot eliminate a global thing there, but the room for argument is where it should be defined
[09:53] <heyongli> <heyongli> where it is
[09:53] <heyongli> <heyongli> and another topic/TODO: the Nova community wants to see some code for this design for further evaluation
[09:53] <heyongli> <heyongli> i'm working on it, so we can make some progress
[09:53] <baoli> heyongli: it's <pci-flavor:no>
[09:53] <baoli> sorry <pci-flavor:#of devices>
[09:54] <heyongli> baoli:  i'm lost, what do you mean
[09:54] <ijw> heyongli: er, since we're working on two documents I don't even know which document review you're talking about.
[09:54] <baoli> in the nova flavor, you can do pci-flavor (or pci_group): 2 in the extra_specs
[09:55] <heyongli> ijw: i paste the link there long time ago
[09:55] <heyongli> for review, only bp is valid... am i right?
[09:55] <ijw> I think it's fairly reasonable to say that at this point 'pci flavor', 'alias' and 'group' are all synonyms.  Whichever we use we're talking about a PCI device type we want to allocate.
[09:56] <ijw> heyongli: no, not really - this isn't a formal process, we're trying to reach agreement here.
[09:56] <heyongli> ijw: yep, the current in-tree code uses the synonyms: whitelist, alias
[09:57] <ijw> What we agree we want: to be able to nominate devices by a fairly flexible method on a host (down to host/path and as widely as vendor/device) to a specific group; to schedule a machine with a combination of device allocations from various groups.  Right so far?
[09:57] <ijw> I think that's the core of where we agree.
[09:58] <heyongli> ijw: right, i think. i agree with this, and part of this is in tree, except group.
[09:58] <ijw> Beyond that, there are two different proposals, one with an API and one which is config driven.  How do we choose between them?
[09:58] <heyongli> ijw: for me this is a trade off.
[09:59] <ijw> For me, it's not - I see the API as lots more complex and also harder to use
[09:59] <heyongli> configuring many, many machines has a scale problem
[10:00] <ijw> But if you're configuring many machines, then there's no problem, because you have a deployment system that will configure them identically.  I do 10 node clusters automatically, I'm sure if I have 500 there's going to be no logging into them and accidentally typoing the config
[10:00] <baoli> heyongli: it's not really a scale problem in terms of provisioning
[10:00] <ijw> So that's a non-problem and I think we should remove that from the discussion
[10:01] <ijw> (Note this is different from host aggregates - I might aggregate hosts by physical location or by power strip, things I absolutely can't determine automatically, so there's no parallel there)
[10:03] <heyongli> the aggregate can be used with pci, but it doesn't have to be this way; without the aggregate it should still work.
[10:08] <ijw> OK, we're out of time, I think we have to take this to the list.
[10:09] <ijw> To which end I've just mailed out what I was saying.



On 12/17/13 10:09 AM, "Ian Wells" <ijw.ubuntu at cack.org.uk<mailto:ijw.ubuntu at cack.org.uk>> wrote:

Reiterating from the IRC meeting, largely, so apologies.
Firstly, I disagree that https://wiki.openstack.org/wiki/PCI_passthrough_SRIOV_support is an accurate reflection of the current state.  It's a very unilateral view, largely because the rest of us had been focussing on the google document that we've been using for weeks.

Secondly, I totally disagree with this approach.  This assumes that description of the (cloud-internal, hardware) details of each compute node is best done with data stored centrally and driven by an API.  I don't agree with either of these points.
Firstly, the best place to describe what's available on a compute node is in the configuration on the compute node.  For instance, I describe which interfaces do what in Neutron on the compute node.  This is because when you're provisioning nodes, that's the moment you know how you've attached it to the network and what hardware you've put in it and what you intend the hardware to be for - or conversely your deployment puppet or chef or whatever knows it, and Razor or MAAS has enumerated it, but the activities are equivalent.  Storing it centrally distances the compute node from its descriptive information for no good purpose that I can see and adds the complexity of having to go make remote requests just to start up.
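Concretely, the per-host description could look something like the existing whitelist option extended with a group name. This is only a sketch: the "group" key is the proposed extension rather than an existing nova option, and the device IDs and group names are made up for illustration.

```ini
# /etc/nova/nova.conf on each compute node (sketch -- the "group" key is
# hypothetical; the in-tree whitelist matches on fields like vendor_id
# and product_id without naming a group)
pci_passthrough_whitelist = {"vendor_id": "8086", "product_id": "10ca", "group": "privateNIC"}
pci_passthrough_whitelist = {"address": "0000:02:*.*", "group": "smallGPU"}
```

With something like this, each node declares its own groups at startup and the scheduler aggregates what the nodes report, with no central table to edit.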
Secondly, even if you did store this centrally, it's not clear to me that an API is very useful.  As far as I can see, the need for an API is really the need to manage PCI device flavors.  If you want that to be API-managed, then the rest of a (rather complex) API cascades from that one choice.  Most of the things that API lets you change (expressions describing PCI devices) are the sort of thing that you set once and only revisit when you start - for instance - deploying new hosts in a different way.

Consider the parallel in Neutron provider networks.  They're config driven, largely on the compute hosts.  Agents know what ports on their machine (the hardware tie) are associated with provider networks, by provider network name.  The controller takes 'neutron net-create ... --provider:network 'name'' and uses that to tie a virtual network to the provider network definition on each host.  What we absolutely don't do is have a complex admin API that lets us say 'in host aggregate 4, provider network x (which I made earlier) is connected to eth6'.

--
Ian.


On 17 December 2013 03:12, yongli he <yongli.he at intel.com<mailto:yongli.he at intel.com>> wrote:
On December 16, 2013, 22:27, Robert Li (baoli) wrote:
Hi Yongli,

The IRC meeting we have for PCI-Passthrough is the forum for discussion on SR-IOV support in openstack. I think the goal is to come up with a plan on both the nova and neutron side in support of the SR-IOV, and the current focus is on the nova side. Since you've done a lot of work on it already, would you like to lead tomorrow's discussion at UTC 1400?

Robert, you lead the meeting very well; i enjoy everything you set up for us, keep going on it -:)

I'd like to give you guys a summary of the current state; let's discuss it then.
https://wiki.openstack.org/wiki/PCI_passthrough_SRIOV_support


1)  fade out the alias (i think this is ok for all)
2)  the whitelist becomes pci-flavor (i think this is ok for all)
3)  address simple regular expression support: only * and a number range [hex-hex] are supported (i think this is ok?)
4)  aggregate: now it's clear enough, and won't impact SRIOV.  (i think this is irrelevant to SRIOV now)


5)  SRIOV use case: if you suggest a use case, please give a full example like this: [discuss: compare to other solutions]


·         create a pci flavor for the SRIOV

  nova pci-flavor-create  name 'vlan-SRIOV'  description "xxxxx"

  nova pci-flavor-update UUID  set    'description'='xxxx'   'address'= '0000:01:*.7'


Admin config SRIOV
·         create pci-flavor :

   {"name": "privateNIC", "neutron-network-uuid": "uuid-1", ...}

   {"name": "publicNIC", "neutron-network-uuid": "uuid-2", ...}

   {"name": "smallGPU", "neutron-network-uuid": "", ...}
·         set the aggregate metadata according to the flavors existing on the hosts

flavor extra-specs, for a VM that gets two small GPUs and VIFs attached from the above SRIOV NICs:

   nova aggregate-set-metadata pci-aware-group set 'pci-flavor'='smallGPU,oldGPU, privateNIC,privateNIC'
·         create instance flavor for sriov

    nova flavor-key 100 set  'pci-flavor='1:privateNIC;  1: publicNIC;  2:smallGPU,oldGPU'
·         User just specifies a quantum port as normal:

   nova boot --flavor "sriov-plus-two-gpu" --image img --nic net-id=uuid-2 --nic net-id=uuid-1 vm-name
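The 'pci-flavor' extra spec in the flow above packs counts and alternative flavors into one string: '<count>:<flavor>[,<alt-flavor>...]' entries separated by ';'. A scheduler-side parse of that proposed syntax might look like the following sketch (parse_pci_requests is an illustrative name, not an existing nova function, and the grammar is only what the example string implies):

```python
def parse_pci_requests(extra_spec):
    """Parse a proposed 'pci-flavor' extra_spec value such as
    '1:privateNIC; 1:publicNIC; 2:smallGPU,oldGPU' into a list of
    (count, [acceptable flavor names]) requests.

    Sketch of the syntax shown in the example flow above; not an
    existing nova format or API."""
    requests = []
    for chunk in extra_spec.split(';'):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Split only on the first ':' so flavor names stay intact.
        count, names = chunk.split(':', 1)
        requests.append((int(count), [n.strip() for n in names.split(',')]))
    return requests

print(parse_pci_requests('1:privateNIC;  1: publicNIC;  2:smallGPU,oldGPU'))
# [(1, ['privateNIC']), (1, ['publicNIC']), (2, ['smallGPU', 'oldGPU'])]
```

Each tuple is one request: allocate <count> devices, any of the listed flavors being acceptable (so '2:smallGPU,oldGPU' means two devices drawn from either group).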


Yongli




Thanks,
Robert

On 12/11/13 8:09 PM, "He, Yongli" <yongli.he at intel.com<mailto:yongli.he at intel.com>> wrote:

Hi, all
Please continue to focus on the blueprint; it changed after reviewing.  And for this point:


>5. flavor style for sriov: i just list the flavor style in the design but for the style
>              --nic
>                   --pci-flavor  PowerfullNIC:1
 >  still possible to work, so what's the real impact to sriov from the flavor design?

>As you can see from the log, Irena has some strong opinions on this, and I tend to agree with her. The problem we need to solve is this: we need a means to associate a nic (or port) with a PCI device that is allocated out of a PCI >flavor or a PCI group. We think that we presented a complete solution in our google doc.
It’s not so clear, could you please list the key points here? Btw, the blueprint I sent Monday has changed for this, please check.


Yongli he




From: Robert Li (baoli) [mailto:baoli at cisco.com]
Sent: Wednesday, December 11, 2013 10:18 PM
To: He, Yongli; Sandhya Dasu (sadasu); OpenStack Development Mailing List (not for usage questions); Jiang, Yunhong; Irena Berezovsky; prashant.upadhyaya at aricent.com<mailto:prashant.upadhyaya at aricent.com>; chris.friesen at windriver.com<mailto:chris.friesen at windriver.com>; Itzik Brown; john at johngarbutt.com<mailto:john at johngarbutt.com>
Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support

Hi Yongli,

Thank you very much for sharing the Wiki with us on Monday so that we have a better understanding on your ideas and thoughts. Please see embedded comments.

--Robert

On 12/10/13 8:35 PM, "yongli he" <yongli.he at intel.com<mailto:yongli.he at intel.com>> wrote:

On December 10, 2013, 22:41, Sandhya Dasu (sadasu) wrote:
Hi,
   I am trying to resurrect this email thread since discussion has split between several threads and it is becoming hard to keep track.

An update:

New PCI Passthrough meeting time: Tuesdays UTC 1400.

New PCI flavor proposal from Nova:
https://wiki.openstack.org/wiki/PCI_configration_Database_and_API#Take_advantage_of_host_aggregate_.28T.B.D.29
Hi, all
  sorry for missing the meeting, i was seeking John at that time. from the log i saw some concerns about the new design; i list them here and try to clarify each per my opinion:

1. configuration going to be deprecated:   this might impact SRIOV.  if possible, please list what kind of impact it makes for you.

Regarding the nova API pci-flavor-update, we had a face-to-face discussion over use of a nova API to provision/define/configure the PCI passthrough list during the Icehouse summit. I kind of liked the idea initially. As you can see from the meeting log, however, I later thought that in a distributed system, using a centralized API to define resources per compute node, which could come and go at any time, doesn't seem to provide any significant benefit. This is the reason that I didn't mention it in our google doc https://docs.google.com/document/d/1EMwDg9J8zOxzvTnQJ9HwZdiotaVstFWKIuKrPse6JOs/edit#<https://docs.google.com/document/d/1EMwDg9J8zOxzvTnQJ9HwZdiotaVstFWKIuKrPse6JOs/edit>

If you agree that pci-flavor and pci-group is kind of the same thing, then we agree with you that the pci-flavor-create API is needed. Since pci-flavor or pci-group is global, then such an API can be used for resource registration/validation on nova server. In addition, it can be used to facilitate the display of PCI devices per node, per group, or in the entire cloud, etc.




2. <baoli>So the API seems to be combining the whitelist + pci-group
    yeah, it's actually almost the same thing: 'flavor', 'pci-group' or 'group'. the real difference is that this flavor is going to deprecate the alias, and combine tightly with the aggregate or flavor.

Well, with pci-group, we recommended deprecating the PCI alias because we think it is redundant.

We think that specification of PCI requirement in the flavor's extra spec is still needed as it's a generic means to allocate PCI devices. In addition, it can be used as properties in the host aggregate as well.



3. feature:
   this design is not to say the feature won't work, but it is changed.  if the auto discovery feature is possible, we get 'features' from the device, then use the feature to define the pci-flavor.  it's also possible to create a default pci-flavor for this. so the feature concept will be impacted; my feeling is we should have a separate bp for feature, not in this round of change, so the only thing here is to keep the feature possible.

I think that it's ok to have separate BPs. But we think that auto discovery is an essential part of the design, and therefore it should be implemented with more helping hands.



4. address regular expression: i'm fine with the wild-match style.

Sounds good. One side note: I noticed that the driver for Intel 82576 cards has a strange slot assignment scheme, so the final definition may need to accommodate that as well.
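Just to make the wildcard-match idea concrete, here is a minimal sketch of how matching PCI addresses (domain:bus:slot.function) against a wildcard pattern could work, using Python's fnmatch. The function name and the sample addresses are illustrative, not actual nova code:

```python
from fnmatch import fnmatch

def address_matches(pattern: str, address: str) -> bool:
    """Return True if a PCI address such as '0000:06:10.1' matches a
    wildcard pattern such as '0000:06:*.1'."""
    return fnmatch(address, pattern)

# Select all devices on bus 06 whose function is 1:
devices = ["0000:06:00.1", "0000:06:10.1", "0000:06:10.3", "0000:07:00.1"]
matched = [d for d in devices if address_matches("0000:06:*.1", d)]
```

Whether plain shell-style wildcards are expressive enough for the 82576's odd slot layout is exactly the open question above; a full regular expression would be the fallback.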


5. flavor style for sriov: I just listed the flavor style in the design, but the style
              --nic
                   --pci-flavor  PowerfullNIC:1
   should still work, so what's the real impact to SRIOV from the flavor design?

As you can see from the log, Irena has some strong opinions on this, and I tend to agree with her. The problem we need to solve is this: we need a means to associate a NIC (or port) with a PCI device allocated out of a PCI flavor or a PCI group. We think that we presented a complete solution in our google doc.


At this point, I really believe that we should combine our efforts and ideas. As far as how many BPs are needed, it should be a trivial matter after we have agreed on a complete solution.


Yongli He



Thanks,
Sandhya


From: Sandhya Dasu <sadasu at cisco.com<mailto:sadasu at cisco.com>>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org<mailto:openstack-dev at lists.openstack.org>>
Date: Thursday, November 7, 2013 9:44 PM
To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org<mailto:openstack-dev at lists.openstack.org>>, "Jiang, Yunhong" <yunhong.jiang at intel.com<mailto:yunhong.jiang at intel.com>>, "Robert Li (baoli)" <baoli at cisco.com<mailto:baoli at cisco.com>>, Irena Berezovsky <irenab at mellanox.com<mailto:irenab at mellanox.com>>, "prashant.upadhyaya at aricent.com<mailto:prashant.upadhyaya at aricent.com>" <prashant.upadhyaya at aricent.com<mailto:prashant.upadhyaya at aricent.com>>, "chris.friesen at windriver.com<mailto:chris.friesen at windriver.com>" <chris.friesen at windriver.com<mailto:chris.friesen at windriver.com>>, "He, Yongli" <yongli.he at intel.com<mailto:yongli.he at intel.com>>, Itzik Brown <ItzikB at mellanox.com<mailto:ItzikB at mellanox.com>>
Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support

Hi,
     The discussions during the summit were very productive. Now, we are ready to set up our IRC meeting.

Here are some slots that look like they might work for us.

1. Wed 2 – 3 pm UTC.
2. Thursday 12 – 1 pm UTC.
3. Thursday 7 – 8pm UTC.

Please vote.

Thanks,
Sandhya

From: Sandhya Dasu <sadasu at cisco.com<mailto:sadasu at cisco.com>>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org<mailto:openstack-dev at lists.openstack.org>>
Date: Tuesday, November 5, 2013 12:03 PM
To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org<mailto:openstack-dev at lists.openstack.org>>, "Jiang, Yunhong" <yunhong.jiang at intel.com<mailto:yunhong.jiang at intel.com>>, "Robert Li (baoli)" <baoli at cisco.com<mailto:baoli at cisco.com>>, Irena Berezovsky <irenab at mellanox.com<mailto:irenab at mellanox.com>>, "prashant.upadhyaya at aricent.com<mailto:prashant.upadhyaya at aricent.com>" <prashant.upadhyaya at aricent.com<mailto:prashant.upadhyaya at aricent.com>>, "chris.friesen at windriver.com<mailto:chris.friesen at windriver.com>" <chris.friesen at windriver.com<mailto:chris.friesen at windriver.com>>, "He, Yongli" <yongli.he at intel.com<mailto:yongli.he at intel.com>>, Itzik Brown <ItzikB at mellanox.com<mailto:ItzikB at mellanox.com>>
Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support

Just to clarify, the discussion is planned for 10 AM Wednesday morning at the developer's lounge.

Thanks,
Sandhya

From: Sandhya Dasu <sadasu at cisco.com<mailto:sadasu at cisco.com>>
Reply-To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org<mailto:openstack-dev at lists.openstack.org>>
Date: Tuesday, November 5, 2013 11:38 AM
To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org<mailto:openstack-dev at lists.openstack.org>>, "Jiang, Yunhong" <yunhong.jiang at intel.com<mailto:yunhong.jiang at intel.com>>, "Robert Li (baoli)" <baoli at cisco.com<mailto:baoli at cisco.com>>, Irena Berezovsky <irenab at mellanox.com<mailto:irenab at mellanox.com>>, "prashant.upadhyaya at aricent.com<mailto:prashant.upadhyaya at aricent.com>" <prashant.upadhyaya at aricent.com<mailto:prashant.upadhyaya at aricent.com>>, "chris.friesen at windriver.com<mailto:chris.friesen at windriver.com>" <chris.friesen at windriver.com<mailto:chris.friesen at windriver.com>>, "He, Yongli" <yongli.he at intel.com<mailto:yongli.he at intel.com>>, Itzik Brown <ItzikB at mellanox.com<mailto:ItzikB at mellanox.com>>
Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support

Hi,
    We are planning to have a discussion at the developer's lounge tomorrow morning at 10:00 am. Please feel free to drop by if you are interested.

Thanks,
Sandhya

From: <Jiang>, Yunhong <yunhong.jiang at intel.com<mailto:yunhong.jiang at intel.com>>
Date: Thursday, October 31, 2013 6:21 PM
To: "Robert Li (baoli)" <baoli at cisco.com<mailto:baoli at cisco.com>>, Irena Berezovsky <irenab at mellanox.com<mailto:irenab at mellanox.com>>, "prashant.upadhyaya at aricent.com<mailto:prashant.upadhyaya at aricent.com>" <prashant.upadhyaya at aricent.com<mailto:prashant.upadhyaya at aricent.com>>, "chris.friesen at windriver.com<mailto:chris.friesen at windriver.com>" <chris.friesen at windriver.com<mailto:chris.friesen at windriver.com>>, "He, Yongli" <yongli.he at intel.com<mailto:yongli.he at intel.com>>, Itzik Brown <ItzikB at mellanox.com<mailto:ItzikB at mellanox.com>>
Cc: OpenStack Development Mailing List <openstack-dev at lists.openstack.org<mailto:openstack-dev at lists.openstack.org>>, "Brian Bowen (brbowen)" <brbowen at cisco.com<mailto:brbowen at cisco.com>>, "Kyle Mestery (kmestery)" <kmestery at cisco.com<mailto:kmestery at cisco.com>>, Sandhya Dasu <sadasu at cisco.com<mailto:sadasu at cisco.com>>
Subject: RE: [openstack-dev] [nova] [neutron] PCI pass-through network support

Robert, I think your change request for the PCI alias should be covered by the extra info enhancement: https://blueprints.launchpad.net/nova/+spec/pci-extra-info, which Yongli is working on.

I'm not sure how the port profile is passed to the connected switch: is it a Cisco VM-FEX-specific method or a libvirt method? Sorry, I'm not well versed on the networking side.

--jyh

From: Robert Li (baoli) [mailto:baoli at cisco.com]
Sent: Wednesday, October 30, 2013 10:13 AM
To: Irena Berezovsky; Jiang, Yunhong; prashant.upadhyaya at aricent.com<mailto:prashant.upadhyaya at aricent.com>; chris.friesen at windriver.com<mailto:chris.friesen at windriver.com>; He, Yongli; Itzik Brown
Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu)
Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support

Hi,

Regarding physical network mapping, this is what I think.

Consider the following scenarios:
   1. a compute node with SRIOV-only interfaces attached to a physical network; the node is connected to one upstream switch
   2. a compute node with both SRIOV and non-SRIOV interfaces attached to a physical network; the node is connected to one upstream switch
   3. in addition to cases 1 & 2, a compute node may have multiple vNICs that are connected to different upstream switches.

CASE 1:
 -- the mapping from a virtual network (in neutron terms) to a physical network is actually done by binding a port profile to a neutron port. With Cisco's VM-FEX, a port profile is associated with one or more vlans. Once the neutron port is bound to this port profile in the upstream switch, it's effectively plugged into the physical network.
 -- since the compute node is connected to one upstream switch, the existing nova PCI alias will be sufficient. For example, one can boot a Nova instance that is attached to an SRIOV port with the following command:
          nova boot --flavor m1.large --image <image-id> --nic net-id=<net>,pci-alias=<alias>,sriov=<direct|macvtap>,port-profile=<profile>
    the net-id will be useful for allocating IP addresses, enabling dhcp, and other services associated with the network.
-- the pci-alias specified in the nova boot command is used to create a PCI request for scheduling purposes. A PCI device is bound to a neutron port at instance build time in the nova boot case. Before invoking the neutron API to create a port, an allocated PCI device from the PCI alias is located in the PCI device list object. This device info, among other information, is sent to neutron to create the port.
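The allocation step described above can be sketched roughly as follows. All the names (the function, the dict keys, the binding:profile payload shape) are made up for illustration and are not actual nova code:

```python
# Illustrative sketch: pick a free PCI device matching the alias spec
# from the host's device list, mark it allocated, and build the info
# that would be sent to neutron when creating the port.

def allocate_for_port(pci_devices, alias_spec, net_id):
    """alias_spec: attributes the device must match,
    e.g. {'vendor_id': '8086', 'product_id': '10ca'}."""
    for dev in pci_devices:
        if dev.get('allocated'):
            continue
        if all(dev.get(k) == v for k, v in alias_spec.items()):
            dev['allocated'] = True
            # Hypothetical payload handed to neutron's create-port call.
            return {
                'network_id': net_id,
                'binding:profile': {
                    'pci_vendor_info': f"{dev['vendor_id']}:{dev['product_id']}",
                    'pci_slot': dev['address'],
                },
            }
    raise RuntimeError('no free PCI device matches the alias')
```

The point is only the ordering: the device is claimed from the tracked list first, and its identity travels to neutron inside the port-creation request.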

CASE 2:
-- Assume that OVS is used for the non-SRIOV interfaces. An example configuration with the ovs plugin would look like:
            bridge_mappings = physnet1:br-vmfex
            network_vlan_ranges = physnet1:15:17
            tenant_network_type = vlan
    When a neutron network is created, a vlan is either allocated or specified in the neutron net-create command. Attaching a physical interface to the bridge (br-vmfex in the example above) is an administrative task.
-- to create a Nova instance with a non-SRIOV port:
           nova boot --flavor m1.large --image <image-id> --nic net-id=<net>
-- to create a Nova instance with an SRIOV port:
           nova boot --flavor m1.large --image <image-id> --nic net-id=<net>,pci-alias=<alias>,sriov=<direct|macvtap>,port-profile=<profile>
    It's essentially the same as in the first case. But since the net-id is already associated with a vlan, the vlan associated with the port-profile must be identical to that vlan; this has to be enforced by neutron.
    Again, since the node is connected to one upstream switch, the existing nova PCI alias should be sufficient.
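The enforcement neutron would need here could be as simple as the following sketch. This is purely illustrative (VM-FEX port profiles carry one or more vlans, so membership rather than strict equality is assumed; the function name is hypothetical):

```python
# Illustrative validation: the port-profile's VLAN set must contain the
# VLAN already bound to the neutron network, or the port is rejected.

def validate_port_profile(network_vlan: int, profile_vlans: set[int]) -> None:
    if network_vlan not in profile_vlans:
        raise ValueError(
            f"port-profile VLANs {sorted(profile_vlans)} do not include "
            f"network VLAN {network_vlan}")

validate_port_profile(15, {15, 16})  # network vlan 15 is covered: ok
```

Where exactly this check lives (core plugin vs. a mechanism driver) is part of the open design question.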

CASE 3:
-- A compute node might be connected to multiple upstream switches, with each being a separate network. This means SRIOV PFs/VFs are already implicitly associated with physical networks. In the non-SRIOV case, a physical interface is associated with a physical network by plugging it into that network and attaching it to the ovs bridge that represents this physical network on the compute node. In the SRIOV case, we need a way to group the SRIOV VFs that belong to the same physical network. The existing nova PCI alias facilitates PCI device allocation by associating <product_id, vendor_id> with an alias name. This will no longer be sufficient, but it can be enhanced to achieve our goal. For example, the PCI device domain and bus (if their mapping to vNICs is fixed across boots) may be added into the alias, and the alias name would then correspond to a list of tuples.

Another consideration is that a VF or PF might be used on the host for other purposes. For example, it's possible for a neutron DHCP server to be bound to a VF. Therefore, there needs to be a method to exclude some VFs from a group. One way is to associate an exclude list with an alias.

The enhanced PCI alias can be used to support features other than neutron as well. Essentially, a PCI alias can be defined as a group of PCI devices associated with a feature. I think this should be addressed in a separate blueprint.
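Putting the two ideas above together (a list of tuples plus an exclude list), an enhanced alias might look like this. The structure and field names are made up to illustrate the proposal, not an actual nova config format:

```python
# Sketch of an "enhanced" alias: a named group of device-matching tuples
# plus an exclude list. Purely illustrative.

ALIAS = {
    'name': 'physnet1-vfs',
    'devices': [
        # (vendor_id, product_id, domain, bus): the bus identifies which
        # upstream switch the VFs hang off, per the discussion above.
        ('8086', '10ca', '0000', '06'),
        ('8086', '10ca', '0000', '07'),
    ],
    # e.g. a VF reserved on the host for the neutron DHCP server:
    'exclude': ['0000:06:00.1'],
}

def in_alias(alias, dev):
    """True if the device belongs to the alias group and is not excluded."""
    if dev['address'] in alias['exclude']:
        return False
    key = (dev['vendor_id'], dev['product_id'], dev['domain'], dev['bus'])
    return key in set(alias['devices'])
```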

Thanks,
Robert

On 10/30/13 12:59 AM, "Irena Berezovsky" <irenab at mellanox.com<mailto:irenab at mellanox.com>> wrote:

Hi,
Please see my answers inline

From: Jiang, Yunhong [mailto:yunhong.jiang at intel.com]
Sent: Tuesday, October 29, 2013 10:17 PM
To: Irena Berezovsky; Robert Li (baoli); prashant.upadhyaya at aricent.com<mailto:prashant.upadhyaya at aricent.com>; chris.friesen at windriver.com<mailto:chris.friesen at windriver.com>; He, Yongli; Itzik Brown
Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu)
Subject: RE: [openstack-dev] [nova] [neutron] PCI pass-through network support

Your explanation of the virtual network and physical network is quite clear and should work well. We need to change nova code to achieve it, including getting the physical network for the virtual network, passing the physical network requirement to the filter properties, etc.
[IrenaB] The physical network is already available to nova (in networking/nova/api) as a virtual network attribute; it is then passed to the VIF driver. We will soon push the fix to https://bugs.launchpad.net/nova/+bug/1239606, which will provide general support for getting this information.

For your port method, do you mean we are sure to pass the network id to 'nova boot' and nova will create the port during VM boot? Also, how does nova know that it needs to allocate a PCI device for the port? I'd suppose that in an SR-IOV NIC environment, the user doesn't need to specify the PCI requirement. Instead, the PCI requirement should come from the network configuration and image properties. Or do you think the user still needs to pass a flavor with a PCI request?
[IrenaB] There are two ways to apply the port method. One is to pass the network id on nova boot and use the default vnic type as chosen in the neutron config file. The other is to define a port with the required vnic type and other properties if applicable, and run 'nova boot' with the port id argument. Going forward with nova support for PCI device awareness, we do need a way to influence the scheduler's choice so the VM lands on a suitable host with an available PCI device that has the required connectivity.

--jyh


From: Irena Berezovsky [mailto:irenab at mellanox.com]
Sent: Tuesday, October 29, 2013 3:17 AM
To: Jiang, Yunhong; Robert Li (baoli); prashant.upadhyaya at aricent.com<mailto:prashant.upadhyaya at aricent.com>; chris.friesen at windriver.com<mailto:chris.friesen at windriver.com>; He, Yongli; Itzik Brown
Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu)
Subject: RE: [openstack-dev] [nova] [neutron] PCI pass-through network support

Hi Jiang, Robert,
IRC meeting option works for me.
If I understand your question below, you are looking for a way to tie together the requested virtual network(s) and the requested PCI device(s). The way we did it in our solution is to map a provider:physical_network to an interface that represents the Physical Function. Every virtual network is bound to a provider:physical_network, so the PCI device should be allocated based on this mapping. We can map a PCI alias to the provider:physical_network.

Another topic to discuss is where the mapping between a neutron port and a PCI device should be managed. One way to solve it is to propagate the allocated PCI device details to neutron on port creation.
In case there is no qbg/qbh support, the VF networking configuration should be applied locally on the host.
The question is when and how to apply the networking configuration on the PCI device?
We see the following options:

•         It can be done on port creation.

•         It can be done when the nova VIF driver is called for vNIC plugging. This requires either having all networking configuration available to the VIF driver or sending a request to the neutron server to obtain it.

•         It can be done by having a dedicated L2 neutron agent on each host that scans for allocated PCI devices and then retrieves the networking configuration from the server and configures the device. The agent would also be responsible for handling update requests coming from the neutron server.


For macvtap vNIC type assignment, the networking configuration can be applied by a dedicated L2 neutron agent.
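The dedicated-agent option above is essentially a scan/fetch/apply loop. A very rough sketch, with the scan, fetch, and apply steps injected as callables (the real agent would read sysfs for VFs and RPC to the neutron server; everything here is hypothetical):

```python
# One pass of a hypothetical L2 agent loop: configure any newly
# allocated VFs that have not been configured yet.

def agent_iteration(scan_vfs, fetch_config, apply_config, configured):
    """Returns the VF addresses configured in this pass.

    scan_vfs()            -> iterable of allocated VF addresses on the host
    fetch_config(addr)    -> networking config from the neutron server
    apply_config(addr, c) -> applies the config locally
    configured            -> set of addresses already handled (mutated)
    """
    done = []
    for vf_addr in scan_vfs():
        if vf_addr in configured:
            continue  # already configured in a previous pass
        apply_config(vf_addr, fetch_config(vf_addr))
        configured.add(vf_addr)
        done.append(vf_addr)
    return done
```

Handling server-pushed updates would sit alongside this loop, but the idempotent "skip what's already configured" shape is the core of the agent approach.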

BR,
Irena

From: Jiang, Yunhong [mailto:yunhong.jiang at intel.com]
Sent: Tuesday, October 29, 2013 9:04 AM

To: Robert Li (baoli); Irena Berezovsky; prashant.upadhyaya at aricent.com<mailto:prashant.upadhyaya at aricent.com>; chris.friesen at windriver.com<mailto:chris.friesen at windriver.com>; He, Yongli; Itzik Brown
Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu)
Subject: RE: [openstack-dev] [nova] [neutron] PCI pass-through network support

Robert, is it possible to have an IRC meeting? I'd prefer an IRC meeting because it's more OpenStack style and also keeps clear minutes.

To your flow, can you give a more detailed example? For example, consider a user specifying the instance with a --nic option that specifies a network id; how does nova derive the requirement for the PCI device? I assume the network id defines the switches the device can connect to, but how is that information translated into a PCI property requirement? Will this translation happen before the nova scheduler makes the host decision?

Thanks
--jyh

From: Robert Li (baoli) [mailto:baoli at cisco.com]
Sent: Monday, October 28, 2013 12:22 PM
To: Irena Berezovsky; prashant.upadhyaya at aricent.com<mailto:prashant.upadhyaya at aricent.com>; Jiang, Yunhong; chris.friesen at windriver.com<mailto:chris.friesen at windriver.com>; He, Yongli; Itzik Brown
Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu)
Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support

Hi Irena,

Thank you very much for your comments. See inline.

--Robert

On 10/27/13 3:48 AM, "Irena Berezovsky" <irenab at mellanox.com<mailto:irenab at mellanox.com>> wrote:

Hi Robert,
Thank you very much for sharing the information regarding your efforts. Can you please share your idea of the end-to-end flow? How do you suggest binding Nova and Neutron?

The end-to-end flow is actually encompassed in the blueprints in a nutshell; I will reiterate it below. The binding between Nova and Neutron occurs via the neutron v2 API, which nova invokes in order to provision neutron services. The VIF driver is responsible for plugging an instance into the networking setup that neutron has created on the host.

Normally, one invokes the "nova boot" API with the --nic option to specify the NIC with which the instance will be connected to the network. It currently allows a net-id, fixed IP and/or port-id to be specified. However, it doesn't allow one to specify special networking requirements for the instance. Thanks to the nova pci-passthrough work, one can specify PCI passthrough device(s) in the nova flavor, but that doesn't provide a means to tie these PCI devices, in the case of Ethernet adapters, to networking services. So the idea, as indicated by the blueprint titles, is simply to provide a means to tie SRIOV devices to neutron services. A work flow would roughly look like this for 'nova boot':

      -- Specify networking requirements in the --nic option. Specifically for SRIOV, allow the following to be specified in addition to the existing required information:
               . PCI alias
               . direct pci-passthrough/macvtap
               . port profileid that is compliant with 802.1Qbh

        The above information is optional. In its absence, the existing behavior remains.

     -- if special networking requirements exist, the Nova API creates PCI requests in the nova instance type for scheduling purposes

     -- the Nova scheduler schedules the instance based on the requested flavor plus the PCI requests created for networking.

     -- Nova compute invokes neutron services with the PCI passthrough information, if any

     --  Neutron performs its normal operations based on the request, such as allocating a port, assigning IP addresses, etc. Specific to SRIOV, it should validate information such as the profileid and store it in its DB. It's also possible to associate a port profileid with a neutron network so that the profileid becomes optional in the --nic option. Neutron returns the port information to nova, especially the PCI passthrough related information in the port binding object. Currently, the port binding object contains the following information:
          binding:vif_type
          binding:host_id
          binding:profile
          binding:capabilities

    -- nova constructs the domain XML and plugs in the instance by calling the VIF driver. The VIF driver can build up the interface XML based on the port binding information.
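For a directly assigned VF, the interface XML the VIF driver would emit is libvirt's <interface type='hostdev'> element. A hand-rolled sketch of building it from the port binding (the helper function and its arguments are illustrative, not nova's actual VIF driver code):

```python
# Illustration: build libvirt's <interface type='hostdev'> element from
# the PCI slot, MAC, and VLAN that would come out of the port binding.

def hostdev_interface_xml(pci_slot: str, mac: str, vlan: int) -> str:
    domain, bus, rest = pci_slot.split(':')   # '0000:06:10.1'
    slot, function = rest.split('.')
    return (
        "<interface type='hostdev' managed='yes'>\n"
        f"  <mac address='{mac}'/>\n"
        "  <source>\n"
        f"    <address type='pci' domain='0x{domain}' bus='0x{bus}' "
        f"slot='0x{slot}' function='0x{function}'/>\n"
        "  </source>\n"
        f"  <vlan><tag id='{vlan}'/></vlan>\n"
        "</interface>"
    )

print(hostdev_interface_xml('0000:06:10.1', 'fa:16:3e:00:00:01', 15))
```

The 802.1Qbh port-profile case would carry a <virtualport type='802.1Qbh'> element instead of (or alongside) the vlan tag, but the binding-to-XML translation step is the same.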




The blueprints you registered make sense. On the Nova side, there is a need to bind between the requested virtual network and the PCI device/interface to be allocated as a vNIC.
On the Neutron side, there is a need to support networking configuration of the vNIC. Neutron should be able to identify the PCI device/macvtap interface in order to apply the configuration. I think it makes sense to provide neutron integration via a dedicated Modular Layer 2 Mechanism Driver to allow PCI pass-through vNIC support alongside other networking technologies.

I haven't sorted through this yet. A neutron port could be associated with a PCI device or not, which is a common feature, IMHO. However, an ML2 driver may be needed specific to a particular SRIOV technology.


During the Havana release, we introduced the Mellanox Neutron plugin, which enables networking via SRIOV pass-through devices or macvtap interfaces.
We want to integrate our solution with the PCI pass-through Nova support. I will be glad to share more details if you are interested.


Good to know that you already have a SRIOV implementation. I found out some information online about the mlnx plugin, but need more time to get to know it better. And certainly I'm interested in knowing its details.

The PCI pass-through networking support is planned to be discussed during the summit: http://summit.openstack.org/cfp/details/129. I think it's worth drilling down into a more detailed proposal and presenting it during the summit, especially since it impacts both the nova and neutron projects.

I agree. Maybe we can steal some time in that discussion.

Would you be interested in collaborating on this effort? Would you be interested in exchanging more emails or setting up an IRC/WebEx meeting this week before the summit?

Sure. If folks want to discuss it before the summit, we can schedule a WebEx later this week. Otherwise, we can continue the discussion over email.



Regards,
Irena

From: Robert Li (baoli) [mailto:baoli at cisco.com]
Sent: Friday, October 25, 2013 11:16 PM
To: prashant.upadhyaya at aricent.com<mailto:prashant.upadhyaya at aricent.com>; Irena Berezovsky; yunhong.jiang at intel.com<mailto:yunhong.jiang at intel.com>; chris.friesen at windriver.com<mailto:chris.friesen at windriver.com>; yongli.he at intel.com<mailto:yongli.he at intel.com>
Cc: OpenStack Development Mailing List; Brian Bowen (brbowen); Kyle Mestery (kmestery); Sandhya Dasu (sadasu)
Subject: Re: [openstack-dev] [nova] [neutron] PCI pass-through network support

Hi Irena,

This is Robert Li from Cisco Systems. Recently, I was tasked to investigate such support for Cisco's systems that support VM-FEX, which is an SRIOV technology supporting 802.1Qbh. I was able to bring up nova instances with SRIOV interfaces and establish networking between the instances that employ the SRIOV interfaces. Certainly, this was accomplished with hacking and some manual intervention. Based on this experience and my study of the two existing nova pci-passthrough blueprints that have been implemented and committed into Havana (https://blueprints.launchpad.net/nova/+spec/pci-passthrough-base and
https://blueprints.launchpad.net/nova/+spec/pci-passthrough-libvirt), I registered a couple of blueprints (one on the Nova side, the other on the Neutron side):

https://blueprints.launchpad.net/nova/+spec/pci-passthrough-sriov
https://blueprints.launchpad.net/neutron/+spec/pci-passthrough-sriov

in order to address SRIOV support in openstack.

Please take a look at them and see if they make sense, and let me know of any comments and questions. We can also discuss this at the summit, I suppose.

I noticed that there is another thread on this topic, so I'm copying those folks from that thread as well.

thanks,
Robert

On 10/16/13 4:32 PM, "Irena Berezovsky" <irenab at mellanox.com<mailto:irenab at mellanox.com>> wrote:

Hi,
One of the next steps for PCI pass-through that I would like to discuss is support for PCI pass-through vNICs.
While nova takes care of PCI pass-through device resource management and VIF settings, neutron should manage their networking configuration.
I would like to register a summit proposal to discuss support for PCI pass-through networking.
I am not sure what would be the right topic under which to discuss PCI pass-through networking, since it involves both nova and neutron.
There is already a session registered by Yongli on the nova topic to discuss the PCI pass-through next steps.
I think PCI pass-through networking is quite a big topic and worth a separate discussion.
Are there other people who are interested in discussing it and sharing their thoughts and experience?

Regards,
Irena




_______________________________________________
OpenStack-dev mailing list
OpenStack-dev at lists.openstack.org<mailto:OpenStack-dev at lists.openstack.org>
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
