[placement][nova][ptg] resource provider affinity
Spec: https://review.openstack.org/650476
From the commit message:
To support NUMA and similar concepts, this proposes the ability to request resources from different providers nested under a common subtree (below the root provider).

There's much in the feature described by the spec and the surrounding context that is frequently a source of contention in the placement group, so working through this spec is probably going to require some robust discussion. Doing most of that before the PTG will help make sure we're not going in circles in person.

Some of the areas of potential contention:

* Adequate for limited but maybe not all use case solutions
* Strict trait constructionism
* Evolving the complexity of placement solely for the satisfaction of hardware representation in Nova
* Inventory-less resource providers
* Developing new features in placement before existing features are fully used in client services
* Others?

I list this not because they are deal breakers or the only thing that matters, but because they have presented stumbling blocks in the past and we may as well work to address them (or make an agreement to punt them until later) otherwise there will be lingering dread.

And, beyond all that squishy stuff, there is the necessary discussion over the solution described in the spec. There are several alternatives listed in the spec, and a few more in the comments. We'd like to figure out the best solution that can actually be done in a reasonable amount of time, not the best solution in the absolute.

Discuss!

-- Chris Dent ٩◔̯◔۶ https://anticdent.org/ freenode: cdent tw: @anticdent
I feel like good progress is occurring in the
As of this writing, I'm backing totally off of the solution proposed there and instead favoring one of:

Solution 1: Assume "subtree" means "exactly one tier away from the root" and express groupings of, er, request groups as needing to come from "the same subtree". We should *not* do this by making a queryparam whose value contains request group numbers, for reasons explained in the spec/comments. Any other attempt to represent groups-of-request-groups in a querystring is (IMO) untenable, so we should cut over to accepting the query as a JSON payload. Here's a sample of what it would look like for, "give me proc, mem, and two different nets, all from the same NUMA node; and disk from wherever":

{ groups: [
    { requests: [
        { resources: {DISK_GB: 10} }
      ] },
    { requests: [
        { resources: {VCPU: 2, MEMORY_MB: 128} },
        { resources: {VF: 1}, required: [NET_A] },
        { resources: {VF: 1}, required: [NET_B] }
      ],
      group_policy: subtree_affinity }
  ] }

Solution 2: Change the data model in some way TBD - your inputs here.
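(Back on Solution 1: purely for concreteness, and with every name and spelling below invented rather than agreed, a first cut of that body's shape could be pinned down with a jsonschema along these lines. It only tries to capture parity with today's querystring plus the one new group_policy value.)

# Illustrative only: resource class names map to integer amounts, traits
# are strings, and group_policy grows one new value.
A_C_BODY_SCHEMA = {
    "type": "object",
    "required": ["groups"],
    "additionalProperties": False,
    "properties": {
        "groups": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["requests"],
                "additionalProperties": False,
                "properties": {
                    "requests": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "required": ["resources"],
                            "additionalProperties": False,
                            "properties": {
                                "resources": {
                                    "type": "object",
                                    "additionalProperties": {"type": "integer"},
                                },
                                "required": {
                                    "type": "array",
                                    "items": {"type": "string"},
                                },
                            },
                        },
                    },
                    "group_policy": {
                        "type": "string",
                        "enum": ["none", "isolate", "subtree_affinity"],
                    },
                },
            },
        },
    },
}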
Some of the areas of potential contention:
* Adequate for limited but maybe not all use case solutions
Solution 1 is actually *less* limited than the original proposal, in that it supports the "different nets" bit of the above example. The only limitation that remains I think is that subtrees *have* to start one tier away from the root. And as discussed, there's no known use case where that's a problem.
* Strict trait constructionism
No traits, no contention.
* Evolving the complexity of placement solely for the satisfaction of hardware representation in Nova
Mm. I get that it would be nice to identify non-Nova needs for this feature, but hopefully you're not suggesting that we should avoid doing it if we can't find some.
* Inventory-less resource providers
An interesting topic, but IMO not related to $subject. This will come up as soon as we model NUMA + sharing at all. We shouldn't muddy these waters by hashing it out here.
* Developing new features in placement before existing features are fully used in client services
Are we not pretty close on this? Are there any placement features that don't have client uses either already implemented or proposed for Train? Again, IMO not a thing to even consider blocking on. And also not specific to this topic.
I list this not because they are deal breakers or the only thing that matters, but because they have presented stumbling blocks in the past and we may as well work to address them (or make an agreement to punt them until later) otherwise there will be lingering dread.
How about we do that in a separate thread, then?
We'd like to figure out the best solution that can actually be done in a reasonable amount of time, not the best solution in the absolute.
Yeah, so to that point, I fear analysis paralysis if we decide to go JSON query, trying to get the schema absolutely perfect. Parity with the existing querystring impl plus the subtree affinity feature ought to be good enough to start with. efried .
On Wed, 10 Apr 2019, Eric Fried wrote:
I feel like good progress is occurring in the
As of this writing, I'm backing totally off of the solution proposed there and instead favoring one of:
Solution 1: Assume "subtree" means "exactly one tier away from the root" and express groupings of, er, request groups as needing to come from "the same subtree". We should *not* do this by making a queryparam whose value contains request group numbers, for reasons explained in the spec/comments. Any other attempt to represent groups-of-request-groups in a querystring is (IMO) untenable, so we should cut over to accepting the query as a JSON payload. Here's a sample of what it would look like for, "give me proc, mem, and two different nets, all from the same NUMA node; and disk from wherever":
{ groups: [
    { requests: [
        { resources: {DISK_GB: 10} }
      ] },
    { requests: [
        { resources: {VCPU: 2, MEMORY_MB: 128} },
        { resources: {VF: 1}, required: [NET_A] },
        { resources: {VF: 1}, required: [NET_B] }
      ],
      group_policy: subtree_affinity }
  ] }
For those of us not in your head, can you explain how the above is different/better from the following pseudo-query:

resources=DISK_GB;
resources1=VCPU:2,MEMORY_MB:128;
resources2=VF:1;required2=NET_A;group_policy2=subtree_rooter:resources1;
resources3=VF:1;required3=NET_B;group_policy3=subtree_rooter:resources1

Apologies if I'm missing some details. I probably am, that's why I'm asking. I'm intentionally ignoring your "should not do this by making a queryparam ... group numbers" because I didn't fully understand the explanation/reasoning in the discussion on the spec, so I'm after additional explanation (that is, if we have to merge request groups we still need to preserve the distinctiveness of the groups, and if some form of a hierarchical relationship is present we need to maintain that).

To be clear, I'm not trying to block the JSON body, I'm trying to understand. That is: my request for an explanation of the differences is exactly that.
* Evolving the complexity of placement solely for the satisfaction of hardware representation in Nova
Mm. I get that it would be nice to identify non-Nova needs for this feature, but hopefully you're not suggesting that we should avoid doing it if we can't find some.
As I said on the spec at https://review.openstack.org/#/c/650476/1/doc/source/specs/train/approved/20... , what I'm after is "a way that placement can work that happens to satisfy NUMA" (using placement-related primitives), as opposed to making a solution that is directly modeled off NUMA concepts. I'm pretty sure we're not doing the latter, but it is a thing to be aware of while having a wide-open conversation (which is what this is). I'm _not_ suggesting we avoid it.
* Inventory-less resource providers
An interesting topic, but IMO not related to $subject. This will come up as soon as we model NUMA + sharing at all. We shouldn't muddy these waters by hashing it out here.
I'm sorry to beat on this drum over and over again, but the reason to have this pre-PTG stuff is exactly to churn up the waters and get all the ideas out in the open so that we are thinking about the entire system, not specific details.
* Developing new features in placement before existing features are fully used in client services
Are we not pretty close on this? Are there any placement features that don't have client uses either already implemented or proposed for Train? Again, IMO not a thing to even consider blocking on. And also not specific to this topic.
It's nothing to do with blocking, it's about being aware. Strict adherence to thread discipline is not desired in this process. Ramble. Talk. Spitball. We have the time. Use it.
I list this not because they are deal breakers or the only thing that matters, but because they have presented stumbling blocks in the past and we may as well work to address them (or make an agreement to punt them until later) otherwise there will be lingering dread.
How about we do that in a separate thread, then?
Why? See my two paragraphs above and my facepalming in another thread.
We'd like to figure out the best solution that can actually be done in a reasonable amount of time, not the best solution in the absolute.
Yeah, so to that point, I fear analysis paralysis if we decide to go JSON query, trying to get the schema absolutely perfect. Parity with the existing querystring impl plus the subtree affinity feature ought to be good enough to start with.
I suspect that if we can come up with a sufficient explanation for why query params are not gonna do it (which might be relatively painless) we might, if we give ourselves the leeway to hash it out effectively, come up with a reasonable set of constraints for an initial version (which we have license to evolve).

-- Chris Dent ٩◔̯◔۶ https://anticdent.org/ freenode: cdent tw: @anticdent
{ groups: [
    { requests: [
        { resources: {DISK_GB: 10} }
      ] },
    { requests: [
        { resources: {VCPU: 2, MEMORY_MB: 128} },
        { resources: {VF: 1}, required: [NET_A] },
        { resources: {VF: 1}, required: [NET_B] }
      ],
      group_policy: subtree_affinity }
  ] }
For those of us not in your head, can you explain how the above is different/better from the following pseudo-query:
resources=DISK_GB;
resources1=VCPU:2,MEMORY_MB:128;
resources2=VF:1;required2=NET_A;group_policy2=subtree_rooter:resources1;
resources3=VF:1;required3=NET_B;group_policy3=subtree_rooter:resources1
Apologies if I'm missing some details. I probably am, that's why I'm asking. I'm intentionally ignoring your "should not do this by making a queryparam ... group numbers" because I didn't fully understand the explanation/reasoning in the discussion on the spec, so I'm after additional explanation (that is, if we have to merge request groups we still need to preserve the distinctiveness of the groups, and if some form of a hierarchical relationship is present we need to maintain that).
In nova, with the bandwidth (and forthcoming accelerator) code in play, the group numbers are generated on the fly and in pretty different sections of the code. But you're right: whether by tracking group numbers or otherwise, we need some way of clustering the groups together. That's going to be tricky on the nova side regardless. Beyond that, I just have a gut ick response to one parameter's *value* referring to another parameter's *key*. That seems dirty/hacky to me. I can probably get over it.
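To make the clustering problem a little more concrete, here is a rough sketch (hypothetical names, not actual nova code): each independently-generated group carries a symbolic label, and the only thing that has to survive until we build the request body is "these groups belong together".

class RequestGroup(object):
    # Hypothetical sketch, not actual nova code.
    def __init__(self, resources, required=None, subtree=None):
        self.resources = resources        # e.g. {'VCPU': 2}
        self.required = required or []    # e.g. ['CUSTOM_PHYSNET_NET_A']
        self.subtree = subtree            # symbolic label, e.g. 'numa0'


def to_body(request_groups):
    """Cluster groups by their symbolic label and emit a body shaped like
    the sample quoted above (un-labelled groups stay on their own)."""
    clusters, loners = {}, []
    for group in request_groups:
        entry = {'resources': group.resources}
        if group.required:
            entry['required'] = group.required
        if group.subtree:
            clusters.setdefault(group.subtree, []).append(entry)
        else:
            loners.append(entry)
    body = {'groups': [{'requests': [entry]} for entry in loners]}
    for entries in clusters.values():
        body['groups'].append({'requests': entries,
                               'group_policy': 'subtree_affinity'})
    return body


request_groups = [
    RequestGroup({'DISK_GB': 10}),
    RequestGroup({'VCPU': 2, 'MEMORY_MB': 128}, subtree='numa0'),
    RequestGroup({'VF': 1}, ['CUSTOM_PHYSNET_NET_A'], subtree='numa0'),
    RequestGroup({'VF': 1}, ['CUSTOM_PHYSNET_NET_B'], subtree='numa0'),
]
print(to_body(request_groups))

Whether the body ends up as JSON or something else, that symbolic clustering is the piece nova has to thread through the bandwidth/accelerator code paths.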
* Inventory-less resource providers
An interesting topic, but IMO not related to $subject. This will come up as soon as we model NUMA + sharing at all. We shouldn't muddy these waters by hashing it out here.
I'm sorry to beat on this drum over and over again, but the reason to have this pre-PTG stuff is exactly to churn up the waters and get all the ideas out in the open so that we are thinking about the entire system, not specific details.
Ack. Perhaps I should have said: we're already discussing it in thread "Resourceless trait filters" [1]. So, lacking any technical connection to the subject of this thread (true since we killed the idea of using a trait) we might as well isolate the discussion there.
How about we do that in a separate thread, then?
Why? See my two paragraphs above and my facepalming in another thread.
I get it, I just feel like a) this topic is intricate enough in its technical aspects, we'll have enough of a challenge reaching consensus without digressing into philosophical tangents; and b) said philosophical tangents are already being explored in other threads (if they're not, we should start them). In other words, my preference would be to keep each thread focused as narrowly as possible. Otherwise we run the risk of losing sight of the original issue. (I literally just now had to glance back up at the subject line to remind myself what this thread was about.) efried [1] http://lists.openstack.org/pipermail/openstack-discuss/2019-April/thread.htm...
Contributing another idea here. Pretty sure I didn't explore all the cases, given my limited vision.

So I'm thinking we can continue to use the query string and build a tree structure with the request group numbers. I know about the numbered request group problem for cyborg and neutron, but I think there must be some way to describe which instance NUMA node the cyborg device will be attached to. So I guess it isn't the fault of the numbered request groups; maybe we are just missing a way to describe that.

For the case in the spec https://review.openstack.org/#/c/650476, an instance with one NUMA node and two VFs from different networks, we can write it as below:

?resources=DISK_GB:10&
 resources1=VCPU:2,MEMORY_MB:128&
 resources1.1=VF:1&required1.1=NET_A&
 resources1.2=VF:1&required1.2=NET_B

Another example: we request an instance with two NUMA nodes, with 2 vCPUs and 128 MB memory in each node. Each node has two VFs coming from different PFs, for HA.

?resources=DISK_GB:10&
 resources1=VCPU:2,MEMORY_MB:128&
 resources1.1=VF:1&
 resources1.2=VF:1&
 resources2=VCPU:2,MEMORY_MB:128&
 resources2.1=VF:1&
 resources2.2=VF:1&
 group_policy=isolate&
 group_policy1=isolate&
 group_policy2=isolate

The `group_policy` ensures resources1 and resources2 aren't coming from the same RP. The `group_policy1` ensures the `resources1.x` aren't coming from the same RP. The `group_policy2` ensures the `resources2.x` aren't coming from the same RP.

For the cyborg case, I think we can propose the flavor extra specs as below:

accel:device_profile.[numa node id]=<profile_name>

Then we will know which instance NUMA node the user hopes the cyborg device will be attached to.

Cyborg only needs to return an un-numbered request group; then Nova, based on all the 'hw:xxx' extra specs and 'accel:device_profile.[numa node id]', will generate a placement request like the above.

For example, if it is a PCI device under the first NUMA node, the extra spec will be 'accel:device_profile.0=<profile_name>'. Cyborg can return a simple request 'resources=CYBORG_PCI_XX_DEVICE:1', and then we merge this into the request group 'resources1=VCPU:2,MEMORY_MB:128,CYBORG_PCI_XX_DEVICE:1'. If the PCI device has a special trait, then cyborg should return the request group as 'resources1=CYBORG_PCI_XX_DEVICE:1&required=SOME_TRAIT', and nova merges this into the placement request as 'resources1.1'.
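Just to illustrate the idea (this is only a sketch, not proposed code; the required1.1/required1.2 spelling is assumed here), the dotted suffixes carry enough information for the server side to rebuild the grouping tree: 'resources1.1' is a child of 'resources1', and the affinity rule is simply that child groups land in the subtree that satisfied the parent.

import pprint
from urllib.parse import parse_qsl


def group_tree(query):
    """Rebuild the parent/child structure implied by the dotted suffixes."""
    tree = {}

    def node(suffix):
        return tree.setdefault(suffix, {'required': [], 'children': []})

    for key, value in parse_qsl(query):
        if key.startswith('resources'):
            node(key[len('resources'):])['resources'] = value
        elif key.startswith('required'):
            node(key[len('required'):])['required'].append(value)
    for suffix in list(tree):
        if '.' in suffix:
            parent = suffix.rsplit('.', 1)[0]
            node(parent)['children'].append(suffix)
    return tree


query = ('resources=DISK_GB:10&'
         'resources1=VCPU:2,MEMORY_MB:128&'
         'resources1.1=VF:1&required1.1=NET_A&'
         'resources1.2=VF:1&required1.2=NET_B')
pprint.pprint(group_tree(query))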
On Mon, 2019-04-15 at 21:04 +0800, Alex Xu wrote:

Contributing another idea here. Pretty sure I didn't explore all the cases, given my limited vision.
So I'm thinking we can continue to use the query string and build a tree structure with the request group numbers. I know about the numbered request group problem for cyborg and neutron, but I think there must be some way to describe which instance NUMA node the cyborg device will be attached to. So I guess it isn't the fault of the numbered request groups; maybe we are just missing a way to describe that.
For the case in the spec https://review.openstack.org/#/c/650476, an instance with one NUMA node and two VFs from different networks, we can write it as below:
?resources=DISK_GB:10&
 resources1=VCPU:2,MEMORY_MB:128&
 resources1.1=VF:1&required1.1=NET_A&
 resources1.2=VF:1&required1.2=NET_B

I'm not sure what NET_A and NET_B correspond to: they are not prefixed with CUSTOM_, which implies they are standard traits, but how would you map dynamically created neutron networks to resource providers as traits? I can see (and have argued for) doing something similar for neutron physnets, as they are mostly static and can be applied by the neutron agent to the RP they create using a CUSTOM_PHYSNET_<physnet name> trait, but I don't see how NET_A would work.
Another example: we request an instance with two NUMA nodes, with 2 vCPUs and 128 MB memory in each node. Each node has two VFs coming from different PFs, for HA.
?resources=DISK_GB:10&
 resources1=VCPU:2,MEMORY_MB:128&
 resources1.1=VF:1&
 resources1.2=VF:1&
 resources2=VCPU:2,MEMORY_MB:128&
 resources2.1=VF:1&
 resources2.2=VF:1&
 group_policy=isolate&
 group_policy1=isolate&
 group_policy2=isolate
This gets messy, as there is no way to express that I have a 2 NUMA node guest and I want a VF from either NUMA node, without changing the grouping and group policies.

If we went down this road then we would have to generate this request dynamically (I'm OK with that), but that would mean the operator should never add resources:... extra specs to the flavor.

Personally I would like to move in the direction of creating the placement queries dynamically and not requiring or allowing operators to specify resources in the flavor, as it's the only way I can see to be able to generate a query like the one above. The main gap I see to enabling that is that we have no NUMA information from neutron with regard to which NUMA node we should attach the VF to, so we can't create the request above without changing the neutron API.
The `group_policy` ensures resources1 and resources2 aren't coming from the same RP. The `group_policy1` ensures the `resources1.x` aren't coming from the same RP. The `group_policy2` ensures the `resources2.x` aren't coming from the same RP.
For the cyborg case, I think we can propose the flavor extra specs as below: accel:device_profile.[numa node id]=<profile_name>
This I think could work short term, but honestly I think we should not do this. In the long term we would want to allow the device_profile to be passed on the nova boot command line, and to manage quota/billing of devices outside of flavors.

We will also want to provide a policy attribute, I think, for virtual-to-host NUMA affinity for devices.

The other aspect is that we currently do not create a PCI root complex per NUMA node; until we do that we can't support requesting a cyborg device per NUMA node. The NUMA node id in accel:device_profile.[numa node id]=<profile_name> should be the guest NUMA node, not a host NUMA node.

Personally I would prefer to create the PCI root complex per NUMA node first and automatically assign the device to the correct root complex, before allowing the end user to request a cyborg device to be attached to a specific guest NUMA node, as I think accel:device_profile.[numa node id]=<profile_name> might be too constraining while also leaking too much host-specific information via our API if it is used to select placement resource providers and therefore host NUMA nodes.
Then we will know which instance NUMA node the user hopes the cyborg device will be attached to.
Cyborg only needs to return an un-numbered request group; then Nova, based on all the 'hw:xxx' extra specs and 'accel:device_profile.[numa node id]', will generate a placement request like the above.
For example, if it is a PCI device under the first NUMA node, the extra spec will be 'accel:device_profile.0=<profile_name>'. Cyborg can return a simple request 'resources=CYBORG_PCI_XX_DEVICE:1', and then we merge this into the request group 'resources1=VCPU:2,MEMORY_MB:128,CYBORG_PCI_XX_DEVICE:1'. If the PCI device has a special trait, then cyborg should return the request group as 'resources1=CYBORG_PCI_XX_DEVICE:1&required=SOME_TRAIT', and nova merges this into the placement request as 'resources1.1'.
Sorry, I missed the mailing list address in my reply, so there is probably discussion missing from the latest email. I'm replying to the list address with those replies included; hopefully other people can catch up on our discussion.

Sean Mooney <smooney@redhat.com> wrote on Mon, Apr 15, 2019 at 9:54 PM:
Contributing another idea here. Pretty sure I didn't explore all the cases, given my limited vision.
So I'm thinking we can continue to use the query string and build a tree structure with the request group numbers. I know about the numbered request group problem for cyborg and neutron, but I think there must be some way to describe which instance NUMA node the cyborg device will be attached to. So I guess it isn't the fault of the numbered request groups; maybe we are just missing a way to describe that.
For the case in the spec https://review.openstack.org/#/c/650476, an instance with one NUMA node and two VFs from different networks, we can write it as below:
?resources=DISK_GB:10&
 resources1=VCPU:2,MEMORY_MB:128&
 resources1.1=VF:1&required1.1=NET_A&
 resources1.2=VF:1&required1.2=NET_B

I'm not sure what NET_A and NET_B correspond to: they are not prefixed with CUSTOM_, which implies they are standard traits, but how would you map dynamically created neutron networks to resource providers as traits? I can see (and have argued for) doing something similar for neutron physnets, as they are mostly static and can be applied by the neutron agent to the RP they create using a CUSTOM_PHYSNET_<physnet name> trait, but I don't see how NET_A would work.
Yes, it is CUSTOM_PHYSNET_NET_A/CUSTOM_PHYSNET_NET_B; I just used a simplified version. But the case I want to show is two VFs from different physical networks.
Another example: we request an instance with two NUMA nodes, with 2 vCPUs and 128 MB memory in each node. Each node has two VFs coming from different PFs, for HA.
?resources=DISK_GB:10&
 resources1=VCPU:2,MEMORY_MB:128&
 resources1.1=VF:1&
 resources1.2=VF:1&
 resources2=VCPU:2,MEMORY_MB:128&
 resources2.1=VF:1&
 resources2.2=VF:1&
 group_policy=isolate&
 group_policy1=isolate&
 group_policy2=isolate
This gets messy, as there is no way to express that I have a 2 NUMA node guest and I want a VF from either NUMA node, without changing the grouping and group policies.
It can be done by:

GET /allocation_candidates?
  resources=DISK_GB:10,VF:1&
  resources1=VCPU:2,MEMORY_MB:128&
  resources2=VCPU:2,MEMORY_MB:128&
  group_policy=isolate

The DISK_GB and VF are in an un-numbered request group, so they may come from any RP in the tree. See http://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/gran... : "The semantic for the (single) un-numbered grouping is unchanged. That is, it may still return results from different RPs in the same tree (or, when “shared” is fully implemented, the same aggregate)."
If we went down this road then we would have to generate this request dynamically (I'm OK with that), but that would mean the operator should never add resources:... extra specs to the flavor.
Yes, I prefer generating it from the extra specs, not asking the operator to write such a complex request by hand. The operator can continue to use the 'resources' extra specs; we can merge those into the generated request.
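Roughly (a sketch only, not nova code), the per-NUMA-node numbered groups could be derived from the flavor like this, and then the operator's 'resources' extra specs and the neutron/cyborg groups get merged in afterwards:

def numbered_numa_groups(vcpus, memory_mb, numa_nodes):
    """Rough sketch (not nova code): split the flavor's CPU and memory
    evenly across the guest NUMA nodes, one numbered group per node."""
    groups = {}
    for node in range(numa_nodes):
        groups['resources%d' % (node + 1)] = {
            'VCPU': vcpus // numa_nodes,
            'MEMORY_MB': memory_mb // numa_nodes,
        }
    return groups


# A 4 vCPU / 256 MB flavor with two guest NUMA nodes yields the
# resources1 and resources2 groups used in the example above.
print(numbered_numa_groups(4, 256, 2))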
Personally I would like to move in the direction of creating the placement queries dynamically and not requiring or allowing operators to specify resources in the flavor, as it's the only way I can see to be able to generate a query like the one above. The main gap I see to enabling that is that we have no NUMA information from neutron with regard to which NUMA node we should attach the VF to, so we can't create the request above without changing the neutron API.
Yes, we are on the same side. For the neutron problem, see below.
The `group_policy` ensures resources1 and resources2 aren't coming from the same RP. The `group_policy1` ensures the `resources1.x` aren't coming from the same RP. The `group_policy2` ensures the `resources2.x` aren't coming from the same RP.
For the cyborg case, I think we can propose the flavor extra specs as below: accel:device_profile.[numa node id]=<profile_name>
This I think could work short term, but honestly I think we should not do this. In the long term we would want to allow the device_profile to be passed on the nova boot command line, and to manage quota/billing of devices outside of flavors.
We can also allow specifying the guest NUMA node id in the boot command. But what I want to say is that the problem is we are missing a way to specify that info for neutron and cyborg. The other proposals in the spec don't resolve this problem either, and I think this problem isn't the fault of the request group numbers.
We will also want to provide a policy attribute, I think, for virtual-to-host NUMA affinity for devices.
The other aspect is that we currently do not create a PCI root complex per NUMA node; until we do that we can't support requesting a cyborg device per NUMA node. The NUMA node id in accel:device_profile.[numa node id]=<profile_name> should be the guest NUMA node, not a host NUMA node.
Yes, the "[numa node id]" in "accel:device_profile.[numa node id]" is the guest NUMA node id. Just like in other extra specs, e.g. "hw:cpus.0=1,2", we are using the guest NUMA node id in those extra specs.
Personally I would prefer to create the PCI root complex per NUMA node first and automatically assign the device to the correct root complex, before allowing the end user to request a cyborg device to be attached to a specific guest NUMA node, as I think accel:device_profile.[numa node id]=<profile_name> might be too constraining while also leaking too much host-specific information via our API if it is used to select placement resource providers and therefore host NUMA nodes.
Then we will know which instance NUMA node the user hopes the cyborg device will be attached to.
Cyborg only needs to return an un-numbered request group; then Nova, based on all the 'hw:xxx' extra specs and 'accel:device_profile.[numa node id]', will generate a placement request like the above.
For example, if it is a PCI device under the first NUMA node, the extra spec will be 'accel:device_profile.0=<profile_name>'. Cyborg can return a simple request 'resources=CYBORG_PCI_XX_DEVICE:1', and then we merge this into the request group 'resources1=VCPU:2,MEMORY_MB:128,CYBORG_PCI_XX_DEVICE:1'. If the PCI device has a special trait, then cyborg should return the request group as 'resources1=CYBORG_PCI_XX_DEVICE:1&required=SOME_TRAIT', and nova merges this into the placement request as 'resources1.1'.
From: Alex Xu <soulxu@gmail.com> Sent: Monday, April 15, 2019 4:50 PM
Cyborg only needs to return an un-numbered request group; then Nova, based on all the 'hw:xxx' extra specs and 'accel:device_profile.[numa node id]', will generate a placement request like the above.
I am not quite following the idea(s) proposed here. Cyborg returns only the device-related request groups. The un-numbered request group in the flavor is not touched by Cyborg. Secondly, if you use the ‘accel:’ stuff in the flavor to decide NUMA affinity, how will you pass that to Placement? This thread is about the syntax of the GET /a-c call.
For example, if it is a PCI device under the first NUMA node, the extra spec will be 'accel:device_profile.0=<profile_name>'. Cyborg can return a simple request 'resources=CYBORG_PCI_XX_DEVICE:1', and then we merge this into the request group 'resources1=VCPU:2,MEMORY_MB:128,CYBORG_PCI_XX_DEVICE:1'. If the PCI device has a special trait, then cyborg should return the request group as 'resources1=CYBORG_PCI_XX_DEVICE:1&required=SOME_TRAIT', and nova merges this into the placement request as 'resources1.1'.
Sorry, I don’t follow this either. The request groups have entries like ‘resources:CUSTOM_FOO=1’, not 'resources=CYBORG_PCI_XX_DEVICE:1'. So, I don’t see where to stick the NUMA node #.

Anyways, for Cyborg, it seems to me that there is a fairly straightforward scheme to address NUMA affinity: annotate the device’s nested RP with a trait indicating which NUMA node it belongs to (e.g. CUSTOM_NUMA_NODE_0), and use that to guide scheduling. This should be a valid use of traits because it expresses a property of the resource provider and is used for scheduling (only).

As for how the annotation is done, it could be automated. The operator’s tool that configures a device to affinitize with a NUMA node (by setting MSI-X vectors, etc.) also invokes a Cyborg API (yet to be written) with the NUMA node # -- that would identify the device RP and update Placement with that trait. The tool needs to ensure that the device has been discovered by Cyborg and updated in Placement before invoking the API.

Regards,
Sundar
Nadathur, Sundar <sundar.nadathur@intel.com> wrote on Thu, Apr 25, 2019 at 5:59 PM:
From: Alex Xu <soulxu@gmail.com> Sent: Monday, April 15, 2019 4:50 PM
Cyborg only needs to return an un-numbered request group; then Nova, based on all the 'hw:xxx' extra specs and 'accel:device_profile.[numa node id]', will generate a placement request like the above.
I am not quite following the idea(s) proposed here. Cyborg returns only the device-related request groups. The un-numbered request group in the flavor is not touched by Cyborg.
Secondly, if you use the ‘accel:’ stuff in the flavor to decide NUMA affinity, how will you pass that to Placement? This thread is about the syntax of the GET /a-c call.
The point here is that we need some way to enable cyborg to tell nova which guest NUMA node the device will be attached to. I don't think we should encode the device's guest NUMA affinity info in the request group that Cyborg returns. The nova flavor is the one that tells that, and it passes the affinity requirement into the GET /a-c call.
For example, if it is a PCI device under the first NUMA node, the extra spec will be 'accel:device_profile.0=<profile_name>'. Cyborg can return a simple request 'resources=CYBORG_PCI_XX_DEVICE:1', and then we merge this into the request group 'resources1=VCPU:2,MEMORY_MB:128,CYBORG_PCI_XX_DEVICE:1'. If the PCI device has a special trait, then cyborg should return the request group as 'resources1=CYBORG_PCI_XX_DEVICE:1&required=SOME_TRAIT', and nova merges this into the placement request as 'resources1.1'.
Sorry, I don’t follow this either. The request groups have entries like ‘resources:CUSTOM_FOO=1’, not 'resources=CYBORG_PCI_XX_DEVICE:1'. So, I don’t see where to stick the NUMA node #.
Anyways, for Cyborg, it seems to me that there is a fairly straightforward scheme to address NUMA affinity: annotate the device’s nested RP with a trait indicating which NUMA node it belongs to (e.g. CUSTOM_NUMA_NODE_0), and use that to guide scheduling. This should be a valid use of traits because it expresses a property of the resource provider and is used for scheduling (only).
I don't like the way of using trait to mark out the NUMA node.
As for how the annotation is done, it could be automated. The operator’s tool that configures a device to affinitize with a NUMA node (by setting MSI-X vectors, etc.) also invokes a Cyborg API (yet to be written) with the NUMA node # -- that would identify the device RP and update Placement with that trait. The tool needs to ensure that the device has been discovered by Cyborg and updated in Placement before invoking the API.
What I'm talking about here is which guest NUMA node the virtual device is attached to. It isn't about a physical device's affinity to a host NUMA node.
Regards,
Sundar
On 04/26/2019 08:49 PM, Alex Xu wrote:
Nadathur, Sundar <sundar.nadathur@intel.com> wrote: Anyways, for Cyborg, it seems to me that there is a fairly straightforward scheme to address NUMA affinity: annotate the device’s nested RP with a trait indicating which NUMA node it belongs to (e.g. CUSTOM_NUMA_NODE_0), and use that to guide scheduling. This should be a valid use of traits because it expresses a property of the resource provider and is used for scheduling (only).
I don't like the way of using trait to mark out the NUMA node.
Me neither. Traits are capabilities, not indicators of the relationship between one provider and another.

The structure of hierarchical resource providers is what provides topology information -- i.e. about how providers are related to each other within a tree organization, and this is what is appropriate for encoding NUMA topology information into placement.

The request should never ask for "NUMA Node 0". The reason is because the request shouldn't require that the user understand where the resources are. It shouldn't matter *which* NUMA node a particular device that is providing some resources is affined to. The only thing that matters to a *request* is that the user is able to describe the nature of the affinity.

I propose using a "group_policy=same_tree:$GROUP_A:$GROUP_B" query parameter for enabling users to describe the affinity constraints for various resources involved in different RequestGroups in the request spec.

group_policy=same_tree:$A:$B would mean "ensure that the providers that match the constraints of request group $B are in the same inclusive tree that matched for request group $A"

So, let's say you have a flavor that will consume:

2 dedicated host CPU processors
4GB RAM
1 context/handle for an accelerator running a crypto algorithm

Further, you want to ensure that the provider tree that is providing those dedicated CPUs and RAM will also provide the accelerator context -- in other words, you are requesting a low level of latency between the memory and the accelerator device itself.

The above request to GET /a_c would look like this:

GET /a_c?
  resources1=PCPU:2,MEMORY_MB:4096&
  resources2=ACCELERATOR_CONTEXT:1&
  required2=CUSTOM_BITSTREAM_CRYPTO_4AC1&
  group_policy=same_tree:1:2

which would mean, in English, "get me an accelerator context from an FPGA that has been flashed with the 4AC1 crypto bitstream and is affined to the NUMA node that is providing 4G of main memory and 2 dedicated host processors".

Best,
-jay
Hi Jay and Alex, Thanks for the response. Please see below. Regards, Sundar
-----Original Message----- From: Jay Pipes <jaypipes@gmail.com> Sent: Saturday, April 27, 2019 8:52 AM To: openstack-discuss@lists.openstack.org Subject: Re: [placement][nova][ptg] resource provider affinity
On 04/26/2019 08:49 PM, Alex Xu wrote:
Nadathur, Sundar <sundar.nadathur@intel.com> wrote: Anyways, for Cyborg, it seems to me that there is a fairly straightforward scheme to address NUMA affinity: annotate the device’s nested RP with a trait indicating which NUMA node it belongs to (e.g. CUSTOM_NUMA_NODE_0), and use that to guide scheduling. This should be a valid use of traits because it expresses a property of the resource provider and is used for scheduling (only).
I don't like the way of using trait to mark out the NUMA node.
Me neither. Traits are capabilities, not indicators of the relationship between one provider and another.
The structure of hierarchical resource providers is what provides topology information -- i.e. about how providers are related to each other within a tree organization, and this is what is appropriate for encoding NUMA topology information into placement.
The request should never ask for "NUMA Node 0". The reason is because the request shouldn't require that the user understand where the resources are.
I agree with this for most use cases. However, there are specific cases where a reference architecture is laid out for a specific workload, which requires a specific number of VMs to be placed in each NUMA node, with specific number of devices (NICs or accelerators) assigned to them. The network bandwidth, computation load, etc. are all pre-calculated to fit the VM's size and device characteristics. Any departure from that may affect workload performance -- throughput, latency or jitter. However, if the request says, 'Give me a VM on _a_ NUMA node, I don't care which one', one may wind up with say 3 VMs on one NUMA node and 1 VM on the other, which is not the intended outcome.

One could argue that we should model all resources, such as PCIe lanes/bandwidth from a socket (not the same as NUMA node), to the point where we can influence the exact placement among NUMA nodes. This has several issues, IMHO:

* This is more tied to the hardware details.
* Many of these resources are not dedicated or partitionable among VMs, e.g. PCIe lanes. I don’t see how we can track and count them in Placement on per-VM basis.
* It is more complex for both developers and operators.

In this situation, the operator is willing (in my understanding) to phrase the request precisely to get the exact desired layout.
It shouldn't matter *which* NUMA node a particular device that is providing some resources is affined to. The only thing that matters to a *request* is that the user is able to describe the nature of the affinity.
I propose using a "group_policy=same_tree:$GROUP_A:$GROUP_B" query parameter for enabling users to describe the affinity constraints for various resources involved in different RequestGroups in the request spec.
group_policy=same_tree:$A:$B would mean "ensure that the providers that match the constraints of request group $B are in the same inclusive tree that matched for request group $A"
Request groups from Neutron and Cyborg do not have any inherent group numbers; Nova assigns those group numbers before submission to Placement. So, the GET /a-c call could technically have such numbers, but how would Neutron or Cyborg express that affinity?
So, let's say you have a flavor that will consume:
2 dedicated host CPU processors 4GB RAM 1 context/handle for an accelerator running a crypto algorithm
Further, you want to ensure that the provider tree that is providing those dedicated CPUs and RAM will also provide the accelerator context -- in other words, you are requesting a low level of latency between the memory and the accelerator device itself.
The above request to GET /a_c would look like this:
GET /a_c?
  resources1=PCPU:2,MEMORY_MB:4096&
  resources2=ACCELERATOR_CONTEXT:1&
  required2=CUSTOM_BITSTREAM_CRYPTO_4AC1&
  group_policy=same_tree:1:2
which would mean, in English, "get me an accelerator context from an FPGA that has been flashed with the 4AC1 crypto bitstream and is affined to the NUMA node that is providing 4G of main memory and 2 dedicated host processors".
Best, -jay
Nadathur, Sundar <sundar.nadathur@intel.com> wrote on Sat, Apr 27, 2019 at 12:29 PM:
Hi Jay and Alex, Thanks for the response. Please see below.
Regards, Sundar
-----Original Message----- From: Jay Pipes <jaypipes@gmail.com> Sent: Saturday, April 27, 2019 8:52 AM To: openstack-discuss@lists.openstack.org Subject: Re: [placement][nova][ptg] resource provider affinity
On 04/26/2019 08:49 PM, Alex Xu wrote:
Nadathur, Sundar <sundar.nadathur@intel.com> wrote: Anyways, for Cyborg, it seems to me that there is a fairly straightforward scheme to address NUMA affinity: annotate the device’s nested RP with a trait indicating which NUMA node it belongs to (e.g. CUSTOM_NUMA_NODE_0), and use that to guide scheduling. This should be a valid use of traits because it expresses a property of the resource provider and is used for scheduling (only).
I don't like the way of using trait to mark out the NUMA node.
Me neither. Traits are capabilities, not indicators of the relationship between one provider and another.
The structure of hierarchical resource providers is what provides topology information -- i.e. about how providers are related to each other within a tree organization, and this is what is appropriate for encoding NUMA topology information into placement.
The request should never ask for "NUMA Node 0". The reason is because the request shouldn't require that the user understand where the resources are.
I agree with this for most use cases. However, there are specific cases where a reference architecture is laid out for a specific workload, which requires a specific number of VMs to be placed in each NUMA node, with specific number of devices (NICs or accelerators) assigned to them. The network bandwidth, computation load, etc. are all pre-calculated to fit the VM's size and device characteristics. Any departure from that may affect workload performance -- throughput, latency or jitter. However, if the request says, 'Give me a VM on _a_ NUMA node, I don't care which one', one may wind up with say 3 VMs on one NUMA node and 1 VM on the other, which is not the intended outcome.
We can control the number of available shared/dedicated vCPUs in a NUMA node, so people will know how many VMs fit in each NUMA node. Affinitizing multiple VMs into the same NUMA node would be another thing: affinity between VMs.
One could argue that we should model all resources, such as PCIe lanes/bandwidth from a socket (not the same as NUMA node), to the point where we can influence the exact placement among NUMA nodes. This has several issues, IMHO:

* This is more tied to the hardware details.
* Many of these resources are not dedicated or partitionable among VMs, e.g. PCIe lanes. I don’t see how we can track and count them in Placement on per-VM basis.
* It is more complex for both developers and operators.
I don't think that is the purpose of modeling sockets in placement.
In this situation, the operator is willing (in my understanding) to phrase the request precisely to get the exact desired layout.
It shouldn't matter *which* NUMA node a particular device that is providing some resources is affined to. The only thing that matters to a *request* is that the user is able to describe the nature of the affinity.
I propose using a "group_policy=same_tree:$GROUP_A:$GROUP_B" query parameter for enabling users to describe the affinity constraints for various resources involved in different RequestGroups in the request spec.
group_policy=same_tree:$A:$B would mean "ensure that the providers that match the constraints of request group $B are in the same inclusive tree that matched for request group $A"
Request groups from Neutron and Cyborg do not have any inherent group numbers; Nova assigns those group numbers before submission to Placement. So, the GET /a-c call could technically have such numbers, but how would Neutron or Cyborg express that affinity?
Yea, that is what I'm saying, it should be in the nova flavor.
So, let's say you have a flavor that will consume:
2 dedicated host CPU processors 4GB RAM 1 context/handle for an accelerator running a crypto algorithm
Further, you want to ensure that the provider tree that is providing those dedicated CPUs and RAM will also provide the accelerator context -- in other words, you are requesting a low level of latency between the memory and the accelerator device itself.
The above request to GET /a_c would look like this:
GET /a_c?
  resources1=PCPU:2,MEMORY_MB:4096&
  resources2=ACCELERATOR_CONTEXT:1&
  required2=CUSTOM_BITSTREAM_CRYPTO_4AC1&
  group_policy=same_tree:1:2
which would mean, in English, "get me an accelerator context from an FPGA that has been flashed with the 4AC1 crypto bitstream and is affined to the NUMA node that is providing 4G of main memory and 2 dedicated host processors".
Best, -jay
On Sat, 27 Apr 2019, Jay Pipes wrote:
The request should never ask for "NUMA Node 0". The reason is because the request shouldn't require that the user understand where the resources are.
It shouldn't matter *which* NUMA node a particular device that is providing some resources is affined to. The only thing that matters to a *request* is that the user is able to describe the nature of the affinity.
Yes, very much yes to these two paragraphs. See also: http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005682.htm...
I propose using a "group_policy=same_tree:$GROUP_A:$GROUP_B" query parameter for enabling users to describe the affinity constraints for various resources involved in different RequestGroups in the request spec.
At first glance this seems pretty reasonable. Does anyone hate it? -- Chris Dent ٩◔̯◔۶ https://anticdent.org/ freenode: cdent tw: @anticdent
I propose using a "group_policy=same_tree:$GROUP_A:$GROUP_B" query parameter for enabling users to describe the affinity constraints for various resources involved in different RequestGroups in the request spec.
At first glance this seems pretty reasonable. Does anyone hate it?
We've talked about this previously. The two objections raised were:

a) It assumes the meaning of "same tree" is "one level down from the root". This satisfies NUMA affinity, and also allows us to do things like [1] (scroll down to the pretty picture where networking agents are subtree roots). But it may prove too limiting in the future if, for example, we need to represent sockets *under* NUMA nodes and do L3 cache affinity. (Come to think of it, if we need to do [1] in the presence of NUMA and need to affine the network devices to the CPUs, what would that whole tree look like, and how would it affect this proposal?)

b) It assumes the various pieces of the request (flavor, image, port, device profile) are able to know each others' request group numbers ahead of time. Or we need provide some other mechanism for the scheduler code that dynamically assigns the numbers [2] to understand which ones need to be (sub)grouped together. IIUC this has been Sundar's main objection.

efried

[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005111.htm...
[2] https://opendev.org/openstack/nova/src/branch/master/nova/scheduler/utils.py...
On Sun, 28 Apr 2019, Eric Fried wrote:
We've talked about this previously. The two objections raised were:
a) It assumes the meaning of "same tree" is "one level down from the root".
Does it? I had casually interpreted "group_policy=same_tree:$GROUP_A:$GROUP_B" as meaning '$GROUP_B is somewhere within the tree rooted at $GROUP_A at any level' but it could just as easily be interpreted a few different ways, including what you say.
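In toy form (not placement code, and glossing over how it would actually be done in SQL), that casual interpretation is just an ancestor check against the provider tree:

def ancestors(rp, parent_of):
    """Yield rp and each of its ancestors up to the root."""
    while rp is not None:
        yield rp
        rp = parent_of.get(rp)


def same_tree(candidate, group_a, group_b, parent_of):
    """candidate maps group id -> the provider chosen for that group."""
    return candidate[group_a] in ancestors(candidate[group_b], parent_of)


# Toy provider tree: two NUMA nodes under the compute node, an FPGA
# region under NUMA node 0.
parent_of = {'numa0': 'compute', 'numa1': 'compute', 'fpga0': 'numa0'}
print(same_tree({'1': 'numa0', '2': 'fpga0'}, '1', '2', parent_of))  # True
print(same_tree({'1': 'numa1', '2': 'fpga0'}, '1', '2', parent_of))  # False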
b) It assumes the various pieces of the request (flavor, image, port, device profile) are able to know each others' request group numbers ahead of time. Or we need provide some other mechanism for the scheduler code that dynamically assigns the numbers [2] to understand which ones need to be (sub)grouped together. IIUC this has been Sundar's main objection.
As I understand things, this is going to be a problem in most of the proposals, for at least one of the many participants in the interactions that lead to a complex workload landing. Jay suggested extending the JSON schema to allow groups that are names like resources_compute, required_network. That might allow for some conventions to emerge but still requires some measure of knowledge from the participants. I suspect some form of knowledge is going to be needed. Limiting it would be good. Also good is making sure that from placement's standpoint the knowledge is merely symbolic. -- Chris Dent ٩◔̯◔۶ https://anticdent.org/ freenode: cdent tw: @anticdent
Hi Chris,
Hope you saw my email in this thread on Sat. Using group numbers is not good because [1]:

* Request groups from Neutron for bandwidth providers and from Cyborg device profiles will not have explicit group numbers.
* Neutron/Cyborg devices should be affined with VCPU/memory, but we shouldn't assume the number of the RG that asks for VCPU/memory.
* When those request groups are merged with those from the flavor, the flavor group numbering *could* change. To avoid that, Nova would have to make an explicit guarantee that the Neutron/Cyborg groups get added after the flavor RGs. I am not advocating this.

Also, we need some solution for directed affinitizing of a specific number of VMs with NUMA nodes, as mentioned in my earlier email.

[1] Taken from my comment on https://review.opendev.org/#/c/650476/

Regards,
Sundar
Chris Dent <cdent+os@anticdent.org> wrote on Sun, Apr 28, 2019 at 10:14 PM:
On Sun, 28 Apr 2019, Eric Fried wrote:
We've talked about this previously. The two objections raised were:
a) It assumes the meaning of "same tree" is "one level down from the root".
Does it? I had casually interpreted "group_policy=same_tree:$GROUP_A:$GROUP_B" as meaning '$GROUP_B is somewhere within the tree rooted at $GROUP_A at any level' but it could just as easily be interpreted a few different ways, including what you say.
b) It assumes the various pieces of the request (flavor, image, port, device profile) are able to know each others' request group numbers ahead of time. Or we need provide some other mechanism for the scheduler code that dynamically assigns the numbers [2] to understand which ones need to be (sub)grouped together. IIUC this has been Sundar's main objection.
As I understand things, this is going to be a problem in most of the proposals, for at least one of the many participants in the interactions that lead to a complex workload landing.
Jay suggested extending the JSON schema to allow groups that are names like resources_compute, required_network. That might allow for some conventions to emerge but still requires some measure of knowledge from the participants.
I thought placement, cyborg, neutron, etc. don't know what is being built. Placement doesn't know what it is building from 'GET /a_c'; it just returns the right RPs matching the request. Cyborg and neutron only return a device or a port requirement. So only Nova knows we are building a VM, and Nova should know the affinity of those resources.
I suspect some form of knowledge is going to be needed. Limiting it would be good.
Also good is making sure that from placement's standpoint the knowledge is merely symbolic.
-- Chris Dent ٩◔̯◔۶ https://anticdent.org/ freenode: cdent tw: @anticdent
On Apr 28, 2019, at 11:02 PM, Alex Xu <soulxu@gmail.com> wrote:
I thought placement, cyborg, neutron, etc. don't know what is being built. Placement doesn't know what it is building from 'GET /a_c'; it just returns the right RPs matching the request.
This is an important point that seems to get pushed aside in order to “get feature X done”. -- Ed Leafe
a) It assumes the meaning of "same tree" is "one level down from the root".
Does it? I had casually interpreted "group_policy=same_tree:$GROUP_A:$GROUP_B" as meaning '$GROUP_B is somewhere within the tree rooted at $GROUP_A at any level' but it could just as easily be interpreted a few different ways, including what you say.
If I interpret that ^ correctly, it would require $GROUP_A (the subtree root) to provide resources, a scenario for which we have at least one counterexample (the one with network agents as resourceless providers).
Jay suggested extending the JSON schema to allow groups that are names like resources_compute, required_network. That might allow for some conventions to emerge but still requires some measure of knowledge from the participants.
I think this is a good idea to pursue, because it gives us a way to predefine (by convention) what the groups are called, as opposed to having them be automatically, arbitrarily, unpredictably numbered. It'll still break down in more complex scenarios where, say, there's more than one device_group with different affinity requirements; but it could work for the simpler setups.
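As a strawman only (the spelling here is entirely invented, nothing agreed), the conventional names might look something like this on the wire:

?resources_compute=VCPU:2,MEMORY_MB:128&
 resources_port_a=VF:1&required_port_a=CUSTOM_PHYSNET_NET_A&
 resources_device=FPGA:1&
 group_policy=same_tree:_compute:_port_a:_device

The point being that a port or a device profile could pick its own group name ('_port_a', '_device') without ever needing to know what numbers or names anything else in the request ended up with.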
Also good is making sure that from placement's standpoint the knowledge is merely symbolic.
Sure. The names would be just as arbitrary/symbolic as the group numbers are today. And since we're talking about reporting the group_num/resource_provider association, those symbols need to survive for the duration of the GET /a_c operation, but no longer than that. efried .
On Mon, 29 Apr 2019, Eric Fried wrote:
a) It assumes the meaning of "same tree" is "one level down from the root".
Does it? I had casually interpreted "group_policy=same_tree:$GROUP_A:$GROUP_B" as meaning '$GROUP_B is somewhere within the tree rooted at $GROUP_A at any level' but it could just as easily be interpreted a few different ways, including what you say.
If I interpret that ^ correctly, it would require $GROUP_A (the subtree root) to provide resources, a scenario for which we have at least one counterexample (the one with network agents as resourceless providers).
Why would it require GROUP_A to provide resources? Haven't we already established that we're going to need to lighten the requirement that 'requiredN' must have a 'resourcesN'? If we haven't, perhaps this is the thing that will push us that way? -- Chris Dent ٩◔̯◔۶ https://anticdent.org/ freenode: cdent tw: @anticdent
Why would it require GROUP_A to provide resources? Haven't we already established that we're going to need to lighten the requirement that 'requiredN' must have a 'resourcesN'? If we haven't, perhaps this is the thing that will push us that way?
Oic, so group A could be a (potentially resourceless) provider with a trait like NUMA_NODE or SRIOV_NET_AGENT. If we can get by the other issue - to which I shall henceforth refer as "inter-group referencing" - I could get behind this. Did we decide on "traits (and/or aggregates) flow down" too? I'm losing track of how all these things interact and which combinations are necessary to solve which use cases. efried .
On Mon, 29 Apr 2019, Eric Fried wrote:
Did we decide on "traits (and/or aggregates) flow down" too? I'm losing track of how all these things interact and which combinations are necessary to solve which use cases.
I agree that it is getting hard to track. It seems that at least both Jay and I are interested in seeing if "X flow down" is workable. What would make sense, to me, is to form a coherent model that captures these ideas in a consistent fashion, and see which use cases it can satisfy well, which it cannot, and whether those it cannot can be covered by some other solution (or dismissed (as not cloudy?)). The ideas seem to be:

* X flow down
* same_tree:$GROUP_A:$GROUP_B group policy referencing
* resource-less resource providers (and thus request groups with requireds, but not resources)

Does that jibe with what other people have been reading and thinking?

Note that I don't think we should be looking for the perfect 100% solution here. What we should be looking for is a good model that makes it easier to satisfy some large percentage of the use cases efficiently (both in running the solution and creating it). Sometimes you can't do everything.

-- Chris Dent ٩◔̯◔۶ https://anticdent.org/ freenode: cdent tw: @anticdent
On Tue, Apr 30, 2019 at 2:35 AM, Eric Fried <openstack@fried.cc> wrote:
Did we decide on "traits (and/or aggregates) flow down" too? I'm losing track of how all these things interact and which combinations are necessary to solve which use cases.
Can we start collecting use cases in a spec doc with provider tree examples and a_c queries against those trees written in plain English? I'm happy to contribute to that spec with the bandwidth related examples. gibi
Sorry for the late response. Here are my thoughts on "resource provider affinity".

“The rps are in a same subtree” is equivalent to “there exists an rp which is an ancestor of all the other rps”.

Therefore:

* group_resources=1:2 means “rp2 is a descendant of rp1 (or rp1 is a descendant of rp2).”

We can extend it to cases where we have more than two groups:

* group_resources=1:2:3 means "both rp2 and rp3 are descendants of rp1 (or both rp1 and rp3 are of rp2, or both rp1 and rp2 are of rp3)".

Eric's question from the PTG yesterday was whether to keep the symmetry between rps, that is, whether to take the conditions enclosed in the parentheses above. I would say yes, keep the symmetry, because:

1. the expression 1:2:3 is more symmetric. If we want to make it asymmetric, it should express the subtree root more explicitly, like 1-2:3 or 1-2:3:4.
2. callers may not be aware of which resource (VCPU or VF) is provided by the upper/lower rp. IOW, the caller - the resource retriever (scheduler) - doesn't want to know how the reporter - the virt driver - has reported the resources.

Note that even in the symmetric world the negative expression Jay suggested looks good to me. It enables something like:

* group_resources=1:2:!3:!4

which means 1 and 2 should be in the same group, but 3 shouldn't be a descendant of 1 or 2, and likewise for 4.

However, speaking at the design level, the adjacency list model (the so-called naive tree model), which we currently use for nested rps, is not good at retrieving subtrees (compared to e.g. the nested set model [1]).

[1] https://en.wikipedia.org/wiki/Nested_set_model

I have looked into the recursive SQL CTE (common table expression) feature, which would help us handle subtrees easily in the adjacency list model, in an experimental patch [2], but unfortunately it looks like the feature is still experimental in MySQL, and we don't want to run a query like this for every candidate, do we? :(

[2] https://review.opendev.org/#/c/636092/

Therefore, for this specific use case of NUMA affinity I'd like to alternatively propose bringing in a concept of resource group distance in the rp graph:

* numa affinity case - group_distance(1:2)=1
* anti numa affinity - group_distance(1:2)>1

which can be realized by looking at the cached adjacent rp (i.e. the parent id). (Supporting group_distance=N (N>1) would be future research, or we could implement it anyway and overlook the performance.)

One drawback of this is that we can't use it if you create multiple nested layers with more than 1 depth under NUMA rps, but is that the case for OvS bandwidth?

Another alternative is having a "closure table" from which we can retrieve all the descendant rp ids of an rp without joining tables. But... online migration cost?

- tetsuro
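To make the symmetric definition concrete, here is a minimal Python sketch (purely illustrative; the function names and the parent-id mapping are assumptions, not placement code) of the "there exists an rp which is an ancestor of all the other rps" check, plus the group_distance idea:

# Illustrative only: assume a mapping of rp id -> parent rp id (None for a root),
# e.g. extracted from the provider summaries of an allocation candidates response.

def ancestors(rp_id, parent_of):
    """Return a set containing rp_id and all of its ancestors."""
    result = {rp_id}
    while parent_of.get(rp_id) is not None:
        rp_id = parent_of[rp_id]
        result.add(rp_id)
    return result

def same_subtree(rp_ids, parent_of):
    """Symmetric check: some rp in rp_ids is an ancestor of all the others."""
    return any(
        all(candidate in ancestors(other, parent_of) for other in rp_ids)
        for candidate in rp_ids
    )

def group_distance(rp_a, rp_b, parent_of):
    """Number of edges between the two rps if one is an ancestor of the
    other, else None (a hypothetical helper mirroring the proposal)."""
    for upper, lower in ((rp_a, rp_b), (rp_b, rp_a)):
        distance, current = 0, lower
        while current is not None:
            if current == upper:
                return distance
            current = parent_of.get(current)
            distance += 1
    return None

# Tiny example: two NUMA nodes under a compute node, a PF under NUMA1.
parent_of = {"cn": None, "numa0": "cn", "numa1": "cn", "pf1": "numa1"}
assert same_subtree({"numa1", "pf1"}, parent_of)      # pf1 is under numa1
assert not same_subtree({"numa0", "pf1"}, parent_of)  # neither is an ancestor of the other
assert group_distance("numa1", "pf1", parent_of) == 1

Whether a check like this belongs in SQL or in Python after the initial selects is exactly the trade-off discussed in the replies below.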
“The rps are in a same subtree” is equivalent to “there exists an rp which is an ancestor of all the other rps”
++
I would say yes keep the symmetry because
1. the expression 1:2:3 is more symmetric. If we want to make it asymmetric, it should express the subtree root more explicitly, like 1-2:3 or 1-2:3:4. 2. callers may not be aware of which resource (VCPU or VF) is provided by the upper/lower rp. IOW, the caller - the resource retriever (scheduler) - doesn't want to know how the reporter - the virt driver - has reported the resources.
This. (If we were going to do asymmetric, I agree we would need a clearer syntax. Another option I thought of was same_subtree1=2,3,!4. But still prefer symmetric.)
It enables something like: * group_resources=1:2:!3:!4 which means 1 and 2 should be in the same group, but 3 shouldn't be a descendant of 1 or 2, and likewise for 4.
In a symmetric world, this one is a little ambiguous to me. Does it mean 4 shouldn't be in the same subtree as 3 as well?
However, speaking in the design level, the adjacency list model (so called naive tree model), which we currently use for nested rps, is not good at retrieving subtrees <snip>
Based on my limited understanding, we may want to consider at least initially *not* trying to do this in sql. We can gather the candidates as we currently do and then filter them afterward in python (somewhere in the _merge_candidates flow).
One drawback of this is that we can't use this if you create multiple nested layers with more than 1 depth under NUMA rps, but is that the case for OvS bandwidth?
If the restriction is because "the SQL is difficult", I would prefer not to introduce a "distance" concept. We've come up with use cases where the nesting isn't simple.
Another alternative is having a "closure table" from where we can retrieve all the descendent rp ids of an rp without joining tables. but... online migration cost?
Can we consider these optimizations later, if the python-side solution proves non-performant? efried .
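As a rough sketch of what filtering "afterward in python" could look like (entirely hypothetical data shapes; the real _merge_candidates code in placement is organized differently), one could drop any merged candidate whose providers for the affined suffixes fail the same-subtree test:

# Illustrative only. Each candidate is represented here as a dict mapping a
# request-group suffix to the rp chosen for it; parent_of maps rp -> parent.

def _ancestors(rp_id, parent_of):
    seen = {rp_id}
    while parent_of.get(rp_id) is not None:
        rp_id = parent_of[rp_id]
        seen.add(rp_id)
    return seen

def _same_subtree(rp_ids, parent_of):
    # Symmetric check, same as the earlier sketch in this thread.
    return any(
        all(c in _ancestors(o, parent_of) for o in rp_ids) for c in rp_ids)

def filter_same_subtree(candidates, affined_suffixes, parent_of):
    """Keep only candidates whose providers for the affined request-group
    suffixes satisfy the (symmetric) same-subtree condition."""
    return [c for c in candidates
            if _same_subtree({c[s] for s in affined_suffixes}, parent_of)]

# Example using a made-up NIC model: two PFs under each NIC, NICs under the node.
parent_of = {"cn": None, "nic1": "cn", "nic2": "cn",
             "pf1_1": "nic1", "pf1_2": "nic1",
             "pf2_1": "nic2", "pf2_2": "nic2"}
candidates = [
    {"_T1": "pf1_1", "_T2": "pf1_2", "_T3": "nic1"},  # all under NIC1: kept
    {"_T1": "pf1_1", "_T2": "pf2_2", "_T3": "nic1"},  # spans both NICs: dropped
]
print(filter_same_subtree(candidates, ["_T1", "_T2", "_T3"], parent_of))

The appeal of this approach, as noted later in the thread, is that no extra SQL is needed: the parent relationships are already available from the provider summaries gathered for the candidates.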
On Fri, May 3, 2019 at 9:24 AM Eric Fried <openstack@fried.cc> wrote:
“The rps are in a same subtree” is equivalent to “there exists an rp which is an ancestor of all the other rps”
++
I would say yes keep the symmetry because
1. the expression 1:2:3 is more symmetric. If we want to make it asymmetric, it should express the subtree root more explicitly, like 1-2:3 or 1-2:3:4. 2. callers may not be aware of which resource (VCPU or VF) is provided by the upper/lower rp. IOW, the caller - the resource retriever (scheduler) - doesn't want to know how the reporter - the virt driver - has reported the resources.
This.
(If we were going to do asymmetric, I agree we would need a clearer syntax. Another option I thought of was same_subtree1=2,3,!4. But still prefer symmetric.)
It enables something like: * group_resources=1:2:!3:!4 which means 1 and 2 should be in the same group, but 3 shouldn't be a descendant of 1 or 2, and likewise for 4.
In a symmetric world, this one is a little ambiguous to me. Does it mean 4 shouldn't be in the same subtree as 3 as well?
First, thanks Tetsuro for investigating ways to support such queries. Very much appreciated. I hope I can dedicate some time this cycle to see whether I could help with implementing NUMA affinity, as I see myself as the first consumer of such a thing :-)
However, speaking in the design level, the adjacency list model (so called naive tree model), which we currently use for nested rps, is not good at retrieving subtrees <snip>
Based on my limited understanding, we may want to consider at least initially *not* trying to do this in sql. We can gather the candidates as we currently do and then filter them afterward in python (somewhere in the _merge_candidates flow).
One drawback of this is that we can't use this if you create multiple nested layers with more than 1 depth under NUMA rps, but is that the case for OvS bandwidth?
If the restriction is because "the SQL is difficult", I would prefer not to introduce a "distance" concept. We've come up with use cases where the nesting isn't simple.
Another alternative is having a "closure table" from where we can retrieve all the descendent rp ids of an rp without joining tables. but... online migration cost?
Can we consider these optimizations later, if the python-side solution proves non-performant?
Huh, IMHO the whole benefit of having SQL with Placement was that we were getting a fast distributed lock proven safe. Here, this is a read, so I'm not really worried about potential contention, but I just wanted to say that if we go this way, we absolutely need to put enough safeguards in place so that we don't lose the key benefit of Placement. This is not trivial either way, then. -Sylvain efried
.
On Fri, 3 May 2019, Eric Fried wrote:
Another alternative is having a "closure table" from where we can retrieve all the descendent rp ids of an rp without joining tables. but... online migration cost?
Can we consider these optimizations later, if the python-side solution proves non-performant?
My preference would be that we start with the simplest option (make multiple selects, merge them appropriately in Python) and, as Eric says, if that's not good enough, pursue the optimizations. In fact, I think we should likely pursue the optimizations [1] in any case, but they should come _after_ we have some measurements. Jay provided a proposed algorithm in [2]. We have a time slot tomorrow (Saturday May 3) at 13:30 to discuss some of the finer points of implementing nested magic [3].

[1] Making placement faster is constantly a goal, but it is a secondary goal.
[2] http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005432.htm...
[3] http://lists.openstack.org/pipermail/openstack-discuss/2019-May/005823.html

-- Chris Dent ٩◔̯◔۶ https://anticdent.org/ freenode: cdent tw: @anticdent
On Fri, May 3, 2019 at 3:19 PM Chris Dent <cdent+os@anticdent.org> wrote:
On Fri, 3 May 2019, Eric Fried wrote:
Another alternative is having a "closure table" from where we can retrieve all the descendent rp ids of an rp without joining tables. but... online migration cost?
Can we consider these optimizations later, if the python-side solution proves non-performant?
My preference would be that we start with the simplest option (make multiple selects, merge them appropriately in Python) and, as Eric says, if that's not good enough, pursue the optimizations.
In fact, I think we should likely pursue the optimizations [1] in any case, but they should come _after_ we have some measurements.
Jay provided a proposed algorithm in [2].
That plan looks good to me, with the slight detail that I want to reinforce that using Python will have a cost anyway, which is that it drifts us away from the perfect world of having a distributed transactional model for free. This is to say, we should refrain as much as possible from any attempt to get rid of SQL in favour of Python (or other tools) until we get a solid consensus that those tools are as efficient and as accurate as the current situation. We have a time slot tomorrow (Saturday May 3) at 13:30 to discuss
some of the finer points of implementing nested magic [3].
I'll try to be present. -Sylvain
[1] Making placement faster is constantly a goal, but it is a secondary goal.
[2] http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005432.htm...
[3] http://lists.openstack.org/pipermail/openstack-discuss/2019-May/005823.html
-- Chris Dent ٩◔̯◔۶ https://anticdent.org/ freenode: cdent tw: @anticdent
On 2019/05/04 0:03, Eric Fried wrote:
It enables something like: * group_resources=1:2:!3:!4 which means 1 and 2 should be in the same group, but 3 shouldn't be a descendant of 1 or 2, and likewise for 4.
In a symmetric world, this one is a little ambiguous to me. Does it mean 4 shouldn't be in the same subtree as 3 as well?
I thought the negative folks were just refusing to be within the positive folks. Looks like there are use cases where we need multiple group_resources?
- I want 1 and 2 in the same subtree, and 3 and 4 in the same subtree, but the two subtrees should be separated:
* group_resources=1:2:!3:!4&group_resources=3:4
Okay, I was missing that, at the point where we merge the candidates from each request group, all the rp info for the trees is already in ProviderSummaries, and we can use it without an additional query. It looks like this can be done without impacting the performance of existing requests that have no queryparam for affinity, so I'm good with this and can volunteer to do it in Placement, since this is more of a general "subtree" thing. But I'd like to say that looking into the tracking-PCPU feature in Nova and its related problems should precede any Nova-related items to model NUMA in Placement.
-- Tetsuro Nakamura <nakamura.tetsuro@lab.ntt.co.jp> NTT Network Service Systems Laboratories TEL:0422 59 6914(National)/+81 422 59 6914(International) 3-9-11, Midori-Cho Musashino-Shi, Tokyo 180-8585 Japan
It looks like this can be done without impacting the performance of existing requests that have no queryparam for affinity,
Well, the concern is that doing this at _merge_candidates time (i.e. in python) may be slow. But yeah, let's not solve that until/unless we see it's truly a problem.
but I'd like to say that looking into the tracking-PCPU feature in Nova and its related problems should precede any Nova-related items to model NUMA in Placement.
To be clear, placement doesn't need any changes for this. I definitely don't think we should wait for it to land before starting on the placement side of the affinity work.
I thought the negative folks were just refusing to be within the positive folks. Looks like there are use cases where we need multiple group_resources?
Yes, certainly eventually we'll need this, even just for positive affinity. Example: I want two VCPUs, two chunks of memory, and two accelerators. Each VCPU/memory/accelerator combo must be affined to the same NUMA node so I can maximize the performance of the accelerator. But I don't care whether both combos come from the same or different NUMA nodes:

?resources_compute1=VCPU:1,MEMORY_MB:1024
&resources_accel1=FPGA:1
&same_subtree:compute1,accel1
&resources_compute2=VCPU:1,MEMORY_MB:1024
&resources_accel2=FPGA:1
&same_subtree:compute2,accel2

and what I want to get in return is:

candidates:
(1) NUMA1 has VCPU:1,MEMORY_MB:1024,FPGA:1; NUMA2 likewise
(2) NUMA1 has everything
(3) NUMA2 has everything

Slight aside, could we do this with can_split and just one same_subtree? I'm not sure you could expect the intended result from:

?resources_compute=VCPU:2,MEMORY_MB:2048
&resources_accel=FPGA:2
&same_subtree:compute,accel
&can_split:compute,accel

Intuitively, I think the above *either* means you don't get (1), *or* it means you can get (1)-(3) *plus* things like:

(4) NUMA1 has VCPU:2,MEMORY_MB:2048; NUMA2 has FPGA:2
- I want 1, 2 in the same subtree, and 3, 4 in the same subtree but the two subtrees should be separated:
* group_resources=1:2:!3:!4&group_resources=3:4
Right, and this too. As a first pass, I would be fine with supporting only positive affinity. And if it makes things significantly easier, supporting only a single group_resources per call. efried .
For those of you following along at home, we had a design session a couple of hours ago and hammered out the broad strokes of this work, including rough prioritization of the various pieces. Chris has updated the story [1] with a couple of notes; expect details and specs to emerge therefrom. efried [1] https://storyboard.openstack.org/#!/story/2005575
Thanks, Eric and Chris. Can this scheme address this use case?

I have a set of compute hosts, each with several NICs of type T. Each NIC has a set of PFs: PF1, PF2, .... Each PF is a resource provider, and each has a separate custom RC: CUSTOM_RC_PF1, CUSTOM_RC_PF2, .... The VFs are inventories of the associated PF's RC. Provider networks etc. are traits on that PF.

The use case is to schedule a VM with several Neutron ports coming from the same NIC card and tied to specific networks. Let us say we (somehow) translate this to a set of request groups like this:

resources_T1:CUSTOM_RC_PF1 = 2   # Note: T is the NIC name, and we are asking for VFs as resources.
traits_T1:CUSTOM_TRAIT_MYNET1 = required
resources_T2:CUSTOM_RC_PF2 = 1
traits_T2:CUSTOM_TRAIT_MYNET2 = required
"same_subtree=%s" % ','.join(suffix for suffix in all_suffixes if suffix.startswith('T'))

Will this ensure that all allocations come from the same NIC card? Do I have to create a 'resourceless RP' for the NIC card that contains the individual PF RPs as children nodes?

P.S.: Ignore the comments I added to https://storyboard.openstack.org/#!/story/2005575#comment-122255.

Regards, Sundar
-----Original Message----- From: Eric Fried <openstack@fried.cc> Sent: Saturday, May 4, 2019 3:57 PM To: openstack-discuss@lists.openstack.org Subject: Re: [placement][nova][ptg] resource provider affinity
For those of you following along at home, we had a design session a couple of hours ago and hammered out the broad strokes of this work, including rough prioritization of the various pieces. Chris has updated the story [1] with a couple of notes; expect details and specs to emerge therefrom.
efried
Sundar-
I have a set of compute hosts, each with several NICs of type T. Each NIC has a set of PFs: PF1, PF2, .... Each PF is a resource provider, and each has a separate custom RC: CUSTOM_RC_PF1, CUSTOM_RC_PF2, ... . The VFs are inventories of the associated PF's RC. Provider networks etc. are traits on that PF.
It would be weird for the inventories to be called PF* if they're inventories of VF. But mainly: why the custom resource classes?

The way "resourceless RP" + "same_subtree" is designed to work is best explained if I model your use case with standard resource classes instead:

CN
|
+---NIC1 (trait: I_AM_A_NIC)
|   |
|   +-----PF1_1 (trait: CUSTOM_PHYSNET1, inventory: VF=4)
|   |
|   +-----PF1_2 (trait: CUSTOM_PHYSNET2, inventory: VF=4)
|
+---NIC2 (trait: I_AM_A_NIC)
    |
    +-----PF2_1 (trait: CUSTOM_PHYSNET1, inventory: VF=4)
    |
    +-----PF2_2 (trait: CUSTOM_PHYSNET2, inventory: VF=4)

Now if I say:

?resources_T1=VF:1
&required_T1=CUSTOM_PHYSNET1
&resources_T2=VF:1
&required_T2=CUSTOM_PHYSNET2
&required_T3=I_AM_A_NIC
&same_subtree=','.join([suffix for suffix in suffixes if suffix.startswith('_T')])  (i.e. '_T1,_T2,_T3')

...then I'll get two candidates:

- {PF1_1: VF=1, PF1_2: VF=1} <== i.e. both from NIC1
- {PF2_1: VF=1, PF2_2: VF=1} <== i.e. both from NIC2

...and no candidates where one VF is from each NIC. IIUC this is how you wanted it.

==============

With the custom resource classes, I'm having a hard time understanding the model. How unique are the _PF$N bits? Do they repeat (a) from one NIC to the next? (b) From one host to the next? (c) Never?

The only thing that begins to make sense is (a), because (b) and (c) would lead to skittles. So assuming (a), the model would look something like:

CN
|
+---NIC1 (trait: I_AM_A_NIC)
|   |
|   +-----PF1_1 (trait: CUSTOM_PHYSNET1, inventory: CUSTOM_PF1_VF=4)
|   |
|   +-----PF1_2 (trait: CUSTOM_PHYSNET2, inventory: CUSTOM_PF2_VF=4)
|
+---NIC2 (trait: I_AM_A_NIC)
    |
    +-----PF2_1 (trait: CUSTOM_PHYSNET1, inventory: CUSTOM_PF1_VF=4)
    |
    +-----PF2_2 (trait: CUSTOM_PHYSNET2, inventory: CUSTOM_PF2_VF=4)

Now you could get the same result with (essentially) the same request as above:

?resources_T1=CUSTOM_PF1_VF:1
&required_T1=CUSTOM_PHYSNET1
&resources_T2=CUSTOM_PF2_VF:1
&required_T2=CUSTOM_PHYSNET2
&required_T3=I_AM_A_NIC
&same_subtree=','.join([suffix for suffix in suffixes if suffix.startswith('_T')])  (i.e. '_T1,_T2,_T3')

==>

- {PF1_1: CUSTOM_PF1_VF=1, PF1_2: CUSTOM_PF2_VF=1}
- {PF2_1: CUSTOM_PF1_VF=1, PF2_2: CUSTOM_PF2_VF=1}

...except that in this model, PF$N corresponds to PHYSNET$N, so you wouldn't actually need the required_T$N=CUSTOM_PHYSNET$N to get the same result:

?resources_T1=CUSTOM_PF1_VF:1
&resources_T2=CUSTOM_PF2_VF:1
&required_T3=I_AM_A_NIC
&same_subtree=','.join([suffix for suffix in suffixes if suffix.startswith('_T')])  (i.e. '_T1,_T2,_T3')

...because you're effectively encoding the physnet into the RC. Which is not good IMO. But either way...

Do I have to create a 'resourceless RP' for the NIC card that contains the individual PF RPs as children nodes?

...if you want to be able to request this kind of affinity, then yes, you do (unless there's some consumable resource on the NIC, in which case it's not resourceless, but the spirit is the same). This is exactly what these features are being designed for.

Thanks, efried .
On 5/8/2019 2:31 PM, Eric Fried wrote:
Sundar-
I have a set of compute hosts, each with several NICs of type T. Each NIC has a set of PFs: PF1, PF2, .... Each PF is a resource provider, and each has a separate custom RC: CUSTOM_RC_PF1, CUSTOM_RC_PF2, ... . The VFs are inventories of the associated PF's RC. Provider networks etc. are traits on that PF.
It would be weird for the inventories to be called PF* if they're inventories of VF.
I am focusing mainly on the concepts for now, not on the names.
But mainly: why the custom resource classes?
This is as elaborate an example as I could cook up. IRL, we may need some custom RC, but maybe not one for each PF type.
The way "resourceless RP" + "same_subtree" is designed to work is best explained if I model your use case with standard resource classes instead:
CN
|
+---NIC1 (trait: I_AM_A_NIC)
|   |
|   +-----PF1_1 (trait: CUSTOM_PHYSNET1, inventory: VF=4)
|   |
|   +-----PF1_2 (trait: CUSTOM_PHYSNET2, inventory: VF=4)
|
+---NIC2 (trait: I_AM_A_NIC)
    |
    +-----PF2_1 (trait: CUSTOM_PHYSNET1, inventory: VF=4)
    |
    +-----PF2_2 (trait: CUSTOM_PHYSNET2, inventory: VF=4)
Now if I say:
?resources_T1=VF:1 &required_T1=CUSTOM_PHYSNET1 &resources_T2=VF:1 &required_T2=CUSTOM_PHYSNET2 &required_T3=I_AM_A_NIC &same_subtree=','.join([suffix for suffix in suffixes if suffix.startswith('_T')]) (i.e. '_T1,_T2,_T3')
...then I'll get two candidates:
- {PF1_1: VF=1, PF1_2: VF=1} <== i.e. both from NIC1 - {PF2_1: VF=1, PF2_2: VF=1} <== i.e. both from NIC2
...and no candidates where one VF is from each NIC.
IIUC this is how you wanted it.
Yes. The examples in the storyboard [1] for NUMA affinity use group numbers. If that were recast to use named groups, and we wanted NUMA affinity apart from device colocation, would that not require a different name than T? In short, if you want to express 2 different affinities/groupings, perhaps we need to use a name with 2 parts, and use 2 different same_subtree clauses. Just pointing out the implications.

BTW, I noticed there is a standard RC for NIC VFs [2].

[1] https://storyboard.openstack.org/#!/story/2005575
[2] https://github.com/openstack/os-resource-classes/blob/master/os_resource_cla...
==============
With the custom resource classes, I'm having a hard time understanding the model. How unique are the _PF$N bits? Do they repeat (a) from one NIC to the next? (b) From one host to the next? (c) Never?
Yes, (a) is what I had in mind.

The only thing that begins to make sense is (a), because (b) and (c) would lead to skittles. So assuming (a), the model would look something like:

CN
|
+---NIC1 (trait: I_AM_A_NIC)
|   |
|   +-----PF1_1 (trait: CUSTOM_PHYSNET1, inventory: CUSTOM_PF1_VF=4)
|   |
|   +-----PF1_2 (trait: CUSTOM_PHYSNET2, inventory: CUSTOM_PF2_VF=4)
|
+---NIC2 (trait: I_AM_A_NIC)
    |
    +-----PF2_1 (trait: CUSTOM_PHYSNET1, inventory: CUSTOM_PF1_VF=4)
    |
    +-----PF2_2 (trait: CUSTOM_PHYSNET2, inventory: CUSTOM_PF2_VF=4)
Now you could get the same result with (essentially) the same request as above:
?resources_T1=CUSTOM_PF1_VF:1 &required_T1=CUSTOM_PHYSNET1 &resources_T2=CUSTOM_PF2_VF:1 &required_T2=CUSTOM_PHYSNET2 &required_T3=I_AM_A_NIC &same_subtree=','.join([suffix for suffix in suffixes if suffix.startswith('_T')]) (i.e. '_T1,_T2,_T3')
==>
- {PF1_1: CUSTOM_PF1_VF=1, PF1_2: CUSTOM_PF2_VF=1} - {PF2_1: CUSTOM_PF1_VF=1, PF2_2: CUSTOM_PF2_VF=1}
...except that in this model, PF$N corresponds to PHYSNET$N, so you wouldn't actually need the required_T$N=CUSTOM_PHYSNET$N to get the same result:
?resources_T1=CUSTOM_PF1_VF:1 &resources_T2=CUSTOM_PF2_VF:1 &required_T3=I_AM_A_NIC &same_subtree=','.join([suffix for suffix in suffixes if suffix.startswith('_T')]) (i.e. '_T1,_T2,_T3')
...because you're effectively encoding the physnet into the RC. Which is not good IMO.
But either way...
Do I have to create a 'resourceless RP' for the NIC card that contains the individual PF RPs as children nodes?
...if you want to be able to request this kind of affinity, then yes, you do (unless there's some consumable resource on the NIC, in which case it's not resourceless, but the spirit is the same). This is exactly what these features are being designed for.
Great. Thank you very much for the detailed reply. Regards, Sundar
Thanks, efried .
Sundar-
Yes. The examples in the storyboard [1] for NUMA affinity use group numbers. If that were recast to use named groups, and we wanted NUMA affinity apart from device colocation, would that not require a different name than T? In short, if you want to express 2 different affinities/groupings, perhaps we need to use a name with 2 parts, and use 2 different same_subtree clauses. Just pointing out the implications.
That's correct. If we wanted two groupings... [repeating diagram for context]

CN
|
+---NIC1 (trait: I_AM_A_NIC)
|   |
|   +-----PF1_1 (trait: CUSTOM_PHYSNET1, inventory: VF=4)
|   |
|   +-----PF1_2 (trait: CUSTOM_PHYSNET2, inventory: VF=4)
|
+---NIC2 (trait: I_AM_A_NIC)
    |
    +-----PF2_1 (trait: CUSTOM_PHYSNET1, inventory: VF=4)
    |
    +-----PF2_2 (trait: CUSTOM_PHYSNET2, inventory: VF=4)

?resources_TA1=VF:1&required_TA1=CUSTOM_PHYSNET1
&resources_TA2=VF:1&required_TA2=CUSTOM_PHYSNET2
&required_TA3=I_AM_A_NIC
&same_subtree=','.join([suffix for suffix in suffixes if suffix.startswith('_TA')])  # (i.e. '_TA1,_TA2,_TA3')
&resources_TB1=VF:1&required_TB1=CUSTOM_PHYSNET1
&resources_TB2=VF:1&required_TB2=CUSTOM_PHYSNET2
&required_TB3=I_AM_A_NIC
&same_subtree=','.join([suffix for suffix in suffixes if suffix.startswith('_TB')])  # (i.e. '_TB1,_TB2,_TB3')

This would give us four candidates:

- One where TA* is under NIC1 and TB* is under NIC2
- One where TB* is under NIC1 and TA* is under NIC2
- One where everything is under NIC1
- One where everything is under NIC2

This of course leads to some nontrivial questions, like:

- How do we express these groupings from the operator-/user-facing sources (flavor, port, device_profile, etc.)? Especially when different pieces come from different sources but still need to be affined to each other. This is helped by allowing named as opposed to autonumbered suffixes, which is why we're doing that, but it's still going to be tricky to do in practice.

- What if we want to express anti-affinity, i.e. limit the response to just the first two candidates? We discussed being able to say something like same_subtree=_TA3,!_TB3, but decided to defer that design/implementation for now. If you want this kind of thing in Train, you'll have to filter post-Placement.

Thanks, efried .
On May 3, 2019, at 8:22 AM, Tetsuro Nakamura <tetsuro.nakamura.bc@hco.ntt.co.jp> wrote:
I have looked into recursive SQL CTE (common table expression) feature which help us treat subtree easily in adjacency list model in a experimental patch [2], but unfortunately it looks like the feature is still experimental in MySQL, and we don't want to query like this per every candidates, do we? :(
At the risk of repeating myself, SQL doesn’t model the relationships among entities involved with either nested providers or shared providers. These relationships are modeled simply in a graph database, avoiding the gymnastics needed to fit them into a relational DB. I have a working model of Placement that has already solved nested providers (any depth), shared providers, project usages, and more. If you have time while at PTG, grab me and I’d be happy to demonstrate. -- Ed Leafe
On Mon, Apr 29, 2019 at 11:38 PM, Eric Fried <openstack@fried.cc> wrote:
Jay suggested extending the JSON schema to allow groups that have names like resources_compute, required_network. That might allow for some conventions to emerge but still requires some measure of knowledge from the participants.
I think this is a good idea to pursue, because it gives us a way to predefine (by convention) what the groups are called, as opposed to having them be automatically, arbitrarily, unpredictably numbered. It'll still break down in more complex scenarios where, say, there's more than one device_group with different affinity requirements; but it could work for the simpler setups.
I support this idea. Today the RequestGroup contains a requester_id field [1] to map the numbered group back to the neutron port (or cyborg device) it originated from. If the group can be named instead of only numbered, then this mapping can be encoded into the name of the group, like resources_port_<port_uuid>. This would also make sure that the name of the group is unique and, more importantly, stable (today we generate the number of the numbered group originated from a neutron port), and that helps troubleshooting. Cheers, gibi [1]https://github.com/openstack/nova/blob/ce5ef763b58cad09440e0da67733ce5780687...
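A small, purely hypothetical sketch of that naming idea (none of these helpers or values exist in nova; the port UUID and suffix names are made up), showing how stable suffixes could then feed a same_subtree parameter:

# Hypothetical: derive a stable request-group suffix from whatever
# originated the group, instead of an auto-assigned number.
def group_suffix(requester_type, requester_id):
    # e.g. ("port", "<uuid>") -> "_port_<uuid>": unique and stable per requester.
    return "_%s_%s" % (requester_type, requester_id)

port_uuid = "3fa85f64-5717-4562-b3fc-2c963f66afa6"  # made-up example value
flavor_suffix = "_flavor_compute"                    # hypothetical flavor-derived name
port_suffix = group_suffix("port", port_uuid)

# Because the suffixes come from stable identifiers rather than auto-assigned
# numbers, they are predictable in logs and across the participating services.
query = (
    "?resources%s=VCPU:2,MEMORY_MB:2048"
    "&resources%s=SRIOV_NET_VF:1"
    "&same_subtree=%s"
) % (flavor_suffix, port_suffix, ",".join([flavor_suffix, port_suffix]))

print(query)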
On 04/09/2019 08:36 AM, Chris Dent wrote:
Spec: https://review.openstack.org/650476
From the commit message:
To support NUMA and similar concepts, this proposes the ability to request resources from different providers nested under a common subtree (below the root provider).
There's much in the feature described by the spec and the surrounding context that is frequently a source of contention in the placement group, so working through this spec is probably going to require some robust discussion. Doing most of that before the PTG will help make sure we're not going in circles in person.k
Some of the areas of potential contention:
* Adequate for limited but maybe not all use case solutions * Strict trait constructionism * Evolving the complexity of placement solely for the satisfaction of hardware representation in Nova * Inventory-less resource providers * Developing new features in placement before existing features are fully used in client services * Others?
I list this not because they are deal breakers or the only thing that matters, but because they have presented stumbling blocks in the past and we may as well work to address them (or make an agreement to punt them until later) otherwise there will be lingering dread.
And, beyond all that squishy stuff, there is the necessary discussion over the solution described in the spec. There are several alternatives listed in the spec, and a few more in the comments. We'd like to figure out the best solution that can actually be done in a reasonable amount of time, not the best solution in the absolute.
How about making the API something like this?

GET /allocation_candidates?resources1=PCPU:2,MEMORY_MB:2048\
  &resources2=SRIOV_NET_VF:1\
  &group_resources=2:1

the &group_resources query parameter would indicate the request group identifiers to group resources with. For example, group_resources=2:1 means "the providers of resources in request group '2' should be in the same tree as providers of resources in request group '1'"

No traits needed at all and the API call has more imperative clarity, since the caller is saying exactly how they expect the placement service to group related resources.

For "anti-affinity", just add a ! before the group identifier. For example:

GET /allocation_candidates?resources1=PCPU:2,MEMORY_MB:2048\
  &resources2=SRIOV_NET_VF:1\
  &group_resources=2:!1

Would mean "the providers of resources in request group '2' should NOT be in the same tree as providers of resources in request group '1'"

Best, -jay

[0] Also, there's a single regex that currently mandates that a request group identifier needs to be a number: https://github.com/openstack/placement/blob/master/placement/lib.py#L30-L32

We could easily change that and allow named identifiers instead of numbers so that we could do, for example:

GET /allocation_candidates?resources_compute=PCPU:2,MEMORY_MB:2048\
  &resources_network=SRIOV_NET_VF:1\
  &group_resources=_compute:_network
participants (10)
- Alex Xu
- Balázs Gibizer
- Chris Dent
- Ed Leafe
- Eric Fried
- Jay Pipes
- Nadathur, Sundar
- Sean Mooney
- Sylvain Bauza
- Tetsuro Nakamura