[placement][nova][ptg] Resourceless trait filters

Chris Dent

9 Apr 2019

5:22 p.m.

From the etherpad [1] (the cross project one):

* This came up with the effort to add a multiattach capability filter, at https://review.openstack.org/#/c/645316/1/nova/compute/api.py@1098

* The problem: we want to be able to request allocation candidates filtered by a trait that exists on the root provider; but

* (a) the root provider may provide no resources (eventually - e.g. CPU/mem in NUMA providers, shared disk); and/or

* (My brain hurts from the concept for a provider that provides nothing. Perhaps it provides something we aren't remembering to count?)

* (b) we are using only numbered groups and would have to guess where to stuff the `required` - again probably based on looking for the VCPU/mem resources, which, as above, may wind up not on the root provider.

This, basically, is how/when do we deal with resource providers that have traits but have no classes of inventory of their own (all of it is on their descendants). The "My brain hurts" comment above is mine.

The discussion on the review above (good backgrounder for this thread) had little to do with the immediate need of the proposed feature. It was a concern about how things will work in the future. So there are two salient questions (and presumably plenty of other sidebars):

* How should it work?
* What is the timeline for when it needs to work?

[1] https://etherpad.openstack.org/p/ptg-train-xproj-nova-placement

--
Chris Dent ٩◔̯◔۶ https://anticdent.org/ freenode: cdent tw: @anticdent


Balázs Gibizer

10 Apr

9:24 a.m.

On Tue, Apr 9, 2019 at 7:22 PM, Chris Dent <cdent+os@anticdent.org> wrote:

...

From the etherpad [1] (the cross project one):

* This came up with the effort to add a multiattach capability filter, at https://review.openstack.org/#/c/645316/1/nova/compute/api.py@1098

Just for completeness: in the example linked above, requesting the trait was the problem. It was requested in a separate numbered request group that had no requested resources. Today that is invalid in the placement a_c API.

...

* The problem: we want to be able to request allocation candidates filtered by a trait that exists on the root provider; but

* (a) the root provider may provide no resources (eventually - e.g. CPU/mem in NUMA providers, shared disk); and/or

* (My brain hurts from the concept for a provider that provides nothing. Perhaps it provides something we aren't remembering to count?)

Since providers can be nested, a provider can be a structural element in a nested tree without providing resources. This happens in the bandwidth case, where the agent RPs are there to connect the device RPs to neutron agents. It is a bit like a resource provider partitioning scheme, as well as an RP ownership definition, implemented by nested subtrees. To avoid the trait-on-an-empty-RP problem in this case, all the traits are added to the device RPs even if some of the traits (CUSTOM_VNIC_TYPE_XXX) more logically belong to the agent RP. I'm currently happy with the resulting RP tree and trait handling in the bandwidth model.

...

* (b) we are using only numbered groups and would have to guess where to stuff the `required` - again probably based on looking for the VCPU/mem resources, which, as above, may wind up not on the root provider.

This, basically, is how/when do we deal with resource providers that have traits but have no classes of inventory of their own (all of it is on their descendants). The "My brain hurts" comment above is mine.

I think the possible solution of moving the trait down to the RP that provides the resources that have the capability described by the trait could be a simple solution that does not require placement changes.

...

The discussion on the review above (good backgrounder for this thread) had little to do with the immediate need of the proposed feature. It was a concern about how things will work in the future. So there are two salient questions (and presumably plenty of other sidebars):

* How should it work?
* What is the timeline for when it needs to work?

I think this work should be driven by a specific need, like the NUMA modelling in placement by nova. What is the timeline of NUMA modelling in placement? Cheers, gibi

...

[1] https://etherpad.openstack.org/p/ptg-train-xproj-nova-placement

-- Chris Dent ٩◔̯◔۶ https://anticdent.org/ freenode: cdent tw: @anticdent

Eric Fried

1:38 p.m.

...

Since providers can be nested, a provider can be a structural element in a nested tree without providing resources.

This. I wouldn't get too hung up on "a resource provider that doesn't provide resources". "Resource provider" is just a name of a thing. It doesn't (can't) exhaustively define what the thing is and does.

...

To avoid the trait-on-an-empty-RP problem in this case, all the traits are added to the device RPs even if some of the traits (CUSTOM_VNIC_TYPE_XXX) more logically belong to the agent RP.

Noting that I have at best a cursory understanding of all things network, this seems okay to me; I can get my brain around a "network device" having a "VNIC type" trait. However...

...

I think the possible solution of moving the trait down to the RP that provides the resources that have the capability described by the trait

In Matt's patch, the traits represent capabilities of the virt driver. I *suppose* you could argue to attach traits like COMPUTE_NET_ATTACH_INTERFACE to providers of network resources, or COMPUTE_VOLUME_EXTEND to providers of disk resources (even those are pretty awkward). But where would you put COMPUTE_TRUSTED_CERTS? And let's please not contrive non-resource resources to satisfy architectural purity.

...

I think this work should be driven by a specific need, like the NUMA modelling in placement by nova. What is the timeline of NUMA modelling in placement?

Proposes NUMA topology with RPs: https://review.openstack.org/552924
Proposes NUMA affinity for vGPUs: https://review.openstack.org/650963
Spec: Allocation Candidates: Subtree Affinity: https://review.openstack.org/650476

and a bunch of the other specs in flight are affected by NUMA modeling.

We've been building up to this for years. If it doesn't happen in Train, it'll happen in U. While I agree we shouldn't solve a problem until it's a problem, I also don't want another situation like reshaper, where suddenly at the end of the cycle we realize/remember this nontrivial thing we need to handle.

efried

Balázs Gibizer

2:54 p.m.

On Wed, Apr 10, 2019 at 3:38 PM, Eric Fried <openstack@fried.cc> wrote:

...

...
Since providers can be nested, a provider can be a structural element in a nested tree without providing resources.

This.

I wouldn't get too hung up on "a resource provider that doesn't provide resources". "Resource provider" is just a name of a thing. It doesn't (can't) exhaustively define what the thing is and does.

...
To avoid the trait-on-an-empty-RP problem in this case, all the traits are added to the device RPs even if some of the traits (CUSTOM_VNIC_TYPE_XXX) more logically belong to the agent RP.

Noting that I have at best a cursory understanding of all things network, this seems okay to me; I can get my brain around a "network device" having a "VNIC type" trait. However...

...
I think the possible solution of moving the trait down to the RP that provides the resources that have the capability described by the trait

In Matt's patch, the traits represent capabilities of the virt driver. I *suppose* you could argue to attach traits like COMPUTE_NET_ATTACH_INTERFACE to providers of network resources, or COMPUTE_VOLUME_EXTEND to providers of disk resources (even those are pretty awkward). But where would you put COMPUTE_TRUSTED_CERTS? And let's please not contrive non-resource resources to satisfy architectural purity.

Good point. I agree that COMPUTE_TRUSTED_CERTS logically belongs to the root RP of the compute node. So we have a specific use case: COMPUTE_TRUSTED_CERTS + NUMA requires placement to support traits on RPs that are not providing resources. Will in this case VCPU and MEMORY_MB always requested in the unnumbered group still? Or there are cases when those resources are requested in a numbered group? If the latter, then we have a use case to support request groups in the a_c query that only require a trait (COMPUTE_TRUSTED_CERTS) but do not require any resources.

...

...
I think this work should be driven by a specific need, like the NUMA modelling in placement by nova. What is the timeline of NUMA modelling in placement?

Proposes NUMA topology with RPs: https://review.openstack.org/552924
Proposes NUMA affinity for vGPUs: https://review.openstack.org/650963
Spec: Allocation Candidates: Subtree Affinity: https://review.openstack.org/650476

and a bunch of the other specs in flight are affected by NUMA modeling.

We've been building up to this for years. If it doesn't happen in Train, it'll happen in U. While I agree we shouldn't solve a problem until it's a problem, I also don't want another situation like reshaper, where suddenly at the end of the cycle we realize/remember this nontrivial thing we need to handle.

I don't want to delay the development of this feature. I would like to see us implement a given feature in placement because it is needed for a specific use case of a specific consumer of placement. So if the use case discovered above (COMPUTE_TRUSTED_CERTS + NUMA) is valid, then I have no problem adding the necessary changes to placement to make this use case doable in nova. Cheers, gibi

...

efried .

Eric Fried

3:37 p.m.

...

So we have a specific use case: COMPUTE_TRUSTED_CERTS + NUMA requires placement to support traits on RPs that are not providing resources. Will in this case VCPU and MEMORY_MB always requested in the unnumbered group still? Or there are cases when those resources are requested in a numbered group?

That'll depend on the outcome of the various NUMA modeling discussions; but as soon as we need to be able to request resources from more than one NUMA node, it seems certain that we'll need to use numbered groups. efried .

Balázs Gibizer

11 Apr

12:42 p.m.

On Wed, Apr 10, 2019 at 5:37 PM, Eric Fried <openstack@fried.cc> wrote:

...

...
So we have a specific use case: COMPUTE_TRUSTED_CERTS + NUMA requires placement to support traits on RPs that are not providing resources. Will in this case VCPU and MEMORY_MB always requested in the unnumbered group still? Or there are cases when those resources are requested in a numbered group?

That'll depend on the outcome of the various NUMA modeling discussions; but as soon as we need to be able to request resources from more than one NUMA node, it seems certain that we'll need to use numbered groups.

Then we need two steps: 1) handle RPs that only have traits but not resources 2) allow having the numbered request groups specify only a trait without requesting resources. Cheers, gibi

...

efried .

Eric Fried

6:01 p.m.

...

Then we need two steps: 1) handle RPs that only have traits but not resources 2) allow having the numbered request groups specify only a trait without requesting resources.

Nothing stops me from traiting an inventoryless provider today, right? Do we actually need to involve numbered request groups? Does a concept of required_in_tree=$trait_list satisfy the requirements? efried .

Balázs Gibizer

15 Apr

8:14 a.m.

On Thu, Apr 11, 2019 at 8:01 PM, Eric Fried <openstack@fried.cc> wrote:

...

...
Then we need two steps: 1) handle RPs that only have traits but not resources 2) allow having the numbered request groups specify only a trait without requesting resources.

Nothing stops me from traiting an inventoryless provider today, right?

Does the GET /allocation_candidates query handle them properly?

...

Do we actually need to involve numbered request groups? Does a concept of required_in_tree=$trait_list satisfy the requirements?

I don't know. I guess I have to read required_in_tree=$trait_list spec. Please ignore me. cheers, gibi

...

efried .

Eric Fried

10:48 a.m.

...

On Apr 15, 2019, at 03:14, Balázs Gibizer <balazs.gibizer@ericsson.com> wrote:

On Thu, Apr 11, 2019 at 8:01 PM, Eric Fried <openstack@fried.cc> wrote:

...
...
Then we need two steps: 1) handle RPs that only have traits but not resources 2) allow having the numbered request groups specify only a trait without requesting resources.

Nothing stops me from traiting an inventoryless provider today, right?

Does the GET /allocation_candidates query handle them properly?

No, I was just pointing out that "step 1" was okay, depending on what you mean by "handle".

...

...
Do we actually need to involve numbered request groups? Does a concept of required_in_tree=$trait_list satisfy the requirements?

I don't know. I guess I have to read required_in_tree=$trait_list spec. Please ignore me.

There's no such spec yet. The above is a sincere question that could lead to one.

The GET /a_c syntax

required_in_tree=$trait_list

would cause the result to include results where the listed traits are *anywhere* in the tree, even if the providers having those traits don't provide resources to the request. This should be fairly simple to implement. I'm asking us to ponder whether it will satisfy our use cases here.

Another option is something like root_required=$trait_list (the traits are only on the root, even when the root doesn't provide resources to the request). This satisfies the immediate use cases, but it's obviously more restrictive, and I don't think it's any easier to implement.

Neither of these is perfect, probably. But also neither requires doing anything with numbered groups. Did you have something else in mind for that?

efried
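To make the difference between the two options concrete, here is a minimal sketch over a toy in-memory provider tree. It is not placement code: the `Provider` class and the `tree_has_traits` and `root_has_traits` helpers are hypothetical names, `CUSTOM_EXAMPLE` is a made-up trait, and neither `required_in_tree` nor `root_required` exists as a query parameter at the time of this thread. The sketch also assumes that "anywhere in the tree" means each listed trait may sit on a different provider, which is only one possible reading.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """Toy resource provider: a name, traits, inventory, and children."""
    name: str
    traits: set = field(default_factory=set)
    inventory: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def walk(provider):
    """Yield the provider itself and then every descendant."""
    yield provider
    for child in provider.children:
        yield from walk(child)


def tree_has_traits(root, required):
    """Hypothetical required_in_tree: each trait may be anywhere in the tree."""
    found = set()
    for provider in walk(root):
        found |= provider.traits
    return set(required) <= found


def root_has_traits(root, required):
    """Hypothetical root_required: every trait must be on the root itself."""
    return set(required) <= root.traits


# NUMA-style layout: the root carries a trait but no inventory of its own.
numa0 = Provider("numa0", traits={"CUSTOM_EXAMPLE"}, inventory={"VCPU": 8})
compute = Provider("compute", traits={"COMPUTE_TRUSTED_CERTS"}, children=[numa0])

print(tree_has_traits(compute, ["COMPUTE_TRUSTED_CERTS"]))  # True
print(root_has_traits(compute, ["COMPUTE_TRUSTED_CERTS"]))  # True
print(tree_has_traits(compute, ["CUSTOM_EXAMPLE"]))         # True
print(root_has_traits(compute, ["CUSTOM_EXAMPLE"]))         # False
```

In this toy model both checks accept the inventoryless root for COMPUTE_TRUSTED_CERTS, and they only differ for a trait that lives on a child.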

Balázs Gibizer

11:38 a.m.

On Mon, Apr 15, 2019 at 12:48 PM, Eric Fried <openstack@fried.cc> wrote:

...

...
On Apr 15, 2019, at 03:14, Balázs Gibizer <balazs.gibizer@ericsson.com> wrote:

On Thu, Apr 11, 2019 at 8:01 PM, Eric Fried <openstack@fried.cc> wrote:

...
...
Then we need two steps: 1) handle RPs that only have traits but not resources 2) allow having the numbered request groups specify only a trait without requesting resources.

Nothing stops me from traiting an inventoryless provider today, right?

Does the GET /allocation_candidates query handle them properly?

No, I was just pointing out that "step 1" was okay, depending on what you mean by "handle".

...
...
Do we actually need to involve numbered request groups? Does a concept of required_in_tree=$trait_list satisfy the requirements?

I don't know. I guess I have to read required_in_tree=$trait_list spec. Please ignore me.

There's no such spec yet. The above is a sincere question that could lead to one.

Sorry, there were too many mails for me to process, so I assumed I had overlooked this one as well.

...

The GET /a_c syntax

required_in_tree=$trait_list

would cause the result to include results where the listed traits are *anywhere* in the tree, even if the providers having those traits don't provide resources to the request. This should be fairly simple to implement. I'm asking us to ponder whether it will satisfy our use cases here.

That would be yet another new way to require something, different from how we have required things so far. We initially had the unnumbered group (we didn't even call it that from the start) with resources and traits (and aggregates). Then we realized that we needed more granularity to express two sets of resources fulfilling different, even contradicting, traits, so we created numbered request groups. Now we have the next complication: the traits are not on the same RP we are requesting resources from, but numbered groups are fulfilled from a single RP only.

So what we need to solve is: two (or more) sets of resources where the different sets require different, contradicting traits, in a setup where the trait is not on the RP where the resource inventory is.

compute RP
|
|
|____ OVS agent RP
|     | * CUSTOM_VNIC_TYPE_NORMAL
|     |
|     |___________ br-int dev RP
|                    * CUSTOM_PHYSNET_PHYSNET0
|                    * NET_BW_EGR_KILOBIT_PER_SEC: 1000
|
|
|____ SRIOV agent RP
      | * CUSTOM_VNIC_TYPE_DIRECT
      |
      |
      |___________ esn1 dev RP
      |              * CUSTOM_PHYSNET_PHYSNET0
      |              * NET_BW_EGR_KILOBIT_PER_SEC: 10000
      |
      |___________ esn2 dev RP
                     * CUSTOM_PHYSNET_PHYSNET1
                     * NET_BW_EGR_KILOBIT_PER_SEC: 20000

Then having two neutron ports in a server create request:

* port-normal:
    "resource_request": {
        "resources": {orc.NET_BW_EGR_KILOBIT_PER_SEC: 1000},
        "required": ["CUSTOM_PHYSNET0", "CUSTOM_VNIC_TYPE_NORMAL"]
    }

* port-direct:
    "resource_request": {
        "resources": {orc.NET_BW_EGR_KILOBIT_PER_SEC: 2000},
        "required": ["CUSTOM_PHYSNET0", "CUSTOM_VNIC_TYPE_DIRECT"]
    }

Today this will become two numbered request groups. But as numbered groups are always fulfilled by a single RP, the a_c result is empty due to the CUSTOM_VNIC_TYPE_ traits.

The 'required_in_tree' in this case would need to contain both CUSTOM_VNIC_TYPE_NORMAL and CUSTOM_VNIC_TYPE_DIRECT, which would either be a contradiction (if we interpret it in a way that both need to be fulfilled from the same RP) or would not express the port request (if we interpret it in a way that the trait needs to be in the tree somewhere).
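The empty result described above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not placement's matching code: `PROVIDERS` and `match_group` are hypothetical names, capacity accounting is reduced to a simple comparison against total inventory, and the physnet trait is spelled as it appears in the tree (CUSTOM_PHYSNET_PHYSNET0).

```python
BW = "NET_BW_EGR_KILOBIT_PER_SEC"

# Toy copy of the tree above: name -> (parent, traits, inventory).
PROVIDERS = {
    "compute": (None, set(), {}),
    "ovs-agent": ("compute", {"CUSTOM_VNIC_TYPE_NORMAL"}, {}),
    "br-int": ("ovs-agent", {"CUSTOM_PHYSNET_PHYSNET0"}, {BW: 1000}),
    "sriov-agent": ("compute", {"CUSTOM_VNIC_TYPE_DIRECT"}, {}),
    "esn1": ("sriov-agent", {"CUSTOM_PHYSNET_PHYSNET0"}, {BW: 10000}),
    "esn2": ("sriov-agent", {"CUSTOM_PHYSNET_PHYSNET1"}, {BW: 20000}),
}


def match_group(resources, required):
    """Return providers that alone satisfy both the resources and the traits.

    This mirrors the "a numbered group is fulfilled by a single RP" rule:
    traits on a parent or sibling provider are not considered.
    """
    matches = []
    for name, (_parent, traits, inventory) in PROVIDERS.items():
        has_resources = all(
            inventory.get(rc, 0) >= amount for rc, amount in resources.items()
        )
        if has_resources and set(required) <= traits:
            matches.append(name)
    return matches


# port-normal with the VNIC type trait: no single provider has everything.
print(match_group({BW: 1000},
                  ["CUSTOM_PHYSNET_PHYSNET0", "CUSTOM_VNIC_TYPE_NORMAL"]))
# → []

# Dropping the VNIC type trait matches, but no longer tells OVS from SRIOV.
print(match_group({BW: 1000}, ["CUSTOM_PHYSNET_PHYSNET0"]))
# → ['br-int', 'esn1']
```

The second call shows why simply omitting the agent-level trait is not enough: the group then matches devices under both agents.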

...

Another option is something like root_required=$trait_list (the traits are only on the root, even when the root doesn't provide resources to the request). This satisfies the immediate use cases, but it's obviously more restrictive, and I don't think it's any easier to implement.

The 'root_required' would not solve the above example either, as the above example does not depend on the root RP at all.

...

Neither of these is perfect, probably. But also neither requires doing anything with numbered groups. Did you have something else in mind for that?

See above. I think the above example is a bit more generic than the NUMA-based one because it does not depend on the root having traits. It also shows why numbered groups need to be considered when we talk about traits not being on the RP that provides the requested resources. I would try to extend the API in a way that extends the existing tools, the groups, to be able to support requests against the above tree. How to do that? I don't know. I need to think about it. But we might need to revisit the decision to restrict a numbered group to a single RP. Cheers, gibi

...

efried .

Alex Xu

16 Apr

3:15 p.m.

On Wed, Apr 10, 2019 at 10:59 PM, Balázs Gibizer <balazs.gibizer@ericsson.com> wrote:

...

On Wed, Apr 10, 2019 at 3:38 PM, Eric Fried <openstack@fried.cc> wrote:

...
...
Since providers can be nested, a provider can be a structural element in a nested tree without providing resources.

This.

I wouldn't get too hung up on "a resource provider that doesn't provide resources". "Resource provider" is just a name of a thing. It doesn't (can't) exhaustively define what the thing is and does.

...
To avoid the trait-on-an-empty-RP problem in this case, all the traits are added to the device RPs even if some of the traits (CUSTOM_VNIC_TYPE_XXX) more logically belong to the agent RP.

Noting that I have at best a cursory understanding of all things network, this seems okay to me; I can get my brain around a "network device" having a "VNIC type" trait. However...

...
I think the possible solution of moving the trait down to the RP that provides the resources that have the capability described by the trait

In Matt's patch, the traits represent capabilities of the virt driver. I *suppose* you could argue to attach traits like COMPUTE_NET_ATTACH_INTERFACE to providers of network resources, or COMPUTE_VOLUME_EXTEND to providers of disk resources (even those are pretty awkward). But where would you put COMPUTE_TRUSTED_CERTS? And let's please not contrive non-resource resources to satisfy architectural purity.

Good point. I agree that COMPUTE_TRUSTED_CERTS logically belongs to the root RP of the compute node. So we have a specific use case: COMPUTE_TRUSTED_CERTS + NUMA requires placement to support traits on RPs that are not providing resources. Will in this case VCPU and MEMORY_MB always requested in the unnumbered group still? Or there are cases when those resources are requested in a numbered group? If the latter, then we have a use case to support request groups in the a_c query that only require a trait (COMPUTE_TRUSTED_CERTS) but do not require any resources.

Maybe the problem isn't "traits on RPs that are not providing resources"; the problem is that we don't count, in the compute node, the resources those traits are attached to. All those virt driver capability traits describe capabilities of the VM (or, say, VM resources). But we don't count the number of VMs, so we don't have inventory for the VM. The compute node uses the VM (or BM, but just use VM as the example here) to provide the compute resources. So I think it is OK to attach those traits to the RP which provides the VCPU resource.

...

...
...
I think this work should be driven by a specific need, like the NUMA modelling in placement by nova. What is the timeline of NUMA modelling in placement?

Proposes NUMA topology with RPs: https://review.openstack.org/552924
Proposes NUMA affinity for vGPUs: https://review.openstack.org/650963
Spec: Allocation Candidates: Subtree Affinity: https://review.openstack.org/650476

and a bunch of the other specs in flight are affected by NUMA modeling.

We've been building up to this for years. If it doesn't happen in Train, it'll happen in U. While I agree we shouldn't solve a problem until it's a problem, I also don't want another situation like reshaper, where suddenly at the end of the cycle we realize/remember this nontrivial thing we need to handle.

I don't want to delay the development of this feature. I would like to see us implement a given feature in placement because it is needed for a specific use case of a specific consumer of placement. So if the use case discovered above (COMPUTE_TRUSTED_CERTS + NUMA) is valid, then I have no problem adding the necessary changes to placement to make this use case doable in nova.

Cheers, gibi

...
efried .

Balázs Gibizer

3:25 p.m.

On Tue, Apr 16, 2019 at 5:15 PM, Alex Xu <soulxu@gmail.com> wrote:

...

On Wed, Apr 10, 2019 at 10:59 PM, Balázs Gibizer <balazs.gibizer@ericsson.com> wrote:

...
...
...
Since providers can be nested, a provider can be a structural element in a nested tree without providing resources.

This.

I wouldn't get too hung up on "a resource provider that doesn't provide resources". "Resource provider" is just a name of a thing. It doesn't (can't) exhaustively define what the thing is and does.

...
To avoid the trait-on-an-empty-RP problem in this case, all the traits are added to the device RPs even if some of the traits (CUSTOM_VNIC_TYPE_XXX) more logically belong to the agent RP.

Noting that I have at best a cursory understanding of all things network, this seems okay to me; I can get my brain around a "network device" having a "VNIC type" trait. However...

...
I think the possible solution of moving the trait down to the RP that provides the resources that have the capability described by the trait

On Wed, Apr 10, 2019 at 3:38 PM, Eric Fried <openstack@fried.cc> wrote:

...
In Matt's patch, the traits represent capabilities of the virt driver. I *suppose* you could argue to attach traits like COMPUTE_NET_ATTACH_INTERFACE to providers of network resources, or COMPUTE_VOLUME_EXTEND to providers of disk resources (even those are pretty awkward). But where would you put COMPUTE_TRUSTED_CERTS? And let's please not contrive non-resource resources to satisfy architectural purity.

Good point. I agree that COMPUTE_TRUSTED_CERTS logically belongs to the root RP of the compute node. So we have a specific use case: COMPUTE_TRUSTED_CERTS + NUMA requires placement to support traits on RPs that are not providing resources. Will in this case VCPU and MEMORY_MB always requested in the unnumbered group still? Or there are cases when those resources are requested in a numbered group? If the latter, then we have a use case to support request groups in the a_c query that only require a trait (COMPUTE_TRUSTED_CERTS) but do not require any resources.

Maybe the problem isn't "traits on RPs that are not providing resources"; the problem is that we don't count, in the compute node, the resources those traits are attached to. All those virt driver capability traits describe capabilities of the VM (or, say, VM resources). But we don't count the number of VMs, so we don't have inventory for the VM. The compute node uses the VM (or BM, but just use VM as the example here) to provide the compute resources. So I think it is OK to attach those traits to the RP which provides the VCPU resource.

I'm not sure I understand your proposal. Would you introduce a VM resource and then allocate 1 of that resource for each VM? Cheers, gibi

Ed Leafe

10 Apr

4:51 p.m.

On Apr 9, 2019, at 12:22 PM, Chris Dent <cdent+os@anticdent.org> wrote:

...

* (My brain hurts from the concept for a provider that provides nothing. Perhaps it provides something we aren't remembering to count?)

This was a big bone of contention back when Traits were first being discussed: they can only be applied to RPs. So if you had a single physical PCI device that provided virtual functions, and some of those VFs were, say, private net and the others were public, you had to create intermediary RPs between the PCI RP and the VFs so that you could tag those RPs with the correct trait to distinguish them. That brain hurt is very much why I argued against this restriction back then. But it's what we ended up with, so those "virtual" RPs are with us for good. If it helps, think of them as not providing resources, but providing other resource providers. Does that hurt a little less? -- Ed Leafe

Jay Pipes

22 Apr

7:23 p.m.

On 04/10/2019 12:51 PM, Ed Leafe wrote:

...

On Apr 9, 2019, at 12:22 PM, Chris Dent <cdent+os@anticdent.org> wrote:

...
* (My brain hurts from the concept for a provider that provides nothing. Perhaps it provides something we aren't remembering to count?)

This was a big bone of contention back when Traits were first being discussed: they can only be applied to RPs. So if you had a single physical PCI device that provided virtual functions, and some of those VFs were, say, private net and the others were public, you had to create intermediary RPs between the PCI RP and the VFs so that you could tag those RPs with the correct trait to distinguish them.

That brain hurt is very much why I argued against this restriction back then. But it's what we ended up with, so those "virtual" RPs are with us for good. If it helps, think of them as not providing resources, but providing other resource providers.

Does that hurt a little less?

Note that if we used a single "apply aggregate association and trait constraints in a self-and-children manner" policy, everything would be much easier to reason and think about. -jay

Alex Xu

12 Apr

10:17 a.m.

On Wed, Apr 10, 2019 at 1:31 AM, Chris Dent <cdent+os@anticdent.org> wrote:

...

From the etherpad [1] (the cross project one):

* This came up with the effort to add a multiattach capability filter, at https://review.openstack.org/#/c/645316/1/nova/compute/api.py@1098

* The problem: we want to be able to request allocation candidates filtered by a trait that exists on the root provider; but

* (a) the root provider may provide no resources (eventually - e.g. CPU/mem in NUMA providers, shared disk); and/or

* (My brain hurts from the concept for a provider that provides nothing. Perhaps it provides something we aren't remembering to count?)

* (b) we are using only numbered groups and would have to guess where to stuff the `required` - again probably based on looking for the VCPU/mem resources, which, as above, may wind up not on the root provider.

Yeah, just take a look at this spec: https://review.openstack.org/#/c/647578/4. After we have NUMA in Placement, the VCPU/mem will be in the child NUMA RPs, but those traits will still be on the root provider.

...

This, basically, is how/when do we deal with resource providers that have traits but have no classes of inventory of their own (all of it is on their descendants). The "My brain hurts" comment above is mine.

The discussion on the review above (good backgrounder for this thread) had little to do with the immediate need of the proposed feature. It was a concern about how things will work in the future. So there are two salient questions (and presumably plenty of other sidebars):

* How should it work?
* What is the timeline for when it needs to work?

[1] https://etherpad.openstack.org/p/ptg-train-xproj-nova-placement

-- Chris Dent ٩◔̯◔۶ https://anticdent.org/ freenode: cdent tw: @anticdent

Chris Dent

16 Apr

5:54 p.m.

On Tue, 9 Apr 2019, Chris Dent wrote:

...

* The problem: we want to be able to request allocation candidates filtered by a trait that exists on the root provider; but

* (a) the root provider may provide no resources (eventually - e.g. CPU/mem in NUMA providers, shared disk); and/or

* (My brain hurts from the concept for a provider that provides nothing. Perhaps it provides something we aren't remembering to count?)

* (b) we are using only numbered groups and would have to guess where to stuff the `required` - again probably based on looking for the VCPU/mem resources, which, as above, may wind up not on the root provider.

I've been trying to think about this (and related nested things) in a general way. In part to see if I can resolve my unease with them, but also to see if there are other angles at which we can come at the problems. Basically look around from all angles.

In [1] Tetsuro talks about "spanning" and how we have different spanning policies for different situations. Ed mentions "graphs" as a better model for providers, often.

How far away are we from being able to say that all attributes could be treated as spanning whether requests are expressed granularly or not? It seems we are closer in some areas, less so in others, and different depending on how we ask. If we made it true by default, in all contexts, would it help?

For example, if we said that "traits always flow down [4]" (the phrase that entered my brain and got me to start this email, "down" in this case is "in the direction of children") then some traits could be on the compute node, but expressed in a numbered request group if that happened to be more convenient.

This mental model works well for me, because nested often represents a _containing_ hierarchy [2]. If the "compute RP has no resources to give [...] but it's still the thing exposing traits we want to filter by" [3], if we make it so the children inherit those traits (because they have flowed down and the children are "inside" the thing) things feel a bit saner to me. Would be good if Eric were able to express in more detail why inherit feels "terrible" [3]. It could very well be.

Similarly, aggregate membership would flow down as well, because a child is always in its parent's aggregate too because it is inside its parent.

A numeric requiredN or member_ofN span would be capped by the resource provider that satisfied resourcesN.

As I'm writing this I'm feeling a big sense of "isn't this the obvious way?" and "what am I missing?" and "somewhere there is a DAG lover laughing". So:

* Where is it like this?
* Where is it not like this?
* What now?

We need to work out a consistent and relatively easy to explain mental model for this, because we need to be able to talk about it with other people without them being required to re-experience all the mental hurdles we are having to overcome.

[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-April/004909.htm...

[2] Where this might break down is with interactions with neutron if we're trying to get switches into the mix. However, if the port is considered effectively a part of the node, not really an issue. I think this is okay. A shared provider is one that we care about but is not "inside". Nested means nested.

[3] https://review.openstack.org/#/c/645316/1/nova/compute/api.py@1098

[4] A corollary could be "classes of inventory always flow up": If you need a SRIOV_NET_VF, this root resource provider can provide it because it has a great grandchild which has it.

--
Chris Dent ٩◔̯◔۶ https://anticdent.org/ freenode: cdent tw: @anticdent
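The "traits always flow down" idea can be sketched in a few lines. This is a toy illustration only, assuming a NUMA-style tree in which the root has a trait and no inventory; `PARENT`, `OWN_TRAITS`, `INVENTORY`, `effective_traits`, and `match_group` are hypothetical names, and nothing here reflects how placement is implemented.

```python
# Toy provider tree: child -> parent (None marks the root).
PARENT = {"compute": None, "numa0": "compute", "numa1": "compute"}

# Traits set directly on each provider.
OWN_TRAITS = {
    "compute": {"COMPUTE_TRUSTED_CERTS"},
    "numa0": set(),
    "numa1": set(),
}

# Inventory per provider; the root has none of its own.
INVENTORY = {
    "compute": {},
    "numa0": {"VCPU": 8, "MEMORY_MB": 16384},
    "numa1": {"VCPU": 8, "MEMORY_MB": 16384},
}


def effective_traits(name):
    """Own traits plus those of every ancestor ("traits flow down")."""
    traits = set()
    while name is not None:
        traits |= OWN_TRAITS[name]
        name = PARENT[name]
    return traits


def match_group(resources, required):
    """Providers that alone hold the resources and inherit the traits."""
    return [
        name
        for name, inventory in INVENTORY.items()
        if all(inventory.get(rc, 0) >= amount for rc, amount in resources.items())
        and set(required) <= effective_traits(name)
    ]


# A single group asking for VCPU plus a root-level trait now has matches.
print(match_group({"VCPU": 2}, ["COMPUTE_TRUSTED_CERTS"]))
# → ['numa0', 'numa1']
```

With inheritance, the group no longer needs to know which provider in the tree actually carries the trait; whether that is desirable is exactly the open question in the message above.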

Eric Fried

7:24 p.m.

...

I'm not sure I understand your proposal. Would you introduce a VM resource and then allocate 1 of that resource for each VM?

This has been proposed before, somewhere: translating max_instances_per_host to an inventory of resource class "VM" on the compute node RP, and including resources:VM=1 in every GET /a_c request. This would solve the class of use cases like:

...

...
So we have a specific use case: COMPUTE_TRUSTED_CERTS + NUMA

but wouldn't help us for:

...

So what we need to solve is: two (or more) sets of resources where the different sets require different, contradicting traits, in a setup where the trait is not on the RP where the resource inventory is.

compute RP
|
|
|____ OVS agent RP
|     | * CUSTOM_VNIC_TYPE_NORMAL
|     |
|     |___________ br-int dev RP
|                  * CUSTOM_PHYSNET_PHYSNET0
|                  * NET_BW_EGR_KILOBIT_PER_SEC: 1000
|
|____ SRIOV agent RP
      | * CUSTOM_VNIC_TYPE_DIRECT
      |
      |___________ esn1 dev RP
      |            * CUSTOM_PHYSNET_PHYSNET0
      |            * NET_BW_EGR_KILOBIT_PER_SEC: 10000
      |
      |___________ esn2 dev RP
                   * CUSTOM_PHYSNET_PHYSNET1
                   * NET_BW_EGR_KILOBIT_PER_SEC: 20000

Then having two neutron ports in a server create request:

* port-normal:
    "resource_request": {
        "resources": {
            orc.NET_BW_EGR_KILOBIT_PER_SEC: 1000},
        "required": ["CUSTOM_PHYSNET0", "CUSTOM_VNIC_TYPE_NORMAL"]
    }

* port-direct:
    "resource_request": {
        "resources": {
            orc.NET_BW_EGR_KILOBIT_PER_SEC: 2000},
        "required": ["CUSTOM_PHYSNET0", "CUSTOM_VNIC_TYPE_DIRECT"]
    }

...unless we contrive some inventory unit to put on the agent RPs. What would that be? VNIC? How would we know how many to create?

Interestingly, the above is closely approaching the space we're exploring for "subtree affinity". I'm wondering if there's a Unified Solution...
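The two-port example above can be walked through with a small sketch. It encodes the provider tree from the diagram as a plain dict and checks, for each port's request group, which single provider could satisfy it with and without inherited traits. The dict layout and the `traits_of` and `candidates` helpers are assumptions made for illustration. The sketch also assumes the `CUSTOM_PHYSNET0` in the port requests refers to the `CUSTOM_PHYSNET_PHYSNET0` trait in the tree, and it ignores existing allocations.

```python
# Hypothetical encoding of the tree: name -> (parent, traits, inventory).
TREE = {
    "compute": (None, set(), {}),
    "ovs-agent": ("compute", {"CUSTOM_VNIC_TYPE_NORMAL"}, {}),
    "br-int": ("ovs-agent", {"CUSTOM_PHYSNET_PHYSNET0"},
               {"NET_BW_EGR_KILOBIT_PER_SEC": 1000}),
    "sriov-agent": ("compute", {"CUSTOM_VNIC_TYPE_DIRECT"}, {}),
    "esn1": ("sriov-agent", {"CUSTOM_PHYSNET_PHYSNET0"},
             {"NET_BW_EGR_KILOBIT_PER_SEC": 10000}),
    "esn2": ("sriov-agent", {"CUSTOM_PHYSNET_PHYSNET1"},
             {"NET_BW_EGR_KILOBIT_PER_SEC": 20000}),
}


def traits_of(name, inherit):
    """Own traits, plus all ancestors' traits when inherit is True."""
    parent, traits, _ = TREE[name]
    result = set(traits)
    while inherit and parent is not None:
        parent, traits, _ = TREE[parent]
        result |= traits
    return result


def candidates(resources, required, inherit):
    """Providers that alone satisfy one numbered request group."""
    matches = []
    for name, (_, _, inventory) in TREE.items():
        has_resources = all(inventory.get(rc, 0) >= amount
                            for rc, amount in resources.items())
        if has_resources and required <= traits_of(name, inherit):
            matches.append(name)
    return matches


port_normal = ({"NET_BW_EGR_KILOBIT_PER_SEC": 1000},
               {"CUSTOM_PHYSNET_PHYSNET0", "CUSTOM_VNIC_TYPE_NORMAL"})
port_direct = ({"NET_BW_EGR_KILOBIT_PER_SEC": 2000},
               {"CUSTOM_PHYSNET_PHYSNET0", "CUSTOM_VNIC_TYPE_DIRECT"})

print(candidates(*port_normal, inherit=False))  # []
print(candidates(*port_normal, inherit=True))   # ['br-int']
print(candidates(*port_direct, inherit=True))   # ['esn1']
```

Without inheritance neither group has a candidate, because the VNIC type trait sits on an agent RP that has no inventory. With inheritance each group resolves to one device RP, and no extra inventory unit on the agent RPs is needed.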

...

For example, if we said that "traits always flow down [4]" (the phrase that entered my brain and got me to start this email, "down" in this case is "in the direction of children") then some traits could be on the compute node, but expressed in a numbered request group if that happened to be more convenient.

This mental model works well for me, because nested often represents a _containing_ hierarchy [2].

If the "compute RP has no resources to give [...] but it's still the thing exposing traits we want to filter by" [3], if we make it so the children inherit those traits (because they have flowed down and the children are "inside" the thing) things feel a bit saner to me. Would be good if Eric were able to express in more detail why inherit feels "terrible" [3]. It could very well be.

I also said "feels". I can't really explain it any better than I could explain why "using group numbers as values" gave me the ooks. And given we're coming up ugly with all other proposals, convince me that this one is practical and not fraught with peril and I'll quickly get over my discomfort. Right now I'm pretty close to that point because it elegantly solves both classes of problem described above, and I can't think of a way to break it that isn't ridiculously contrived.

It's possible we punted on it before because a) we didn't have the concrete use cases we have now; and b) it was going to be pretty tricky to implement. More on that below.

...

Similarly, aggregate membership would flow down as well, because a child is always in its parent's aggregate too because it is inside its parent.

This one I'm not so convinced about. Can we defer making changes here until we have similarly concrete use cases?

...

A numeric requiredN or member_ofN span would be capped by the resource provider that satisfied resourcesN.

Eh? I was following you up to this point. Do you just mean that we don't have to worry about ascending the tree looking for requiredN because the trait is implicitly on the provider with resourceN by virtue of being on its ancestor?

...

We need to work out a consistent and relatively easy to explain mental model for this, because we need to be able to talk about it with other people without them being required to re-experience all the mental hurdles we are having to overcome.

I think the hurdles are more around "why" and "are you sure you want to" - once we've made those decisions, IMO it can be understood fairly easily with one or both of "encapsulation" and "traits flow down" as you've explained them.

...

[4] A corollary could be "classes of inventory always flow up": If you need a SRIOV_NET_VF, this root resource provider can provide it because it has a great grandchild which has it.

This one bakes my noodle pretty good. I have a harder time visualizing how the above use cases are satisfied by walking backwards up the tree accumulating resources (and you have to accumulate the traits as well, right?) until I hit a point where I've gathered everything I need.

So I'll come down in favor of making "traits flow down" happen. Question is, how? (And I know we've talked about this before - maybe Queens+Denver?)

(A) In the database.
  (i) Any time a trait is added to a provider, we create records for same trait for all descendants.
  (ii) Need a data migration to bring existing data into conformance with ^
  (iii) When a trait is deleted from a provider, I assume we need to recursively delete it from all descendants. If you didn't want that, you'd have to go back and re-add it to the descendants you wanted it on.

Pros: Easy to do. We don't have to change any of the APIs' algorithms - they just work the way we want them to by virtue of the trait data being where we want it. Reporting (e.g. GET /rps and therefore CLI output) reflects "reality".
Cons: Irreversible. Not backward compatible. Can't do it in a microversion.

(B) In the algorithms.
  (i) GET /rps and GET /a_cs queries need JOINs I can't even begin to comprehend.
  (ii) Do we tweak the outputs (GET /rps response and GET /a_cs provider_summaries) to report the "inherited" traits as well?

Pros: Can do it in a microversion.
Cons: See "can't even begin to comprehend". Maybe I'm a dunce.

Perhaps this suggests a hybrid approach:

(C) Create a "ghost" table of inherited resource provider traits. If $old_microversion we ignore it; if $new_microversion we logically combine it with the existing rp traits table in all our queries.

Thoughts?

efried .
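Option (A) above can be sketched in a few lines, which also makes its "irreversible" downside visible. The `MaterializedTraits` class is a hypothetical stand-in for the traits tables, not placement code. Copying the parent's traits onto a newly created child is an extra assumption that the message does not spell out.

```python
class MaterializedTraits:
    """Hypothetical store that copies inherited traits onto descendants."""

    def __init__(self):
        self.children = {}  # provider name -> list of child names
        self.traits = {}    # provider name -> set of trait names

    def add_provider(self, name, parent=None):
        self.children[name] = []
        # Assumption: a new child starts with a copy of its parent's traits.
        self.traits[name] = set(self.traits[parent]) if parent else set()
        if parent:
            self.children[parent].append(name)

    def _subtree(self, name):
        yield name
        for child in self.children[name]:
            yield from self._subtree(child)

    def add_trait(self, name, trait):
        # (i) record the trait on the provider and on every descendant
        for rp in self._subtree(name):
            self.traits[rp].add(trait)

    def remove_trait(self, name, trait):
        # (iii) recursively delete the trait from all descendants
        for rp in self._subtree(name):
            self.traits[rp].discard(trait)


store = MaterializedTraits()
store.add_provider("compute")
store.add_provider("numa0", parent="compute")

store.add_trait("numa0", "CUSTOM_EXAMPLE")       # set directly on the child
store.add_trait("compute", "CUSTOM_EXAMPLE")     # later also set on the root
store.remove_trait("compute", "CUSTOM_EXAMPLE")  # removed from the root

print(store.traits["numa0"])  # set()
```

Once the copies exist, nothing distinguishes a trait the child had in its own right from one it inherited, so removing it from the root also removes it from the child. That is the "you'd have to go back and re-add it" case from (iii).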

Chris Dent

8:23 p.m.

On Tue, 16 Apr 2019, Eric Fried wrote:

...

Interestingly, the above is closely approaching the space we're exploring for "subtree affinity". I'm wondering if there's a Unified Solution...

Some kind of unified solution is what I'm trying to find too, mostly by noodling.

...

I also said "feels". I can't really explain it any better than I could explain why "using group numbers as values" gave me the ooks. And given we're coming up ugly with all other proposals, convince me that this one is practical and not fraught with peril and I'll quickly get over my discomfort. Right now I'm pretty close to that point because it elegantly solves both classes of problem described above, and I can't think of a way to break it that isn't ridiculously contrived.

My message isn't a proposal, it's a question to see if there is a proposal in there somewhere.

...

...
Similarly, aggregate membership would flow down as well, because a child is always in its parent's aggregate too because it is inside its parent.

This one I'm not so convinced about. Can we defer making changes here until we have similarly concrete use cases?

I mentioned it because Tetsuro is suggesting that we likely do or will have use cases, in thread http://lists.openstack.org/pipermail/openstack-discuss/2019-April/thread.htm... and if traits flow down and aggregate membership does not, then we don't have a grand unified theory and a big point of my message is gone.

...

...
A numeric requiredN or member_ofN span would be capped by the resource provider that satisfied resourcesN.

Eh? I was following you up to this point. Do you just mean that we don't have to worry about ascending the tree looking for requiredN because the trait is implicitly on the provider with resourceN by virtue of being on its ancestor?

No, I think I just got lost trying to explain myself and thinking about making sure that distinct number groups still manage to stay distinct. So let's go with "ignore that paragraph".

...

...
We need to work out a consistent and relatively easy to explain mental model for this, because we need to be able to talk about it with other people without them being required to re-experience all the mental hurdles we are having to overcome.

I think the hurdles are more around "why" and "are you sure you want to" - once we've made those decisions, IMO it can be understood fairly easily with one or both of "encapsulation" and "traits flow down" as you've explained them.

I think perhaps you've been in this too long to recognize how mind boggling some of the concepts (and the inconsistencies thereof) can be. If we are able to come up with not just a unified solution, but a unified theory that supports that solution, it's a huge win for those that follow us.

...

...
[4] A corollary could be "classes of inventory always flow up": If you need a SRIOV_NET_VF, this root resource provider can provide it because it has a great grandchild which has it.

This one bakes my noodle pretty good. I have a harder time visualizing how the above use cases are satisfied by walking backwards up the tree accumulating resources (and you have to accumulate the traits as well, right?) until I hit a point where I've gathered everything I need.

Classes flowing up is already true for unnumbered resources. If you don't specify which group you want it from, any child can provide it. That's what I was describing. The associated question would be whether numbered resources should do the same, which presumably they should not because that would pretty much violate the point of numbered, wouldn't it? Which raises the question: Why/How are traits and maybe aggregates different? I think the answer is the containership.
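The distinction drawn above between unnumbered and numbered resources can be illustrated with a toy check. The tree, the provider names and both functions are hypothetical. The sketch only models the point made in the message: an unnumbered request may take each resource class from any provider in the tree, while a numbered group has to be satisfied by one provider.

```python
# Hypothetical tree under one root: provider name -> inventory.
TREE = {
    "compute": {},
    "numa0": {"VCPU": 4, "MEMORY_MB": 4096},
    "pf0": {"SRIOV_NET_VF": 8},
}


def satisfies_unnumbered(tree, resources):
    """Each resource class may come from any provider in the tree."""
    return all(
        any(inv.get(rc, 0) >= amount for inv in tree.values())
        for rc, amount in resources.items()
    )


def providers_for_numbered(tree, resources):
    """Providers that hold every requested class by themselves."""
    return [
        name for name, inv in tree.items()
        if all(inv.get(rc, 0) >= amount for rc, amount in resources.items())
    ]


request = {"VCPU": 2, "SRIOV_NET_VF": 1}

print(satisfies_unnumbered(TREE, request))    # True
print(providers_for_numbered(TREE, request))  # []
```

The same request succeeds when unnumbered and finds no provider when numbered, which is why letting numbered resources "flow up" would defeat the purpose of numbering them.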

...

So I'll come down in favor of making "traits flow down" happen. Question is, how? (And I know we've talked about this before - maybe Queens+Denver?)

(A) In the database.

This feels like it misrepresents the reality and the relationships and the concept of _nested_. But the way the data is represented in the DB doesn't have to directly map to the meaning, so it might be a goer. A worthwhile member of the list of alternatives.

...

(B) In the algorithms. (i) GET /rps and GET /a_cs queries need JOINs I can't even begin to comprehend.

If we wanted to we could figure this out (or do a D as another way round). I think if we're going to commit to use a SQL db as the datastore, we can do quite a bit of work to improve our SQL, probably by writing some SQL, directly, in a manual test environment to find out the right queries (or just get a dump out of Jay's brain). Doing recursive or graph queries in tables is a thing people do, it can be looked up on stackoverflow, etc. That is, I think this is a surmountable problem and the organic approach we've had up to now may make it look harder than it might be. Of course I could be completely wrong about that. I haven't tried it yet, and I don't expect to have any chance before mid May.
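As a rough illustration of the "write some SQL, directly, in a manual test environment" idea, the sketch below uses Python's built-in `sqlite3` module and a recursive common table expression to compute inherited traits at query time. The two-table schema is a simplified assumption and not placement's actual schema. Whether the same query would be usable on the databases placement supports is not something this sketch establishes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical schema for illustration only.
    CREATE TABLE resource_providers (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        parent_id INTEGER REFERENCES resource_providers(id)
    );
    CREATE TABLE rp_traits (
        rp_id INTEGER NOT NULL REFERENCES resource_providers(id),
        trait TEXT NOT NULL
    );
    INSERT INTO resource_providers VALUES
        (1, 'compute', NULL),
        (2, 'numa0', 1),
        (3, 'numa1', 1);
    INSERT INTO rp_traits VALUES
        (1, 'COMPUTE_TRUSTED_CERTS'),
        (2, 'CUSTOM_NUMA_NODE_0');
""")

INHERITED_TRAITS = """
    WITH RECURSIVE lineage(rp_id, ancestor_id) AS (
        -- every provider counts as its own ancestor
        SELECT id, id FROM resource_providers
        UNION ALL
        -- climb one level towards the root
        SELECT l.rp_id, p.parent_id
        FROM lineage AS l
        JOIN resource_providers AS p ON p.id = l.ancestor_id
        WHERE p.parent_id IS NOT NULL
    )
    SELECT DISTINCT rp.name, t.trait
    FROM lineage AS l
    JOIN rp_traits AS t ON t.rp_id = l.ancestor_id
    JOIN resource_providers AS rp ON rp.id = l.rp_id
    ORDER BY rp.name, t.trait
"""

rows = conn.execute(INHERITED_TRAITS).fetchall()
for name, trait in rows:
    print(name, trait)
# compute COMPUTE_TRUSTED_CERTS
# numa0 COMPUTE_TRUSTED_CERTS
# numa0 CUSTOM_NUMA_NODE_0
# numa1 COMPUTE_TRUSTED_CERTS
```

The stored trait rows stay as they are; inheritance exists only in the query, which is the property that would let approach (B) sit behind a microversion.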

...

(ii) Do we tweak the outputs (GET /rps response and GET /a_cs provider_summaries) to report the "inherited" traits as well?

That seems wrong to me. a) we don't list anything but uuid, name and generation on /rps, b) in /a_cs the meaningful thing is the groups of providers. That inheritance satisfied a trait isn't meaningful, it's that something in the group did, isn't it? Plus, we don't want to misrepresent the reality of traits back to, for example, the weighers in nova-scheduler.

...

Perhaps this suggests a hybrid approach:

(C) Create a "ghost" table of inherited resource provider traits. If $old_microversion we ignore it; if $new_microversion we logically combine it with the existing rp traits table in all our queries.

This sounds a lot like a 'view', in which case if a view is possible, then B is possible. Also, a ghost table sounds a whole lot like a cache and makes me scared (as a ghost should I suppose) and: https://twitter.com/jaypipes/status/1113457517538021376

(D) Switch to a graph db.

I think this is an important avenue of exploration for greenfield deployments but pretty much no good for existing deployments given the historical shyness about compatibility, migrations, etc. (as well as fatigue that people seem to have from being compelled to do a migration to get existing placement). However, the exploration could very well reveal some techniques that are transferable.

Thanks for making a list of suggestions. Having ideas to compare is awesome. I think the immediate next step is continue sharing and comparing.

--
Chris Dent ٩◔̯◔۶ https://anticdent.org/
freenode: cdent tw: @anticdent

Eric Fried

10:07 p.m.

...

...
(B) In the algorithms. (i) GET /rps and GET /a_cs queries need JOINs I can't even begin to comprehend.

If we wanted to we could figure this out (or do a D as another way round).

Sure, for some values of "we". In relative terms, I have "high confidence" that I could write (and maintain!) the code for (A) or (C); but "low confidence" that I could do the same for (B) or (D).

Now, does "traits flow down" help us at all with affinity? I'm gonna say... not really. It would make it easy for us to tag NUMA nodes with a *specific* trait (CUSTOM_NUMA_NODE_0) and affine multiple request groups to that *specific* NUMA node. But then we need multiple permutations of each request, which feels horrible (with same caveat as before about horrible feeling).

efried .

Alex Xu

17 Apr

9:16 a.m.

On Wed, Apr 17, 2019 at 3:28 AM, Eric Fried <openstack@fried.cc> wrote:

...

...
I'm not sure I understand your proposal. Would you introduce a VM resource and then allocate 1 of that resource for each VM?

This has been proposed before, somewhere: translating max_instances_per_host to an inventory of resource class "VM" on the compute node RP, and including resources:VM=1 in every GET /a_c request.

Actually, I propose attaching the traits to the RP that has the VCPU resource. In the case where we have NUMA in placement, we will attach the traits to the NUMA node RPs. Let me try to explain why that may make sense: those traits should be attached to the "Compute" resource, and VCPU is just that "Compute" resource. (Yes, if we have two NUMA nodes then both NUMA nodes have those traits, but that should be fine.) When we ask for those traits, we must also be asking for VCPU, right? If so, it sounds like it makes sense. Or is there any case where we request only a trait and no resources at all? If yes, this may not work. But I think the case we started discussing is the one where the trait and the resource aren't on the same RP. For the neutron bandwidth case, we still attach the NIC type trait to the PF, not the agent, for the same reason.

...

This would solve the class of use cases like:

...
...
So we have a specific use case: COMPUTE_TRUSTED_CERTS + NUMA

but wouldn't help us for:

...
So what we need to solve is: two (or more) sets of resources where the different sets require different, contradicting traits, in a setup where the trait is not on the RP where the resource inventory is.

compute RP
|
|
|____ OVS agent RP
|     | * CUSTOM_VNIC_TYPE_NORMAL
|     |
|     |___________ br-int dev RP
|                  * CUSTOM_PHYSNET_PHYSNET0
|                  * NET_BW_EGR_KILOBIT_PER_SEC: 1000
|
|____ SRIOV agent RP
      | * CUSTOM_VNIC_TYPE_DIRECT
      |
      |___________ esn1 dev RP
      |            * CUSTOM_PHYSNET_PHYSNET0
      |            * NET_BW_EGR_KILOBIT_PER_SEC: 10000
      |
      |___________ esn2 dev RP
                   * CUSTOM_PHYSNET_PHYSNET1
                   * NET_BW_EGR_KILOBIT_PER_SEC: 20000

Then having two neutron ports in a server create request:

* port-normal:
    "resource_request": {
        "resources": {
            orc.NET_BW_EGR_KILOBIT_PER_SEC: 1000},
        "required": ["CUSTOM_PHYSNET0", "CUSTOM_VNIC_TYPE_NORMAL"]
    }

* port-direct:
    "resource_request": {
        "resources": {
            orc.NET_BW_EGR_KILOBIT_PER_SEC: 2000},
        "required": ["CUSTOM_PHYSNET0", "CUSTOM_VNIC_TYPE_DIRECT"]
    }

...unless we contrive some inventory unit to put on the agent RPs. What would that be? VNIC? How would we know how many to create?

Interestingly, the above is closely approaching the space we're exploring for "subtree affinity". I'm wondering if there's a Unified Solution...

Yeah, if we are going to create a virtual resource 'VM', then we need 'VNIC', and then we need more. I don't like that.

...

...
For example, if we said that "traits always flow down [4]" (the phrase that entered my brain and got me to start this email, "down" in this case is "in the direction of children") then some traits could be on the compute node, but expressed in a numbered request group if that happened to be more convenient.

This mental model works well for me, because nested often represents a _containing_ hierarchy [2].

If the "compute RP has no resources to give [...] but it's still the thing exposing traits we want to filter by" [3], if we make it so the children inherit those traits (because they have flowed down and the children are "inside" the thing) things feel a bit saner to me. Would be good if Eric were able to express in more detail why inherit feels "terrible" [3]. It could very well be.

I also said "feels". I can't really explain it any better than I could explain why "using group numbers as values" gave me the ooks. And given we're coming up ugly with all other proposals, convince me that this one is practical and not fraught with peril and I'll quickly get over my discomfort. Right now I'm pretty close to that point because it elegantly solves both classes of problem described above, and I can't think of a way to break it that isn't ridiculously contrived.

It's possible we punted on it before because a) we didn't have the concrete use cases we have now; and b) it was going to be pretty tricky to implement. More on that below.

...
Similarly, aggregate membership would flow down as well, because a child is always in its parent's aggregate too because it is inside its parent.

This one I'm not so convinced about. Can we defer making changes here until we have similarly concrete use cases?

...
A numeric requiredN or member_ofN span would be capped by the resource provider that satisfied resourcesN.

Eh? I was following you up to this point. Do you just mean that we don't have to worry about ascending the tree looking for requiredN because the trait is implicitly on the provider with resourceN by virtue of being on its ancestor?

...
We need to work out a consistent and relatively easy to explain mental model for this, because we need to be able to talk about it with other people without them being required to re-experience all the mental hurdles we are having to overcome.

I think the hurdles are more around "why" and "are you sure you want to" - once we've made those decisions, IMO it can be understood fairly easily with one or both of "encapsulation" and "traits flow down" as you've explained them.

...
[4] A corollary could be "classes of inventory always flow up": If you need a SRIOV_NET_VF, this root resource provider can provide it because it has a great grandchild which has it.

This one bakes my noodle pretty good. I have a harder time visualizing how the above use cases are satisfied by walking backwards up the tree accumulating resources (and you have to accumulate the traits as well, right?) until I hit a point where I've gathered everything I need.

So I'll come down in favor of making "traits flow down" happen. Question is, how? (And I know we've talked about this before - maybe Queens+Denver?)

(A) In the database. (i) Any time a trait is added to a provider, we create records for same trait for all descendants. (ii) Need a data migration to bring existing data into conformance with ^ (iii) When a trait is deleted from a provider, I assume we need to recursively delete it from all descendants. If you didn't want that, you'd have to go back and re-add it to the descendants you wanted it on.

Pros: Easy to do. We don't have to change any of the APIs' algorithms - they just work the way we want them to by virtue of the trait data being where we want it. Reporting (e.g. GET /rps and therefore CLI output) reflects "reality". Cons: Irreversible. Not backward compatible. Can't do it in a microversion.

(B) In the algorithms. (i) GET /rps and GET /a_cs queries need JOINs I can't even begin to comprehend. (ii) Do we tweak the outputs (GET /rps response and GET /a_cs provider_summaries) to report the "inherited" traits as well?

Pros: Can do it in a microversion. Cons: See "can't even begin to comprehend". Maybe I'm a dunce.

Perhaps this suggests a hybrid approach:

(C) Create a "ghost" table of inherited resource provider traits. If $old_microversion we ignore it; if $new_microversion we logically combine it with the existing rp traits table in all our queries.

Thoughts?

efried .

19 comments

participants (6)

Alex Xu
Balázs Gibizer
Chris Dent
Ed Leafe
Eric Fried
Jay Pipes