[nova] Spec: Standardize CPU resource tracking
Hi All,

Currently I am working on the implementation of the CPU pinning upgrade part mentioned in the spec [1]. While implementing the scheduler pre-filter described in [1], I have encountered one big issue.

Proposed change in the spec: in the scheduler pre-filter we are going to alias ``hw:cpu_policy`` in request_spec.flavor.extra_specs and request_spec.image.properties to ``resources=(V|P)CPU:${flavor.vcpus}`` for existing instances. So when a user creates a new instance or executes instance actions like shelve, unshelve, resize, evacuate and migration after the upgrade, the request will go through the scheduler pre-filter, which sets the alias for ``hw:cpu_policy`` in the request_spec flavor extra specs and image metadata properties. In the following particular case, it won't work.

For example, I have two compute nodes, say A and B.

On Stein:

Compute node A configuration:
vcpu_pin_set=0-3 (used for dedicated CPUs; this host is in an aggregate with "pinned" metadata)

Compute node B configuration:
vcpu_pin_set=0-3 (used for dedicated CPUs; this host is in an aggregate with "pinned" metadata)

On Train, two possible scenarios (assuming the new CPU pinning implementation is merged into Train):

Compute node A configuration:
vcpu_pin_set=0-3 (same settings as in Stein)

Compute node B configuration:
cpu_dedicated_set=0-3 (changed to the new config option)

1. Consider that an instance, say `test`, is created using a flavor with the old extra specs (hw:cpu_policy=dedicated, "aggregate_instance_extra_specs:pinned": "true") on the Stein release, and Nova is then upgraded to Train with the above configuration.
2. Now, when the user performs an instance action such as shelve/unshelve, the scheduler pre-filter will change the request_spec flavor extra spec from ``hw:cpu_policy`` to ``resources=PCPU:$<no. of cpus>``, which ultimately returns only compute node B from the placement service. Here, we expect it to return both compute node A and compute node B.
3. If the user creates a new instance using the old extra specs (hw:cpu_policy=dedicated, "aggregate_instance_extra_specs:pinned": "true") on the Train release with the above configuration, placement will again return only compute node B, whereas it should have returned both compute nodes A and B.

Problem: although compute node A is still configured to boot instances with dedicated CPUs, with the same behavior as in Stein, it will not be returned by the placement service due to the change in the scheduler pre-filter logic.

Proposed changes:

Earlier in the spec [2], an online data migration was proposed to change the flavor extra specs and image metadata properties of the request_spec and instance objects. Based on the instance host we can get the NUMA topology of the host, which contains the new configuration options set on the compute host. Based on that host NUMA topology, we can change the instance and request_spec flavor extra specs:

1. Remove cpu_policy from the extra specs.
2. Add "resources:PCPU=<count>" to the extra specs.

We could also change the flavor extra specs and image metadata properties of the instance and request_spec objects using the reshape functionality.

Please give us your feedback on the proposed solution so that we can update the spec accordingly.

[1]: https://review.opendev.org/#/c/555081/28/specs/train/approved/cpu-resources....
[2]: https://review.opendev.org/#/c/555081/23..28/specs/train/approved/cpu-resour...
Thanks and Regards,
-Bhagyashri Shewale-
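For concreteness, here is a rough, hypothetical Python sketch of the pre-filter aliasing described in the email above; the function name and structure are illustrative only and are not nova's actual code:

# Illustrative only: maps the legacy pinning extra spec to the new
# placement resource-class syntax described in the spec discussion.
def alias_cpu_policy(extra_specs, image_props, vcpus):
    """Return the resources:* extra spec the pre-filter would add."""
    policy = extra_specs.get('hw:cpu_policy') or image_props.get('hw_cpu_policy')
    if policy == 'dedicated':
        # Pinned instances ask for PCPU instead of VCPU.
        return {'resources:PCPU': vcpus}
    # Everything else keeps requesting floating CPUs.
    return {'resources:VCPU': vcpus}

# Example: the `test` instance from the scenario above (4 vCPUs).
print(alias_cpu_policy({'hw:cpu_policy': 'dedicated',
                        'aggregate_instance_extra_specs:pinned': 'true'},
                       {}, 4))
# -> {'resources:PCPU': 4}, so placement only returns hosts reporting PCPU,
#    which is why compute node A (still on vcpu_pin_set) is filtered out.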
Hi All,

After revisiting the spec [1] again and again, I got to know a few points; please check and let me know whether my understanding is correct.

Understanding: if ``vcpu_pin_set`` is set on compute node A in the Stein release, then we can say that this node is used to host dedicated instances, and if the user upgrades from Stein to Train and the operator doesn't define ``[compute] cpu_dedicated_set``, then we simply fall back to ``vcpu_pin_set`` and report it as PCPU inventory.

Considering the multiple combinations of the various configuration options, I think we will need to implement the business rules below so that the scheduler pre-filter issue highlighted in the previous email can be solved.

Rule 1: if the operator sets ``[compute] cpu_shared_set`` in Train.
1. If pinned instances are found, then we can simply say that this compute node was used as a dedicated host in the previous release, so raise an error that says to set the ``[compute] cpu_dedicated_set`` config option; otherwise report the CPUs as VCPU inventory.

Rule 2: if the operator sets ``[compute] cpu_dedicated_set`` in Train.
1. Report the inventory as PCPU.
2. If instances are found, check the host NUMA topology pinned_cpus; if pinned_cpus is not empty, that means this compute node was used as a dedicated host in the previous release, and if it is empty, raise an error saying that this compute node was used as a shared compute node in the previous release.

Rule 3: if the operator sets none of the options (``[compute] cpu_dedicated_set``, ``[compute] cpu_shared_set``, ``vcpu_pin_set``) in Train.
1. If instances are found, check the host NUMA topology pinned_cpus; if pinned_cpus is not empty, raise an error saying that this compute node was used as a dedicated compute node in the previous release, so ``[compute] cpu_dedicated_set`` should be set; otherwise report the inventory as VCPU.
2. If there are no instances, report the inventory as VCPU.

Rule 4: if the operator sets the ``vcpu_pin_set`` config option in Train.
1. If instances are found, check the host NUMA topology pinned_cpus; if pinned_cpus is empty, that means this compute node was used for non-pinned instances in the previous release, so raise an error; otherwise report the CPUs as PCPU inventory.
2. If there are no instances, report the inventory as PCPU.

Rule 5: if the operator sets ``vcpu_pin_set`` and ``[compute] cpu_dedicated_set`` or ``[compute] cpu_shared_set`` in Train.
1. Simply raise an error.

Business rules 3 and 4 above are very important in order to solve the scheduler pre-filter issue highlighted in my previous email.

As of today, whether ``vcpu_pin_set`` is set or not on a compute node, the node can be used for both pinned and non-pinned instances depending on whether it belongs to an aggregate with "pinned" metadata. But as per business rule 3, if ``vcpu_pin_set`` is not set, we would consider the node to be used for non-pinned instances only. Do you think this could cause an issue for backward compatibility?

Please provide your suggestions on the above business rules.

[1]: https://review.opendev.org/#/c/555081/28/specs/train/approved/cpu-resources....

Thanks and Regards,
-Bhagyashri Shewale-
On Thu, 2019-06-13 at 04:42 +0000, Shewale, Bhagyashri wrote:
Hi All,
After revisiting the spec [1] again and again, I got to know a few points; please check and let me know whether my understanding is correct:
Understanding: if ``vcpu_pin_set`` is set on compute node A in the Stein release, then we can say that this node is used to host dedicated instances, and if the user upgrades from Stein to Train and the operator doesn't define ``[compute] cpu_dedicated_set``, then we simply fall back to ``vcpu_pin_set`` and report it as PCPU inventory.
That is incorrect: if vcpu_pin_set is defined, it may be used for instances with hw:cpu_policy=dedicated or not. In Train, if vcpu_pin_set is defined and cpu_dedicated_set is not defined, then we use vcpu_pin_set to define the inventory of both PCPUs and VCPUs.
Considering multiple combinations of various configuration options, I think we will need to implement below business rules so that the issue highlighted in the previous email about the scheduler pre-filter can be solved.
Rule 1:
If operator sets ``[compute] cpu_shared_set`` in Train.
1. If pinned instances are found, then we can simply say that this compute node was used as a dedicated host in the previous release, so raise an error that says to set the ``[compute] cpu_dedicated_set`` config option; otherwise report it as VCPU inventory.
cpu_shared_set in Stein was used for VM emulator threads and required the instance to be pinned for it to take effect, i.e. the hw:emulator_thread_policy extra spec currently only works if you have hw:cpu_policy=dedicated. So we should not error if vcpu_pin_set and cpu_shared_set are both defined; that was valid. What we can do is ignore cpu_shared_set for scheduling, not report 0 VCPUs for this host, and use vcpu_pin_set as PCPUs.
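As an illustration of this point, a flavor along these lines (values are only an example) is the case being described: hw:emulator_thread_policy only takes effect because the instance is also pinned, and the emulator thread then floats over the cores in [compute] cpu_shared_set:

# Illustrative flavor extra specs only: in Stein, hw:emulator_thread_policy
# has an effect solely when hw:cpu_policy=dedicated is also set; the emulator
# thread is then placed on the cores listed in [compute] cpu_shared_set.
emulator_offload_extra_specs = {
    'hw:cpu_policy': 'dedicated',
    'hw:emulator_thread_policy': 'share',
}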
Rule 2:
If operator sets ``[compute] cpu_dedicated_set`` in Train.
1. Report inventory as PCPU
Yes, if cpu_dedicated_set is set we will report its value as PCPUs.
2. If instances are found, check for host numa topology pinned_cpus, if pinned_cpus is not empty, that means this compute node is used as dedicated in the previous release and if empty, then raise an error that this compute node is used as shared compute node in previous release.
This was not part of the spec. We could do this, but I think it's not needed and operators should check this themselves. If we decide to do this check on startup, it should only happen if vcpu_pin_set is defined. Additionally, we can log an error, but we should not prevent the compute node from working and continuing to spawn VMs.
Rule 3:
If operator sets None of the options (``[compute] cpu_dedicated_set``, ``[compute] cpu_shared_set``, ``vcpu_pin_set``) in Train.
1. If instances are found, check for host numa topology pinned_cpus, if pinned_cpus is not empty, then raise an error that this compute node is used as dedicated compute node in previous release so set ``[compute] cpu_dedicated_set``, otherwise report inventory as VCPU.
Again, this is not in the spec and I don't think we should do this. If none of the values are set we should report all CPUs as both VCPUs and PCPUs. The vcpu_pin_set option was never intended to signal that a host was used for CPU pinning; it was introduced alongside CPU pinning and NUMA affinity, but it was originally meant to apply to floating instances too, and it currently controls the number of VCPUs reported to the resource tracker, which is used to set the capacity of the VCPU inventory. You should read https://that.guru/blog/cpu-resources/ for a walkthrough of this.
2. If no instances, report inventory as VCPU.
We could do this, but I think it will be confusing as to what happens after we spawn an instance on the host in Train. I don't think this logic should be conditional on the presence of VMs.
Rule 4:
If operator sets ``vcpu_pin_set`` config option in Train.
1. If instances are found, check for host numa topology pinned_cpus, if pinned_cpus is empty, that means this compute node is used for non-pinned instances in the previous release, so raise an error otherwise report it as PCPU inventory.
Again, this is not in the spec. What the spec says is that if vcpu_pin_set is defined, we will report inventories of both VCPU and PCPU for all CPUs in the vcpu_pin_set.
2. If no instances, report inventory as PCPU.
Again, this should not be conditional on the presence of VMs.
Rule 5:
If operator sets ``vcpu_pin_set`` and ``[compute] cpu_dedicated_set`` or ``[compute] cpu_shared_set`` config options in Train
1. Simply raise an error
This is the only case where we raise an error and refuse to start the compute node.
Above business rules 3 and 4 are very important in order to solve the scheduler pre-filter issue highlighted in my previous email.
We explicitly do not want to have the behavior in 3 and 4, specifically the logic of checking the instances.
As of today, whether ``vcpu_pin_set`` is set or not set on the compute node, it can be used for both pinned and non-pinned instances depending on whether this host belongs to an aggregate with "pinned" metadata. But as per business rule #3, if ``vcpu_pin_set`` is not set, we are considering it to be used for non-pinned instances only. Do you think this could cause an issue in providing backward compatibility?
Yes, the rule you have listed above will cause issues for upgrades, and we rejected similar rules in the spec. I have not read your previous email, which I'll look at next, but we spent a long time debating how this should work in the spec design and I would prefer to stick to what the spec currently states.
Please provide your suggestions on the above business rules.
[1]: https://review.opendev.org/#/c/555081/28/specs/train/approved/cpu-resources....
Thanks and Regards,
-Bhagyashri Shewale-
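Pulling Sean's points above together (cpu_dedicated_set -> PCPU, cpu_shared_set -> VCPU, vcpu_pin_set or nothing -> both), a rough, hypothetical sketch of the inventory decision could look like the following; it deliberately skips the combined edge cases still being debated in this thread and is not nova's actual implementation:

# Rough sketch only: which CPU resource classes a Train compute node would
# report, based on Sean's description of the spec. Each *_set argument is a
# parsed set of host CPU IDs, or None if the option is unset.
def cpu_inventory(cpu_dedicated_set=None, cpu_shared_set=None,
                  vcpu_pin_set=None, all_host_cpus=None):
    inventory = {}
    if cpu_dedicated_set:
        inventory['PCPU'] = cpu_dedicated_set
    if cpu_shared_set:
        inventory['VCPU'] = cpu_shared_set
    if not cpu_dedicated_set and not cpu_shared_set:
        # Legacy/upgrade case: vcpu_pin_set (or, failing that, every host
        # CPU) is reported as both VCPU and PCPU inventory.
        cpus = vcpu_pin_set or all_host_cpus
        inventory['VCPU'] = cpus
        inventory['PCPU'] = cpus
    return inventory

# Compute node A from the earlier example: vcpu_pin_set=0-3, nothing else.
print(cpu_inventory(vcpu_pin_set={0, 1, 2, 3}))
# -> {'VCPU': {0, 1, 2, 3}, 'PCPU': {0, 1, 2, 3}}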
cpu_shared_set in Stein was used for VM emulator threads and required the instance to be pinned for it to take effect, i.e. the hw:emulator_thread_policy extra spec currently only works if you have hw:cpu_policy=dedicated. So we should not error if vcpu_pin_set and cpu_shared_set are both defined; that was valid. What we can do is ignore cpu_shared_set for scheduling, not report 0 VCPUs for this host, and use vcpu_pin_set as PCPUs.
Thinking of backward compatibility, I agree both of these configuration options, ``cpu_shared_set`` and ``vcpu_pin_set``, should be allowed in the Train release as well.

A few possible combinations in Train:

A) What if only ``cpu_shared_set`` is set on a new compute node? Report VCPU inventory.

B) What if ``cpu_shared_set`` and ``cpu_dedicated_set`` are set on a new compute node? Report VCPU and PCPU inventory. In fact, we want to support both these options so that an instance can request both VCPU and PCPU at the same time. If the flavor requests VCPU, or hw:emulator_thread_policy=share, then in both cases it will float on the CPUs set in the ``cpu_shared_set`` config option.

C) What if ``cpu_shared_set`` and ``vcpu_pin_set`` are set on a new compute node? Ignore cpu_shared_set and report vcpu_pin_set as VCPU or PCPU?

D) What if ``cpu_shared_set`` and ``vcpu_pin_set`` are set on an upgraded compute node? As you have mentioned, ignore cpu_shared_set and report vcpu_pin_set as PCPUs provided the host ``NumaTopology`` ``pinned_cpus`` attribute is not empty, otherwise as VCPUs.
We explicitly do not want to have the behavior in 3 and 4, specifically the logic of checking the instances.
Here we are checking the host ``NumaTopology`` ``pinned_cpus`` attribute and not directly the instances (if that attribute is not empty, that means some instances are running), and this logic will be needed to address case D above.

Regards,
-Bhagyashri Shewale-
[Cleaning up the 'To' field since Jay isn't working on OpenStack anymore and everyone else is on openstack-discuss already] On Fri, 2019-06-14 at 08:35 +0000, Shewale, Bhagyashri wrote:
cpu_shared_set in Stein was used for VM emulator threads and required the instance to be pinned for it to take effect, i.e. the hw:emulator_thread_policy extra spec currently only works if you have hw:cpu_policy=dedicated, so we should not error if vcpu_pin_set and cpu_shared_set are defined; that was valid. What we can do is ignore cpu_shared_set for scheduling, not report 0 VCPUs for this host, and use vcpu_pin_set as PCPUs.
Thinking of backward compatibility, I agree both of these configuration options, ``cpu_shared_set`` and ``vcpu_pin_set``, should be allowed in the Train release as well.
A few possible combinations in Train: A) What if only ``cpu_shared_set`` is set on a new compute node? Report VCPU inventory.
I think this is _very_ unlikely to happen in the real world since the lack of a 'vcpu_pin_set' option means an instance's pinned CPUs could co-exist on the same cores as the emulator threads, which defeats the whole point of placing emulator threads on a separate core. That said, it's possible so we do have to deal with it. Ignore 'cpu_shared_set' in this case and issue a warning saying that the user has to configure 'cpu_dedicated_set'.
B) what if ``cpu_shared_set`` and ``cpu_dedicated_set`` are set on a new compute node? Report VCPU and PCPU inventory. In fact, we want to support both these options so that instance can request both VCPU and PCPU at the same time. If flavor requests VCPU or hw:emulator_thread_policy=share, in both the cases, it will float on CPUs set in ``cpu_shared_set`` config option.
We should report both VCPU and PCPU inventory, yes. However, please don't add the ability to create a single instance with combined VCPU and PCPU inventory. I dropped this from the spec intentionally to make it easier for something (_anything_) to land. We can iterate on this once we have the basics done.
C) What if ``cpu_shared_set`` and ``vcpu_pin_set`` are set on a new compute node? Ignore cpu_shared_set and report vcpu_pinned_set as VCPU or PCPU?
As above, ignore 'cpu_shared_set' but issue a warning. Use the value of 'vcpu_pin_set' to report both VCPU and PCPU inventory. Note that 'vcpu_pin_set' is already used to calculate VCPU inventory. https://opendev.org/openstack/nova/src/branch/master/nova/virt/libvirt/drive...
D) What if ``cpu_shared_set`` and ``vcpu_pin_set`` are set on an upgraded compute node? As you have mentioned, ignore cpu_shared_set and report vcpu_pin_set as PCPUs provided the ``NumaTopology`` ``pinned_cpus`` attribute is not empty, otherwise as VCPUs.
Ignore 'cpu_shared_set' but issue a warning. Use the value of 'vcpu_pin_set' to report both VCPU and PCPU inventory. Note that 'vcpu_pin_set' is already used to calculate VCPU inventory.
We explicitly do not want to have the behavior in 3 and 4, specifically the logic of checking the instances.
Here we are checking the host ``NumaTopology`` ``pinned_cpus`` attribute and not directly the instances (if that attribute is not empty, that means some instances are running), and this logic will be needed to address case D above.
You shouldn't need to do this. Rely solely on configuration options to determine inventory, even if it means reporting more inventory than we actually have (reporting a host core as both units of VCPU and PCPU), and hope that operators have correctly used host aggregates to isolate NUMA-based instances from non-NUMA-based instances. I realize this is very much in flux but could you please push what you have up for review, marked as WIP or such. Debating this stuff in the code might be easier.

Stephen
As above, ignore 'cpu_shared_set' but issue a warning. Use the value of 'vcpu_pin_set' to report both VCPU and PCPU inventory. Note that 'vcpu_pin_set' is already used to calculate VCPU inventory.
As mentioned in the spec, if the operator sets ``vcpu_pin_set`` in Stein and upgrades to Train, then both VCPU and PCPU inventory should be reported in placement.

On current master (Stein), if the operator sets ``vcpu_pin_set=0-3`` on compute node A and adds that node into a host aggregate, say "agg1", with metadata ``pinned=true``, then it is possible to create both pinned and non-pinned instances on it, which is a known big issue:

1. Create instance A with flavor extra specs ("aggregate_instance_extra_specs:pinned": "true"); instance A will float on CPUs 0-3.
2. Create instance B with flavor extra specs ("aggregate_instance_extra_specs:pinned": "true", "hw:cpu_policy": "dedicated"); instance B will be pinned to one of the CPUs, say 0.

Now the operator does the upgrade (Stein to Train) and nova-compute reports both VCPU and PCPU inventory. In this case, if cpu_allocation_ratio is 1, the total PCPU available will be 4 (vcpu_pin_set=0-3) and VCPU will also be 4. This will allow the user to create a maximum of 4 instances with flavor extra spec ``resources:PCPU=1`` and 4 instances with flavor extra spec ``resources:VCPU=1``.

With the current master code it's possible to create only 4 instances, whereas now, by reporting both VCPU and PCPU, it will allow the user to create a total of 8 instances, which adds another level of problem on top of the existing known issue. Is this acceptable? Because this is compounding the problems.

If it is not acceptable, then we could report only PCPU in this case, which would solve two problems:

1. The existing known issue on current master (allowing both pinned and non-pinned instances) on a compute host meant for pinning.
2. The above issue of allowing 8 instances to be created on the host.

But there is one problem with taking this decision: if no instances are running on the compute node and only ``vcpu_pin_set`` is set, how do you find out whether this compute node is configured to create pinned or non-pinned instances? If instances are running, it's possible to detect that based on the host numa_topology.pinned_cpus.

Regards,
Bhagyashri Shewale
On Tue, 2019-06-18 at 06:41 +0000, Shewale, Bhagyashri wrote:
As above, ignore 'cpu_shared_set' but issue a warning. Use the value of 'vcpu_pin_set' to report both VCPU and PCPU inventory. Note that 'vcpu_pin_set' is already used to calculate VCPU inventory.
As mentioned in the spec, if the operator sets ``vcpu_pin_set`` in Stein and upgrades to Train, then both VCPU and PCPU inventory should be reported in placement.
On current master (Stein), if the operator sets ``vcpu_pin_set=0-3`` on compute node A and adds that node into a host aggregate, say "agg1", with metadata ``pinned=true``, then it is possible to create both pinned and non-pinned instances on it, which is a known big issue. Create instance A with flavor extra specs ("aggregate_instance_extra_specs:pinned": "true"); instance A will float on CPUs 0-3. Create instance B with flavor extra specs ("aggregate_instance_extra_specs:pinned": "true", "hw:cpu_policy": "dedicated"); instance B will be pinned to one of the CPUs, say 0. Now the operator does the upgrade (Stein to Train) and nova-compute will report both VCPU and PCPU inventory. In this case, if cpu_allocation_ratio is 1, the total PCPU available will be 4 (vcpu_pin_set=0-3) and VCPU will also be 4. This will allow the user to create a maximum of 4 instances with flavor extra spec ``resources:PCPU=1`` and 4 instances with flavor extra spec ``resources:VCPU=1``.
If the cpu_allocation_ratio is 1.0 then yes, this is correct. However, if it's any greater (and remember, the default is 16.0) then the gap is much smaller, though still broken.
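For concreteness, the capacities being discussed work out as below (illustrative arithmetic only, assuming VCPU inventory is scaled by cpu_allocation_ratio while PCPU is not):

# 4 host cores, vcpu_pin_set=0-3, reported as both VCPU and PCPU inventory.
host_cpus = 4
for ratio in (1.0, 16.0):  # cpu_allocation_ratio: the example value and the default
    vcpu_capacity = int(host_cpus * ratio)
    pcpu_capacity = host_cpus  # dedicated CPUs are not overcommitted
    print(ratio, vcpu_capacity, pcpu_capacity, vcpu_capacity + pcpu_capacity)
# ratio 1.0  -> 4 VCPU + 4 PCPU = 8 schedulable guest CPUs on 4 host cores
# ratio 16.0 -> 64 VCPU + 4 PCPU = 68, so the 4 extra PCPU are a relatively
#               small addition, which is the "gap" point above.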
With the current master code it's possible to create only 4 instances, whereas now, by reporting both VCPU and PCPU, it will allow the user to create a total of 8 instances, which adds another level of problem on top of the existing known issue. Is this acceptable? Because this is compounding the problems.
I think it is acceptable, yes. As we've said, this is broken behavior and things are just slightly more broken here, though not horribly so. As it stands, if you don't isolate pinned instances from non-pinned instances, you don't get any of the guarantees pinning is supposed to provide. Using the above example, if you booted two pinned and two unpinned instances on the same host, the unpinned instances would float over the pinned instances' cores [*] and impact their performance. If performance is an issue, host aggregates will have been used.

[*] They'll actually float over the entire range of host cores since instances without a NUMA topology don't respect the 'vcpu_pin_set' value.
If it is not acceptable, then we could report only PCPU in this case, which would solve two problems: 1) the existing known issue on current master (allowing both pinned and non-pinned instances) on a compute host meant for pinning, and 2) the above issue of allowing 8 instances to be created on the host. But there is one problem with taking this decision: if no instances are running on the compute node and only ``vcpu_pin_set`` is set, how do you find out whether this compute node is configured to create pinned or non-pinned instances? If instances are running, it's possible to detect that based on the host numa_topology.pinned_cpus.
As noted previously, this is too complex and too error prone. Let's just suffer the potential additional impact on performance for those who haven't correctly configured their deployment, knowing that as soon as they get to U, where we can require the 'cpu_dedicated_set' and 'cpu_shared_set' options if you want to use pinned instances, things will be fixed.

Stephen
Stephen Finucane <sfinucan@redhat.com> wrote on Tue, Jun 18, 2019 at 5:55 PM:
Yes, I agree with Stephen: we don't suggest that users mix pinned and non-pinned instances on the same host with current master. If a user wants to mix pinned and non-pinned instances, they need to update their configuration to use cpu_dedicated_set and cpu_shared_set. Having vcpu_pin_set report both VCPU and PCPU inventories is an intermediate state, and in that intermediate state the operator still needs to separate pinned and non-pinned instances onto different hosts.
Hi All,

After all the discussion on this mailing thread, I would like to summarize the concluded points as follows:

1. If the operator sets ``vcpu_pin_set`` in Stein and upgrades to Train, or ``vcpu_pin_set`` is set on a new compute node, then both VCPU and PCPU inventory should be reported to placement.
2. A user can't request both ``resources:PCPU`` and ``resources:VCPU`` in a single request in the Train release. In the future 'U' release, a user will be able to request both ``resources:PCPU`` and ``resources:VCPU`` in a single request.
3. In the 'U' release, the ``vcpu_pin_set`` config option will be removed. In this case, the operator will need to set either ``cpu_shared_set`` or ``cpu_dedicated_set`` accordingly on old compute nodes, and on new compute nodes the operator can set both ``cpu_shared_set`` and ``cpu_dedicated_set`` if required.
4. In the Train release, the operator will also need to retain the same host aggregate behavior as in Stein to differentiate NUMA-aware compute hosts:
   * Hosts meant for pinned instances should be part of an aggregate with metadata "pinned=True".
   * Hosts meant for non-pinned instances should be part of an aggregate with metadata "pinned=False".
5. In the Train release, an old flavor can be used as is, in which case the scheduler pre-filter will map it to the new syntax ``resources:PCPU`` when cpu_policy=dedicated.
6. In the Train release, the new flavor syntax ``resources:PCPU=1`` will be accepted in flavor extra specs, but in this case we expect the operator to set "aggregate_instance_extra_specs:pinned=True" in the flavor extra specs and the hosts to be part of an aggregate which has metadata "pinned=True".

Regards,
Bhagyashri Shewale
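To illustrate points 5 and 6, the two flavor styles that would co-exist in Train might look like this (extra-spec values are only an example drawn from this thread):

# Old-style flavor (point 5): kept as is; the scheduler pre-filter translates
# hw:cpu_policy=dedicated into a resources:PCPU request behind the scenes.
old_style_pinned_flavor = {
    'hw:cpu_policy': 'dedicated',
    'aggregate_instance_extra_specs:pinned': 'true',
}

# New-style flavor (point 6): requests PCPU directly, but in Train the
# operator is still expected to keep the "pinned=True" aggregate metadata.
new_style_pinned_flavor = {
    'resources:PCPU': '1',
    'aggregate_instance_extra_specs:pinned': 'true',
}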
Shewale, Bhagyashri <Bhagyashri.Shewale@nttdata.com> wrote on Wed, Jun 19, 2019 at 11:01 AM:
Hi All,
After all the discussion on the mailing thread, I would like to summarize the concluded points as follows:
1. If the operator sets ``vcpu_pin_set`` in Stein and upgrades to Train, or ``vcpu_pin_set`` is set on a new compute node, then both VCPU and PCPU inventory should be reported to placement.
2. The user can't request both ``resources:PCPU`` and ``resources:VCPU`` in a single request in the Train release. In the future 'U' release, the user can request both ``resources:PCPU`` and ``resources:VCPU`` in a single request.
3. In the 'U' release, the "vcpu_pin_set" config option will be removed. In this case, the operator will need to set either "cpu_shared_set" or "cpu_dedicated_set" accordingly on old compute nodes; on new compute nodes, the operator can set both config options "cpu_shared_set" and "cpu_dedicated_set" if required.
4. In the Train release, the operator will also need to retain the same host-aggregate behavior as in Stein to differentiate between NUMA-aware compute hosts:
- Hosts meant for pinned instances should be part of an aggregate with metadata "pinned=True"
- Hosts meant for non-pinned instances should be part of an aggregate with metadata "pinned=False"
5. In the Train release, an old flavor can be used as-is, in which case the scheduler pre-filter will map it to the new syntax "resources:PCPU" when cpu_policy=dedicated.
+1 all above
1. In the Train release, the new flavor syntax "resources:PCPU=1" will be accepted in flavor extra specs, but in this case we expect the operator to set "aggregate_instance_extra_specs:pinned=True" in the flavor extra specs and the hosts to be part of an aggregate which has the metadata "pinned=True".
If the user finished the upgrade and switched to dedicated_cpu_set and shared_cpu_set, then they don't need the aggregate anymore.
For using resources:PCPU directly, I'm not sure. I talked with Sean a few days ago, and we both think that we shouldn't allow the user to use the "resources" extra spec directly. Thinking about the future, when we have NUMA in placement, the resources and traits extra specs can't express all the guest NUMA info, like which NUMA node is the first one. It is also hard to parse the guest NUMA topology from those extra specs, and it isn't human readable. Also, "hw:" provides some abstraction over the resources/traits extra specs, allowing us to do some upgrades without asking users to update their flavors. But this isn't a critical problem for now, until we have NUMA in placement.
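Point 5 of the summary (and the hw: versus resources: discussion) comes down to a translation along these lines; a minimal illustrative sketch, not the actual pre-filter code:

    def requested_resources(flavor_vcpus, extra_specs):
        # Old-style flavors keep hw:cpu_policy; the scheduler request is expressed
        # in PCPU/VCPU resource classes instead.
        if extra_specs.get('hw:cpu_policy') == 'dedicated':
            return {'PCPU': flavor_vcpus}
        return {'VCPU': flavor_vcpus}

    old_flavor = {'hw:cpu_policy': 'dedicated',
                  'aggregate_instance_extra_specs:pinned': 'true'}
    print(requested_resources(4, old_flavor))  # {'PCPU': 4}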
Regards,
Bhagyashri Shewale
------------------------------ From: Stephen Finucane <sfinucan@redhat.com> Sent: Tuesday, June 18, 2019 6:51:15 PM To: Shewale, Bhagyashri; openstack-discuss@lists.openstack.org Subject: Re: [nova] Spec: Standardize CPU resource tracking
On Tue, 2019-06-18 at 06:41 +0000, Shewale, Bhagyashri wrote:
As above, ignore 'cpu_shared_set' but issue a warning. Use the value of 'vcpu_pin_set' to report both VCPU and PCPU inventory. Note that 'vcpu_pin_set' is already used to calculate VCPU inventory.
On Wed, 2019-06-19 at 11:59 +0800, Alex Xu wrote:
If the user finished the upgrade and switched to dedicated_cpu_set and shared_cpu_set, then they don't need the aggregate anymore.
yes, once they remove vcpu_pin_set and define shared_cpu_set or dedicated_cpu_set they no longer need aggregates. the aggregates are just needed to cover the time after they have upgraded but before they reconfigure, as those should be done as two discrete actions and can be separated by a long time period.
For using resources:PCPU directly, I'm not sure. I talked with Sean a few days ago, and we both think that we shouldn't allow the user to use the "resources" extra spec directly. Thinking about the future, when we have NUMA in placement, the resources and traits extra specs can't express all the guest NUMA info, like which NUMA node is the first one. It is also hard to parse the guest NUMA topology from those extra specs, and it isn't human readable. Also, "hw:" provides some abstraction over the resources/traits extra specs, allowing us to do some upgrades without asking users to update their flavors. But this isn't a critical problem for now, until we have NUMA in placement.
yep, i agree with ^. using "resources:" directly is something i would discourage, as it is a leaky abstraction that is directly mapped to the placement API, and as a result, if you use it, you will need to update your flavors as new features are added. so for cpu pinning i would personally recommend only using hw:cpu_policy.
On Wed, 2019-06-12 at 09:10 +0000, Shewale, Bhagyashri wrote:
Hi All,
Currently I am working on implementation of cpu pinning upgrade part as mentioned in the spec [1].
While implementing the scheduler pre-filter as mentioned in [1], I have encountered one big issue:
Proposed change in spec: In scheduler pre-filter we are going to alias request_spec.flavor.extra_spec and request_spec.image.properties form ``hw:cpu_policy`` to ``resources=(V|P)CPU:${flavor.vcpus}`` of existing instances.
So when user will create a new instance or execute instance actions like shelve, unshelve, resize, evacuate and migration post upgrade it will go through scheduler pre-filter which will set alias for `hw:cpu_policy` in request_spec flavor ``extra specs`` and image metadata properties. In below particular case, it won’t work:-
For example:
I have two compute nodes say A and B:
On Stein:
Compute node A configurations:
vcpu_pin_set=0-3 (used as dedicated CPU; this host is added to an aggregate which has "pinned" metadata)
vcpu_pin_set does not mean that the host was used for pinned instances: https://that.guru/blog/cpu-resources/
Compute node B Configuration:
vcpu_pin_set=0-3 (used as dedicated CPU, This host is added in aggregate which has “pinned” metadata)
On Train, two possible scenarios:
Compute node A configurations: (Consider the new cpu pinning implementation is merged into Train)
vcpu_pin_set=0-3 (Keep same settings as in Stein)
Compute node B Configuration: (Consider the new cpu pinning implementation is merged into Train)
cpu_dedicated_set=0-3 (change to the new config option)
1. Consider that one instance, say `test`, is created using a flavor having the old extra specs (hw:cpu_policy=dedicated, "aggregate_instance_extra_specs:pinned": "true") in the Stein release, and Nova is now upgraded to Train with the above configuration.
2. Now when the user performs an instance action, say shelve/unshelve, the scheduler pre-filter will change the request_spec flavor extra spec from ``hw:cpu_policy`` to ``resources=PCPU:$<no. of cpus>``
it won't remove hw:cpu_policy, it will just change resources=VCPU:$<no. of cpus> -> resources=PCPU:$<no. of cpus>
which ultimately will return only compute node B from the placement service.
that is incorrect, both A and B will be returned. the spec states that for host A we report an inventory of 4 VCPUs and an inventory of 4 PCPUs, and host B will have one inventory of 4 PCPUs, so both hosts will be returned assuming $<no. of cpus> <= 4
Here, we expect it should have returned both compute A and compute B.
it will
3. If the user creates a new instance using the old extra specs (hw:cpu_policy=dedicated, "aggregate_instance_extra_specs:pinned": "true") on the Train release with the above configuration, then it will return only compute node B from the placement service, whereas it should have returned both compute nodes A and B.
that is what would have happened in the Stein version of the spec, and we changed the spec specifically to ensure that won't happen. in the Train version of the spec you will get both hosts as candidates to prevent this upgrade impact.
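To illustrate the dual-inventory point, a rough sketch of the candidate selection (inventory totals taken from the 0-3 example above; this is not placement code):

    # Host A keeps vcpu_pin_set=0-3 and reports both classes after the upgrade;
    # host B has switched to cpu_dedicated_set=0-3.
    inventories = {
        'compute-A': {'VCPU': 4, 'PCPU': 4},
        'compute-B': {'PCPU': 4},
    }
    request = {'PCPU': 4}   # resources=PCPU:4 after the pre-filter

    candidates = [host for host, inv in inventories.items()
                  if all(inv.get(rc, 0) >= amount for rc, amount in request.items())]
    print(sorted(candidates))  # ['compute-A', 'compute-B'] -- both hosts are returned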
Problem: As Compute node A is still configured to be used to boot instances with dedicated CPUs same behavior as Stein, it will not be returned by placement service due to the changes in the scheduler pre-filter logic.
Propose changes:
Earlier in the spec [2]: The online data migration was proposed to change flavor extra specs and image metadata properties of request_spec and instance object. Based on the instance host, we can get the NumaTopology of the host which will contain the new configuration options set on the compute host. Based on the NumaTopology of host, we can change instance and request_spec flavor extra specs.
1. Remove cpu_policy from extra specs 2. Add “resources:PCPU=<count>” in extra specs
We can also change the flavor extra specs and image metadata properties of instance and request_spec object using the reshape functionality.
Please give us your feedback on the proposed solution so that we can update specs accordingly.
i am fairly strongly opposed to using an online data migration to modify the request spec to reflect the host they landed on. this specific problem is why the spec was changed in the Train cycle to report dual inventories of VCPU and PCPU if vcpu_pin_set is the only option set, or if no options are set.
[1]: https://review.opendev.org/#/c/555081/28/specs/train/approved/cpu-resources....
[2]: https://review.opendev.org/#/c/555081/23..28/specs/train/approved/cpu-resour...
Thanks and Regards,
-Bhagyashri Shewale-
that is incorrect, both A and B will be returned. the spec states that for host A we report an inventory of 4 VCPUs and an inventory of 4 PCPUs, and host B will have one inventory of 4 PCPUs, so both hosts will be returned assuming $<no. of cpus> <= 4
Means if ``vcpu_pin_set`` is set in the previous release, then report both VCPU and PCPU as inventory (in Train), but this seems contradictory. For example:
On Stein, configuration on compute node A:
vcpu_pin_set=0-3 (this will report 4 VCPUs inventory in the placement database)
On Train:
vcpu_pin_set=0-3
The inventory will be reported as 4 VCPUs and 4 PCPUs in the placement DB.
Now say the user wants to create instances as below:
1. Flavor having extra specs (resources:PCPU=1), instance A
2. Flavor having extra specs (resources:VCPU=1), instance B
For both instance requests, placement will return compute node A.
Instance A: will be pinned to, say, CPU 0
Instance B: will float on 0-3
To resolve the above issue, I think it's possible to detect whether the compute node was configured to be used for pinned instances by checking whether the ``NumaTopology`` ``pinned_cpus`` attribute is not empty. In that case, vcpu_pin_set will be reported as PCPU, otherwise VCPU.
Regards,
-Bhagyashri Shewale-
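The detection proposed here could be sketched roughly as follows (a hypothetical helper only; elsewhere in the thread this approach is rejected as too complex and error prone):

    def report_pin_set_as_pcpu(vcpu_pin_set, pinned_cpus):
        # If only vcpu_pin_set is configured, guess the host's role from the
        # host NUMA topology: any already-pinned CPUs imply a pinning host.
        if not vcpu_pin_set:
            return False
        return bool(pinned_cpus)

    print(report_pin_set_as_pcpu({0, 1, 2, 3}, {0}))     # True  -> report PCPU
    print(report_pin_set_as_pcpu({0, 1, 2, 3}, set()))   # False -> ambiguous, report VCPU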
I'm thinking we should have a recommended upgrade flow. If we give the operator a lot of flexibility to have many combinations of the values of vcpu_pin_set, dedicated_cpu_set and shared_cpu_set, then we have the trouble in this email and also have to do a lot of the checks this email introduced.
I'm thinking that the pre-request filter (which translates cpu_policy=dedicated to a PCPU request) should be enabled only after all the nodes upgrade to the Train release. Before that, all the cpu_policy=dedicated instances still use VCPU.
Trying to imagine the upgrade as below:
1. Rolling upgrade of the compute nodes.
2. The upgraded compute node begins to report both VCPU and PCPU, but with no reshape of the existing inventories yet. The upgraded node is still using the vcpu_pin_set config, or didn't set vcpu_pin_set; in both cases it reports VCPU and PCPU at the same time. And a request with cpu_policy=dedicated still uses VCPU. Then it works the same as the Stein release, and existing instances can be shelved/unshelved, migrated and evacuated.
3. Disable new requests and operations for dedicated instances on the hosts. (This is kind of breaking our live upgrade? I thought this would be a short interruption for the control plane, if that is available.)
4. Reshape the inventories of existing instances for all the hosts.
5. Enable the instances' new requests and operations, and also enable the pre-request filter.
6. The operator copies the value of vcpu_pin_set to dedicated_cpu_set. For the case where vcpu_pin_set isn't set, the value of dedicated_cpu_set should be all the CPU IDs excluding shared_cpu_set, if set.
Two rules here:
1. The operator is not allowed to change dedicated_cpu_set to a value different from vcpu_pin_set while any instance is running on the host.
2. The operator is not allowed to change the values of dedicated_cpu_set and shared_cpu_set while any instance is running on the host.
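A rough sketch of the guard those two rules describe (illustrative only; Sean argues later in the thread that such rules cannot actually be enforced):

    def rule_1_ok(instances_running, vcpu_pin_set, new_dedicated_cpu_set):
        # Rule 1: with instances running, dedicated_cpu_set may only take the
        # value the host previously used for vcpu_pin_set.
        return not instances_running or new_dedicated_cpu_set == vcpu_pin_set

    def rule_2_ok(instances_running, old_sets, new_sets):
        # Rule 2: with instances running, dedicated_cpu_set/shared_cpu_set
        # must not change at all.
        return not instances_running or old_sets == new_sets

    print(rule_1_ok(True, {0, 1, 2, 3}, {0, 1, 2, 3}))  # True: value simply copied over
    print(rule_2_ok(True,
                    {'dedicated': {0, 1, 2, 3}, 'shared': set()},
                    {'dedicated': {0, 1}, 'shared': {2, 3}}))  # False: changed while instances run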
I should adjust the order of 4, 5, 6 as below:
4. The operator copies the value of vcpu_pin_set to dedicated_cpu_set. For the case where vcpu_pin_set isn't set, the value of dedicated_cpu_set should be all the CPU IDs excluding shared_cpu_set, if set.
5. The change of dedicated_cpu_set triggers the reshape of the existing inventories and removes the duplicated VCPU resource reporting.
6. Enable the instances' new requests and operations, and also enable the pre-request filter.
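Step 4's copy rule is simple set arithmetic; a small illustrative sketch (the CPU numbers are assumptions):

    host_cpus = set(range(8))
    vcpu_pin_set = None          # not set on this host
    shared_cpu_set = {6, 7}

    if vcpu_pin_set:
        dedicated_cpu_set = set(vcpu_pin_set)
    else:
        dedicated_cpu_set = host_cpus - shared_cpu_set
    print(sorted(dedicated_cpu_set))  # [0, 1, 2, 3, 4, 5]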
On Mon, 2019-06-17 at 16:45 +0800, Alex Xu wrote:
I'm thinking we should have a recommended upgrade flow. If we give the operator a lot of flexibility to have many combinations of the values of vcpu_pin_set, dedicated_cpu_set and shared_cpu_set, then we have the trouble in this email and also have to do a lot of the checks this email introduced.
we modified the spec intentionally to make upgrading simple. i don't believe the concerns raised in the initial 2 emails are valid if we follow what was detailed in the spec.
we did take some steps to restrict what values you can set. for example, dedicated_cpu_set cannot be set if vcpu_pin_set is set. technically, i believe we relaxed that to say we would ignore vcpu_pin_set in that case, but originally i was pushing for it to be a hard error.
I'm thinking that the pre-request filter (which translates cpu_policy=dedicated to a PCPU request) should be enabled only after all the nodes upgrade to the Train release. Before that, all the cpu_policy=dedicated instances still use VCPU.
it should be enabled after all nodes are upgraded, but not necessarily before all compute nodes are updated to use dedicated_cpu_set.
Trying to imagine the upgrade as below:
1. Rolling upgrade of the compute nodes.
2. The upgraded compute node begins to report both VCPU and PCPU, but with no reshape of the existing inventories yet. The upgraded node is still using the vcpu_pin_set config, or didn't set vcpu_pin_set; in both cases it reports VCPU and PCPU at the same time. And a request with cpu_policy=dedicated still uses VCPU. Then it works the same as the Stein release, and existing instances can be shelved/unshelved, migrated and evacuated.
+1
3. Disable new requests and operations for dedicated instances on the hosts. (This is kind of breaking our live upgrade? I thought this would be a short interruption for the control plane, if that is available.)
im not sure why we need to do this unless you are thinking this will be done by a cli? e.g. like nova-manage.
4. Reshape the inventories of existing instances for all the hosts.
should this not happen when the agent starts up?
5. Enable the instances' new requests and operations, and also enable the pre-request filter.
6. The operator copies the value of vcpu_pin_set to dedicated_cpu_set.
vcpu_pin_set is not the set of cpus used for pinning.
+1 the operators should set dedicated_cpu_set and shared_cpu_set appropriately at this point, but in general they probably won't just copy it, as hosts that used vcpu_pin_set but were not used for pinned instances will be copied to shared_cpu_set.
For the case where vcpu_pin_set isn't set, the value of dedicated_cpu_set should be all the CPU IDs excluding shared_cpu_set, if set.
Two rules here:
1. The operator is not allowed to change dedicated_cpu_set to a value different from vcpu_pin_set while any instance is running on the host.
2. The operator is not allowed to change the values of dedicated_cpu_set and shared_cpu_set while any instance is running on the host.
neither of these rules can be enforced. one of the requirements that dan smith had for edge computing is that we need to support upgrades with instances in place.
Sean Mooney <smooney@redhat.com> wrote on Mon, Jun 17, 2019 at 5:19 PM:
On Mon, 2019-06-17 at 16:45 +0800, Alex Xu wrote:
I'm thinking that the pre-request filter (which translates cpu_policy=dedicated to a PCPU request) should be enabled only after all the nodes upgrade to the Train release. Before that, all the cpu_policy=dedicated instances still use VCPU.
it should be enabled after all nodes are upgraded, but not necessarily before all compute nodes are updated to use dedicated_cpu_set.
If we enable the pre-request filter in the middle of the upgrade, there will be the problem Bhagyashri described. Reporting PCPU and VCPU at the same time doesn't resolve that concern, as I understand it. For example, we have 100 nodes for dedicated hosts in the cluster. The operator begins to upgrade the cluster. The control plane upgrades first, and the pre-request filter is enabled. For a rolling upgrade, the operator begins by upgrading 10 nodes. Then only those 10 nodes report PCPU and VCPU at the same time. But any new request with the dedicated cpu policy begins to request PCPU, so all of those new instances can only go to those 10 nodes. Also, existing instances that resize, evacuate or shelve/unshelve will go to those 10 nodes. That makes me kind of nervous about the capacity at that time.
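Putting rough numbers on that concern (assumed figures, four pinnable CPUs per host):

    total_dedicated_hosts = 100
    upgraded_hosts = 10
    pcpus_per_host = 4

    # PCPU capacity visible to the scheduler while only 10 nodes are upgraded,
    # versus the capacity once every dedicated host reports PCPU.
    print(upgraded_hosts * pcpus_per_host)         # 40
    print(total_dedicated_hosts * pcpus_per_host)  # 400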
Trying to imagine the upgrade as below:
1. Rolling upgrade of the compute nodes.
2. The upgraded compute node begins to report both VCPU and PCPU, but with no reshape of the existing inventories yet. The upgraded node is still using the vcpu_pin_set config, or didn't set vcpu_pin_set; in both cases it reports VCPU and PCPU at the same time. And a request with cpu_policy=dedicated still uses VCPU. Then it works the same as the Stein release, and existing instances can be shelved/unshelved, migrated and evacuated.
+1
3. Disable new requests and operations for dedicated instances on the hosts. (This is kind of breaking our live upgrade? I thought this would be a short interruption for the control plane, if that is available.)
im not sure why we need to do this unless you are thinking this will be done by a cli? e.g. like nova-manage.
The inventories of existed instance still consumes VCPU. As we know the PCPU and VCPU reporting same time, that is kind of duplicated resources. If we begin to consume the PCPU, in the end, it will over consume the resource. yes, the disable request is done by CLI, probably disable the service.
4. reshape the inventories for existed instance for all the hosts. should this not happen when the agent starts up? 5. Enable the instance's new request and operation, also enable the pre-request filter. 6. Operator copies the value of vcpu_pin_set to dedicated_cpu_set. vcpu_pin_set is not the set of cpu used for pinning. the operators should set dedicated_cpu_set and shared_cpu_set approprealy at this point but in general they proably wont just copy it as host that used vcpu_pin_set but were not used for pinned instances will be copied to shared_cpu_set.
Yes, I should say this upgrade flow is for those dedicated instance host. For the host only running floating instance, they doesn't have trouble with those problem.
For the case of vcpu_pin_set isn't set, the value of dedicated_cpu_set should be all the cpu ids exclude shared_cpu_set if set.
Two rules at here: 1. The operator doesn't allow to change a different value for dedicated_cpu_set with vcpu_pin_set when any instance is running on the host. 2. The operator doesn't allow to change the value of dedicated_cpu_set and shared_cpu_set when any instance is running on the host. neither of these rule can be enforced. one of the requirements that dan smith had for edge computeing is that we need to supprot upgraes with instance inplace.
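To make the end state being discussed concrete, here is a minimal sketch of the configuration split (the CPU ranges and host roles are illustrative only, and the option spellings follow the Train spec rather than the shorthand used in this thread, so treat them as assumptions until the code merges):

    # Stein, or a Train host that has not been converted yet
    [DEFAULT]
    vcpu_pin_set = 0-3

    # Converted Train host used only for pinned guests
    [compute]
    cpu_dedicated_set = 0-3

    # Converted Train host used only for floating guests
    [compute]
    cpu_shared_set = 0-3

Per the constraints discussed above, setting cpu_dedicated_set together with vcpu_pin_set on the same host is either ignored or rejected, and on a host whose vcpu_pin_set was never used for pinning the value belongs in cpu_shared_set rather than being blindly copied into cpu_dedicated_set.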
Shewale, Bhagyashri <Bhagyashri.Shewale@nttdata.com> wrote on Fri, Jun 14, 2019 at 4:42 PM:
that is incorrect, both A and B will be returned. the spec states that for host A we report an inventory of 4 VCPUs and an inventory of 4 PCPUs, and host B will have an inventory of 4 PCPUs, so both hosts will be returned assuming $<no. of cpus> <= 4
This means that if ``vcpu_pin_set`` was set in the previous release, then both VCPU and PCPU are reported as inventory (in Train), but this seems contradictory. For example:
On Stein,
Configuration on compute node A:
vcpu_pin_set=0-3 (This will report 4 VCPUs inventory in placement database)
On Train:
vcpu_pin_set=0-3
The inventory will be reported as 4 VCPUs and 4 PCPUs in the placement db
Now say user wants to create instances as below:
1. Flavor having extra specs (resources:PCPU=1), instance A
2. Flavor having extra specs (resources:VCPU=1), instance B
For both instance requests, placement will return compute Node A.
Instance A: will be pinned to say 0 CPU
Instance B: will float on 0-3
To resolve the above issue, I think it’s possible to detect whether the compute node was configured to be used for pinned instances by checking whether the ``NumaTopology`` ``pinned_cpus`` attribute is not empty. In that case, vcpu_pin_set would be reported as PCPU, otherwise as VCPU.
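A rough sketch of that heuristic (illustration only, not proposed code; the cell objects below are simplified stand-ins for Nova's NUMA topology objects, and, as pointed out later in the thread, the check cannot classify a host that has no instances on it yet):

    def pick_inventory(numa_cells, vcpu_pin_set):
        # If any NUMA cell already has pinned CPUs, assume the host is used
        # for pinned guests and report the pin set as PCPU inventory;
        # otherwise keep reporting it as VCPU.
        has_pinned = any(cell.pinned_cpus for cell in numa_cells)
        resource_class = 'PCPU' if has_pinned else 'VCPU'
        return {resource_class: len(vcpu_pin_set)}

    class Cell(object):
        def __init__(self, pinned_cpus):
            self.pinned_cpus = pinned_cpus

    # Host with CPUs 0 and 1 already pinned -> {'PCPU': 4}
    print(pick_inventory([Cell({0, 1})], {0, 1, 2, 3}))
    # Empty host -> {'VCPU': 4}, even if it was intended for pinned guests
    print(pick_inventory([Cell(set())], {0, 1, 2, 3}))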
Regards,
-Bhagyashri Shewale-
------------------------------ *From:* Sean Mooney <smooney@redhat.com> *Sent:* Thursday, June 13, 2019 8:32:02 PM *To:* Shewale, Bhagyashri; openstack-discuss@lists.openstack.org; openstack@fried.cc; sfinucan@redhat.com; jaypipes@gmail.com *Subject:* Re: [nova] Spec: Standardize CPU resource tracking
On Wed, 2019-06-12 at 09:10 +0000, Shewale, Bhagyashri wrote:
Hi All,
Currently I am working on implementation of cpu pinning upgrade part as
mentioned in the spec [1].
While implementing the scheduler pre-filter as mentioned in [1], I
have
encountered one big issue:
Proposed change in spec: In scheduler pre-filter we are going to
alias
request_spec.flavor.extra_spec and
request_spec.image.properties form ``hw:cpu_policy`` to
``resources=(V|P)CPU:${flavor.vcpus}`` of existing instances.
So when user will create a new instance or execute instance actions
like shelve, unshelve, resize, evacuate and
migration post upgrade it will go through scheduler pre-filter which
will set alias for `hw:cpu_policy` in
request_spec flavor ``extra specs`` and image metadata properties. In
below particular case, it won’t work:-
For example:
I have two compute nodes say A and B:
On Stein:
Compute node A configurations:
vcpu_pin_set=0-3 (used as dedicated CPU, This host is added in
aggregate
which has “pinned” metadata)
vcpu_pin_set does not mean that the host was used for pinned instances https://that.guru/blog/cpu-resources/
Compute node B Configuration:
vcpu_pin_set=0-3 (used as dedicated CPU, This host is added in
aggregate
which has “pinned” metadata)
On Train, two possible scenarios:
Compute node A configurations: (Consider the new cpu pinning
implementation is merged into Train)
vcpu_pin_set=0-3 (Keep same settings as in Stein)
Compute node B Configuration: (Consider the new cpu pinning
implementation is merged into Train)
cpu_dedicated_set=0-3 (change to the new config option)
1. Consider that one instance say `test ` is created using flavor
having old extra specs (hw:cpu_policy=dedicated,
"aggregate_instance_extra_specs:pinned": "true") in Stein release and
now upgraded Nova to Train with the above
configuration. 2. Now when user will perform instance action say shelve/unshelve
scheduler pre-filter will change the
request_spec flavor extra spec from ``hw:cpu_policy`` to
``resources=PCPU:$<no. of cpus>``
it won't remove hw:cpu_policy, it will just change resources=VCPU:$<no. of cpus> -> resources=PCPU:$<no. of cpus>
which ultimately will return only compute node B from placement service.
that is incorrect, both A and B will be returned. the spec states that for host A we report an inventory of 4 VCPUs and an inventory of 4 PCPUs, and host B will have an inventory of 4 PCPUs, so both hosts will be returned assuming $<no. of cpus> <= 4
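A small illustration of the dual reporting described here (values follow the example hosts A and B; the plain dicts are only shorthand for this thread, not placement's actual API):

    # Inventory each host would expose under the Train version of the spec.
    inventories = {
        'compute-a': {'VCPU': 4, 'PCPU': 4},  # still using vcpu_pin_set=0-3
        'compute-b': {'PCPU': 4},             # converted to cpu_dedicated_set=0-3
    }

    # A request translated to resources:PCPU=4 (or fewer) is satisfiable by
    # both providers, so both hosts stay in the candidate list.
    request = {'PCPU': 4}
    candidates = [host for host, inv in inventories.items()
                  if all(inv.get(rc, 0) >= amount for rc, amount in request.items())]
    print(candidates)  # ['compute-a', 'compute-b']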
Here, we expect it should have retuned both Compute A and Compute B.
it will
3. If user creates a new instance using old extra specs
(hw:cpu_policy=dedicated,
"aggregate_instance_extra_specs:pinned": "true") on Train release with
the above configuration then it will return
only compute node B from placement service where as it should have
returned both compute Node A and B.
that is what would have happened in the Stein version of the spec, and we changed the spec specifically to ensure that won't happen. in the Train version of the spec you will get both hosts as candidates to prevent this upgrade impact.
Problem: As Compute node A is still configured to be used to boot
instances with dedicated CPUs same behavior as
Stein, it will not be returned by placement service due to the changes
in the scheduler pre-filter logic.
Propose changes:
Earlier in the spec [2]: The online data migration was proposed to change flavor extra specs and image metadata properties of request_spec and instance object. Based on the instance host, we can get the NumaTopology of the host which will contain the new configuration options set on the compute host. Based on the NumaTopology of the host, we can change instance and request_spec flavor extra specs:
1. Remove cpu_policy from extra specs
2. Add “resources:PCPU=<count>” in extra specs
We can also change the flavor extra specs and image metadata properties of instance and request_spec object using the reshape functionality.
Please give us your feedback on the proposed solution so that we can update specs accordingly.
i am fairly strongly opposed to using an online data migration to modify the request spec to reflect the host they landed on. this specific problem is why the spec was changed in the train cycle to report dual inventories of VCPU and PCPU if vcpu_pin_set is the only option set or if no options are set.
[1]:
https://review.opendev.org/#/c/555081/28/specs/train/approved/cpu-resources....
[2]:
https://review.opendev.org/#/c/555081/23..28/specs/train/approved/cpu-resour...
Thanks and Regards,
-Bhagyashri Shewale-
Disclaimer: This email and any attachments are sent in strictest
confidence for the sole use of the addressee and may
contain legally privileged, confidential, and proprietary data. If you
are not the intended recipient, please advise
the sender by replying promptly to this email and then delete and
destroy this email and any attachments without any
further use, copying or forwarding.
Disclaimer: This email and any attachments are sent in strictest confidence for the sole use of the addressee and may contain legally privileged, confidential, and proprietary data. If you are not the intended recipient, please advise the sender by replying promptly to this email and then delete and destroy this email and any attachments without any further use, copying or forwarding.
On Mon, 2019-06-17 at 17:47 +0800, Alex Xu wrote:
Sean Mooney <smooney@redhat.com> wrote on Mon, Jun 17, 2019 at 5:19 PM:
On Mon, 2019-06-17 at 16:45 +0800, Alex Xu wrote:
I'm thinking we should have a recommended upgrade flow. If we give the operator a lot of flexibility to use many combinations of the values of vcpu_pin_set, dedicated_cpu_set and shared_cpu_set, then we get the trouble described in this email and have to do a lot of the checks this email introduced.
we modified the spec intentionally to make upgrading simple. i don't believe the concerns raised in the initial 2 emails are valid if we follow what was detailed in the spec. we did take some steps to restrict what values you can set. for example dedicated_cpu_set cannot be set if vcpu_pin_set is set. technically i believe we relaxed that to say we would ignore vcpu_pin_set in that case, but originally i was pushing for it to be a hard error.
I'm thinking that the pre-request filter (which translates cpu_policy=dedicated to a PCPU request) should be enabled after all the nodes upgrade to the Train release. Before that, all the cpu_policy=dedicated instances still use VCPU.
it should be enabled after all nodes are upgraded, but not necessarily before all compute nodes are updated to use dedicated_cpu_set.
If we enable the pre-request filter in the middle of the upgrade, we will have the problem Bhagyashri described. Reporting PCPU and VCPU at the same time doesn't resolve his concern, as I understand it.
For example, we have 100 dedicated hosts in the cluster.
The operator begins to upgrade the cluster. The control plane upgrades first, and the pre-request filter is enabled. For a rolling upgrade, the operator upgrades 10 nodes first; only those 10 nodes report PCPU and VCPU at the same time. But any new request with the dedicated cpu policy begins to request PCPU, so all of those new instances can only go to those 10 nodes. Existing instances that resize, evacuate or shelve/unshelve also end up on those 10 nodes. That makes capacity rather tight at that point.
The exact same issue can happen the other way around. As an operator slowly starts upgrading, by setting the necessary configuration options, the compute nodes will reduce the VCPU inventory they report and start reporting PCPU inventory. Using the above example, if we upgraded 90 of the 100 compute nodes and didn't enable the prefilter, we would only be able to schedule to one of the remaining 10 nodes. This doesn't seem any better.

At some point we're going to need to make a clean break from pinned instances consuming VCPU resources to them using PCPU resources. When that happens is up to us. I figured it was easiest to do this as soon as the controllers were updated because I had assumed compute nodes would be updated pretty soon after the controllers and therefore there would only be a short window where instances would start requesting PCPU resources but there wouldn't be any available. Maybe that doesn't make sense though. If not, I guess we need to make this configurable.

I propose that as soon as compute nodes are upgraded then they will all start reporting PCPU inventory, as noted in the spec. However, the prefilter will initially be disabled and we will not reshape existing inventories. This means pinned instances will continue consuming VCPU resources as before but that is not an issue since this is the behavior we currently have. Once the operator is happy that all of the compute nodes have been upgraded, or at least enough that they care about, we will then need some way for us to switch on the prefilter and reshape existing instances. Perhaps this would require manual configuration changes, validated by an upgrade check, or perhaps we could add a workaround config option?

In any case, at some point we need to have a switch from "use VCPUs for pinned instances" to "use PCPUs for pinned instances".

Stephen
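A hedged sketch of what such a switch might look like (the flag and function names below are assumptions made up for illustration, not the agreed design):

    def maybe_translate_pinned_request(extra_specs, vcpus, pcpu_prefilter_enabled):
        # Until the operator flips the (hypothetical) switch, keep the Stein
        # behaviour: pinned instances keep requesting VCPU.
        if not pcpu_prefilter_enabled:
            return extra_specs
        if extra_specs.get('hw:cpu_policy') != 'dedicated':
            return extra_specs
        translated = dict(extra_specs)
        # hw:cpu_policy itself is kept; only the resource request changes.
        translated.pop('resources:VCPU', None)
        translated['resources:PCPU'] = str(vcpus)
        return translated

    # Before the switch nothing changes; after it the flavor asks for PCPU.
    print(maybe_translate_pinned_request({'hw:cpu_policy': 'dedicated'}, 4, False))
    print(maybe_translate_pinned_request({'hw:cpu_policy': 'dedicated'}, 4, True))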
Trying to imagine the upgrade as below:
1. Rolling upgrade the compute nodes. 2. The upgraded compute node begins to report both VCPU and PCPU, without reshaping the existing inventories yet. The upgraded node is still using the vcpu_pin_set config, or never set it; in both cases it reports VCPU and PCPU at the same time. A request with cpu_policy=dedicated still uses VCPU, so it works the same as the Stein release, and existing instances can be shelved/unshelved, migrated and evacuated.
+1
3. Disable new requests and operations that place instances on the dedicated-instance hosts. (Is this kind of breaking our live upgrade? I thought this would be a short interruption of the control plane, if that is acceptable.)
i'm not sure why we need to do this, unless you are thinking this will be done by a CLI? e.g. like nova-manage.
The existing instances still consume VCPU inventory. Since PCPU and VCPU are reported at the same time, that is effectively duplicated resources; if we begin to consume PCPU as well, we will end up over-consuming the host.
Yes, the disabling is done by CLI, probably by disabling the service.
4. Reshape the inventories of existing instances on all the hosts.
should this not happen when the agent starts up?
5. Enable new instance requests and operations again, and also enable the pre-request filter. 6. The operator copies the value of vcpu_pin_set to dedicated_cpu_set.
vcpu_pin_set is not the set of CPUs used for pinning. the operators should set dedicated_cpu_set and shared_cpu_set appropriately at this point, but in general they probably won't just copy it, as hosts that used vcpu_pin_set but were not used for pinned instances should have it copied to shared_cpu_set instead.
Yes, I should say this upgrade flow is for the dedicated-instance hosts. Hosts that only run floating instances don't have these problems.
For the case where vcpu_pin_set isn't set, the value of dedicated_cpu_set should be all the CPU ids excluding shared_cpu_set, if that is set.
Two rules here: 1. The operator is not allowed to set dedicated_cpu_set to a value different from vcpu_pin_set while any instance is running on the host. 2. The operator is not allowed to change the values of dedicated_cpu_set and shared_cpu_set while any instance is running on the host.
neither of these rules can be enforced. one of the requirements that dan smith had for edge computing is that we need to support upgrades with instances in place.
Stephen Finucane <sfinucan@redhat.com> wrote on Mon, Jun 17, 2019 at 8:47 PM:
On Mon, 2019-06-17 at 17:47 +0800, Alex Xu wrote:
Sean Mooney <smooney@redhat.com> wrote on Mon, Jun 17, 2019 at 5:19 PM:
On Mon, 2019-06-17 at 16:45 +0800, Alex Xu wrote:
I'm thinking we should have a recommended upgrade flow. If we give the operator a lot of flexibility to use many combinations of the values of vcpu_pin_set, dedicated_cpu_set and shared_cpu_set, then we get the trouble described in this email and have to do a lot of the checks this email introduced.
we modified the spec intentionally to make upgrading simple. i don't believe the concerns raised in the initial 2 emails are valid if we follow what was detailed in the spec. we did take some steps to restrict what values you can set. for example dedicated_cpu_set cannot be set if vcpu_pin_set is set. technically i believe we relaxed that to say we would ignore vcpu_pin_set in that case, but originally i was pushing for it to be a hard error.
I'm thinking that the pre-request filter (which translates cpu_policy=dedicated to a PCPU request) should be enabled after all the nodes upgrade to the Train release. Before that, all the cpu_policy=dedicated instances still use VCPU.
it should be enabled after all nodes are upgraded, but not necessarily before all compute nodes are updated to use dedicated_cpu_set.
If we enable the pre-request filter in the middle of the upgrade, we will have the problem Bhagyashri described. Reporting PCPU and VCPU at the same time doesn't resolve his concern, as I understand it.
For example, we have 100 dedicated hosts in the cluster.
The operator begins to upgrade the cluster. The control plane upgrades first, and the pre-request filter is enabled. For a rolling upgrade, the operator upgrades 10 nodes first; only those 10 nodes report PCPU and VCPU at the same time. But any new request with the dedicated cpu policy begins to request PCPU, so all of those new instances can only go to those 10 nodes. Existing instances that resize, evacuate or shelve/unshelve also end up on those 10 nodes. That makes capacity rather tight at that point.
The exact same issue can happen the other way around. As an operator slowly starts upgrading, by setting the necessary configuration options, the compute nodes will reduce the VCPU inventory they report and start reporting PCPU inventory. Using the above example, if we upgraded 90 of the 100 compute nodes and didn't enable the prefilter, we would only be able to schedule to one of the remaining 10 nodes. This doesn't seem any better.
At some point we're going to need to make a clean break from pinned instances consuming VCPU resources to them using PCPU resources. When that happens is up to us. I figured it was easiest to do this as soon as the controllers were updated because I had assumed compute nodes would be updated pretty soon after the controllers and therefore there would only be a short window where instances would start requesting PCPU resources but there wouldn't be any available. Maybe that doesn't make sense though. If not, I guess we need to make this configurable.
I propose that as soon as compute nodes are upgraded then they will all start reporting PCPU inventory, as noted in the spec. However, the prefilter will initially be disabled and we will not reshape existing inventories. This means pinned instances will continue consuming VCPU resources as before but that is not an issue since this is the behavior we currently have. Once the operator is happy that all of the compute nodes have been upgraded, or at least enough that they care about, we will then need some way for us to switch on the prefilter and reshape existing instances. Perhaps this would require manual configuration changes, validated by an upgrade check, or perhaps we could add a workaround config option?
In any case, at some point we need to have a switch from "use VCPUs for pinned instances" to "use PCPUs for pinned instances".
All agreed, we are talking about the same thing. This is the upgrade flow I wrote below. I didn't see the spec describe those steps clearly, or maybe I missed something.
Stephen
Trying to imagine the upgrade as below:
1. Rolling upgrade the compute nodes. 2. The upgraded compute node begins to report both VCPU and PCPU, without reshaping the existing inventories yet. The upgraded node is still using the vcpu_pin_set config, or never set it; in both cases it reports VCPU and PCPU at the same time. A request with cpu_policy=dedicated still uses VCPU, so it works the same as the Stein release, and existing instances can be shelved/unshelved, migrated and evacuated.
+1
3. Disable new requests and operations that place instances on the dedicated-instance hosts. (Is this kind of breaking our live upgrade? I thought this would be a short interruption of the control plane, if that is acceptable.)
i'm not sure why we need to do this, unless you are thinking this will be done by a CLI? e.g. like nova-manage.
The existing instances still consume VCPU inventory. Since PCPU and VCPU are reported at the same time, that is effectively duplicated resources; if we begin to consume PCPU as well, we will end up over-consuming the host.
Yes, the disabling is done by CLI, probably by disabling the service.
4. Reshape the inventories of existing instances on all the hosts.
should this not happen when the agent starts up?
5. Enable new instance requests and operations again, and also enable the pre-request filter. 6. The operator copies the value of vcpu_pin_set to dedicated_cpu_set.
vcpu_pin_set is not the set of CPUs used for pinning. the operators should set dedicated_cpu_set and shared_cpu_set appropriately at this point, but in general they probably won't just copy it, as hosts that used vcpu_pin_set but were not used for pinned instances should have it copied to shared_cpu_set instead.
Yes, I should say this upgrade flow is for the dedicated-instance hosts. Hosts that only run floating instances don't have these problems.
For the case where vcpu_pin_set isn't set, the value of dedicated_cpu_set should be all the CPU ids excluding shared_cpu_set, if that is set.
Two rules here: 1. The operator is not allowed to set dedicated_cpu_set to a value different from vcpu_pin_set while any instance is running on the host. 2. The operator is not allowed to change the values of dedicated_cpu_set and shared_cpu_set while any instance is running on the host.
neither of these rules can be enforced. one of the requirements that dan smith had for edge computing is that we need to support upgrades with instances in place.
On Fri, 2019-06-14 at 08:37 +0000, Shewale, Bhagyashri wrote:
that is incorrect, both A and B will be returned. the spec states that for host A we report an inventory of 4 VCPUs and an inventory of 4 PCPUs, and host B will have an inventory of 4 PCPUs, so both hosts will be returned assuming $<no. of cpus> <= 4
This means that if ``vcpu_pin_set`` was set in the previous release, then both VCPU and PCPU are reported as inventory (in Train), but this seems contradictory. For example:
On Stein,
Configuration on compute node A: vcpu_pin_set=0-3 (This will report 4 VCPUs inventory in placement database)
On Train: vcpu_pin_set=0-3
The inventory will be reported as 4 VCPUs and 4 PCPUs in the placement db
Now say user wants to create instances as below:
1. Flavor having extra specs (resources:PCPU=1), instance A
2. Flavor having extra specs (resources:VCPU=1), instance B
For both instance requests, placement will return compute Node A.
Instance A: will be pinned to say 0 CPU
Instance B: will float on 0-3
This is not a serious issue. This is very similar to what will happen today if you don't use host aggregates to isolate NUMA-based instances from non-NUMA-based instances. If you can assume that operators are using host aggregates to separate pinned and unpinned instances, then the VCPU inventory of a host in the 'pinned' aggregate will never be consumed and vice versa.
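For reference, the aggregate isolation being assumed looks roughly like this (the aggregate, host and flavor names are made up; the property keys follow the convention already used earlier in this thread and rely on the aggregate extra-specs scheduler filter being enabled):

    openstack aggregate create pinned-hosts
    openstack aggregate set --property pinned=true pinned-hosts
    openstack aggregate add host pinned-hosts compute-a

    openstack flavor set \
        --property hw:cpu_policy=dedicated \
        --property aggregate_instance_extra_specs:pinned=true \
        pinned-flavor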
To resolve the above issue, I think it’s possible to detect whether the compute node was configured to be used for pinned instances by checking whether the ``NumaTopology`` ``pinned_cpus`` attribute is not empty. In that case, vcpu_pin_set would be reported as PCPU, otherwise as VCPU.
This only works if the host already has instances on it. If you've a deployment with 100 hosts and 82 of them have instances on there at the time of upgrade, then 82 will start reporting PCPU inventory and 18 will continue reporting just VCPU inventory. We thought long and hard about this and there is no good heuristic we can use to separate hosts that should report PCPUs from those that should report VCPUs. That's why we said we'll report both and hope that host aggregates are configured correctly. If host aggregates aren't configured, then things are no more broken than before but at least the operator will now get warnings (about missing 'cpu_dedicated_set' options).

As before, please push some of this code up so we can start reviewing it.

Stephen
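If an upgrade check does get added as part of this work, it would presumably surface through the existing tooling; the command below already exists today, but whether it grows a cpu_dedicated_set check is only an assumption at this point:

    nova-status upgrade check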
participants (4):
- Alex Xu
- Sean Mooney
- Shewale, Bhagyashri
- Stephen Finucane