I'm following this document to set up CPU pinning on Rocky:
https://www.redhat.com/en/blog/driving-fast-lane-cpu-pinning-and-numa-topolo...
I followed all of the steps except for modifying non-pinned flavors, and I have one aggregate containing a single NUMA-capable host:
root@us01odc-dev1-ctrl1:/var/log/nova# os aggregate list
+----+-------+-------------------+
| ID | Name  | Availability Zone |
+----+-------+-------------------+
|  4 | perf3 | None              |
+----+-------+-------------------+
root@us01odc-dev1-ctrl1:/var/log/nova# os aggregate show 4
+-------------------+----------------------------+
| Field             | Value                      |
+-------------------+----------------------------+
| availability_zone | None                       |
| created_at        | 2019-10-30T23:05:41.000000 |
| deleted           | False                      |
| deleted_at        | None                       |
| hosts             | [u'us01odc-dev1-hv003']    |
| id                | 4                          |
| name              | perf3                      |
| properties        | pinned='true'              |
| updated_at        | None                       |
+-------------------+----------------------------+
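For anyone reproducing this, the aggregate above can be built with the standard OSC commands (a sketch of what I ran; the names match my setup):

```shell
# Create the host aggregate, tag it for pinned workloads,
# and add the NUMA-capable hypervisor to it
openstack aggregate create perf3
openstack aggregate set --property pinned=true perf3
openstack aggregate add host perf3 us01odc-dev1-hv003
```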
I have a flavor with the NUMA properties:
root@us01odc-dev1-ctrl1:/var/log/nova# os flavor show s1.perf3
+----------------------------+-------------------------------------------------------------------------+
| Field                      | Value                                                                   |
+----------------------------+-------------------------------------------------------------------------+
| OS-FLV-DISABLED:disabled   | False                                                                   |
| OS-FLV-EXT-DATA:ephemeral  | 0                                                                       |
| access_project_ids         | None                                                                    |
| disk                       | 35                                                                      |
| id                         | be3d21c4-7e91-42a2-b832-47f42fdd3907                                    |
| name                       | s1.perf3                                                                |
| os-flavor-access:is_public | True                                                                    |
| properties                 | aggregate_instance_extra_specs:pinned='true', hw:cpu_policy='dedicated' |
| ram                        | 30720                                                                   |
| rxtx_factor                | 1.0                                                                     |
| swap                       | 7168                                                                    |
| vcpus                      | 4                                                                       |
+----------------------------+-------------------------------------------------------------------------+
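The flavor was set up roughly like this (a sketch; the create arguments are inferred from the flavor show output above):

```shell
# Create the flavor, then attach the pinning extra specs:
# hw:cpu_policy=dedicated asks for dedicated pCPUs, and
# aggregate_instance_extra_specs:pinned=true steers it to the pinned aggregate
openstack flavor create --vcpus 4 --ram 30720 --disk 35 --swap 7168 s1.perf3
openstack flavor set s1.perf3 \
  --property hw:cpu_policy=dedicated \
  --property aggregate_instance_extra_specs:pinned=true
```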
I create a VM with that flavor:
openstack server create --flavor s1.perf3 --image NOT-QSC-CentOS6.10-19P1-v4 --network it-network alberttest4
but it goes to error status, and I see this in the logs:
2019-10-30 16:17:55.590 3248800 INFO nova.virt.hardware [req-d0c2de13-db23-41bd-8da3-34c68ff1d998 2cb6757679d54a69803a5b6e317b3a93 474ae347d8ad426f8118e55eee47dcfd - default 7d3a4deab35b434bba403100a6729c81] Computed NUMA topology CPU pinning: usable pCPUs: [[4], [5], [6], [7]], vCPUs mapping: [(0, 4), (1, 5), (2, 6), (3, 7)]
2019-10-30 16:17:55.595 3248800 INFO nova.virt.hardware [req-d0c2de13-db23-41bd-8da3-34c68ff1d998 2cb6757679d54a69803a5b6e317b3a93 474ae347d8ad426f8118e55eee47dcfd - default 7d3a4deab35b434bba403100a6729c81] Computed NUMA topology CPU pinning: usable pCPUs: [[0], [1], [2], [3], [4], [5], [6], [7]], vCPUs mapping: [(0, 0), (1, 1), (2, 2), (3, 3)]
2019-10-30 16:17:55.595 3248800 INFO nova.filters [req-d0c2de13-db23-41bd-8da3-34c68ff1d998 2cb6757679d54a69803a5b6e317b3a93 474ae347d8ad426f8118e55eee47dcfd - default 7d3a4deab35b434bba403100a6729c81] Filter AggregateInstanceExtraSpecsFilter returned 0 hosts
2019-10-30 16:17:55.596 3248800 INFO nova.filters [req-d0c2de13-db23-41bd-8da3-34c68ff1d998 2cb6757679d54a69803a5b6e317b3a93 474ae347d8ad426f8118e55eee47dcfd - default 7d3a4deab35b434bba403100a6729c81] Filtering removed all hosts for the request with instance ID '73b1e584-0ce4-478c-a706-c5892609dc3f'. Filter results: ['RetryFilter: (start: 3, end: 3)', 'AvailabilityZoneFilter: (start: 3, end: 3)', 'CoreFilter: (start: 3, end: 2)', 'RamFilter: (start: 2, end: 2)', 'ComputeFilter: (start: 2, end: 2)', 'ComputeCapabilitiesFilter: (start: 2, end: 2)', 'ImagePropertiesFilter: (start: 2, end: 2)', 'ServerGroupAntiAffinityFilter: (start: 2, end: 2)', 'ServerGroupAffinityFilter: (start: 2, end: 2)', 'DifferentHostFilter: (start: 2, end: 2)', 'SameHostFilter: (start: 2, end: 2)', 'NUMATopologyFilter: (start: 2, end: 2)', 'AggregateInstanceExtraSpecsFilter: (start: 2, end: 0)']
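A quick way to spot which filter eliminated the last candidate hosts is to grep the scheduler log for the "removed all hosts" entry and pull out the filter whose count went to zero. The log path below is an assumption; adjust it for your deployment.

```shell
# Find the most recent scheduling failure and extract the filter
# that took the host count to zero (end: 0)
grep "Filtering removed all hosts" /var/log/nova/nova-scheduler.log \
  | grep -oE "[A-Za-z]+Filter: \(start: [0-9]+, end: 0\)" \
  | tail -n 1
```

On the log above this prints `AggregateInstanceExtraSpecsFilter: (start: 2, end: 0)`, pointing straight at the aggregate metadata check.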
It looks like my hypervisor is passing the hw:cpu_policy='dedicated' requirement but failing on "pinned=true".
The interesting part of the problem is that if I add a second, apparently identical hypervisor to the aggregate, it starts working. I create s1.perf3 VMs, they land on us01odc-dev1-hv002, and the XML shows that they are correctly pinned. When us01odc-dev1-hv002 is full, they start failing again.
What should I be looking for here? What could cause one apparently identical hypervisor to fail AggregateInstanceExtraSpecsFilter while another one passes?
In the nova-compute log of the failing hypervisor I see this:
2019-10-31 10:43:01.147 1103 INFO nova.compute.resource_tracker [req-dda65a9c-9d0a-4888-b4cb-0bf4423dd2f3 - - - - -] Instance 4856d505-c220-4873-b881-836b5b75f7bb has allocations against this compute host but is not found in the database.
2019-10-31 10:43:01.148 1103 INFO nova.compute.resource_tracker [req-dda65a9c-9d0a-4888-b4cb-0bf4423dd2f3 - - - - -] Final resource view: name=us01odc-dev1-hv003.internal.synopsys.com phys_ram=128888MB used_ram=38912MB phys_disk=1208GB used_disk=297GB total_vcpus=8 used_vcpus=6 pci_stats=[]
OpenStack can't find a VM with UUID 4856d505-c220-4873-b881-836b5b75f7bb. There are no VMs on hv003, but I can create a non-pinned VM there and it works. Do I have a "phantom" VM that is consuming resources on hv003? How can I fix that?
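To see what placement thinks is consuming resources on hv003, something like this should work (assuming the osc-placement CLI plugin is installed; the hostname is the resource provider name in my deployment):

```shell
# Find the resource provider record for the suspect host
openstack resource provider list --name us01odc-dev1-hv003.internal.synopsys.com
# Show the allocations held by the suspect consumer (instance) UUID
openstack resource provider allocation show 4856d505-c220-4873-b881-836b5b75f7bb
```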
I found the offending UUID in the nova_api and placement databases. Do I need to delete these entries from the DB or is there a safer way to get rid of the "phantom" VM?
MariaDB [(none)]> select * from nova_api.instance_mappings where instance_uuid = '4856d505-c220-4873-b881-836b5b75f7bb';
| created_at          | updated_at | id  | instance_uuid                        | cell_id | project_id                       | queued_for_delete |
| 2019-10-08 21:26:03 | NULL       | 589 | 4856d505-c220-4873-b881-836b5b75f7bb | NULL    | 474ae347d8ad426f8118e55eee47dcfd | 0                 |
MariaDB [(none)]> select * from nova_api.request_specs where instance_uuid = '4856d505-c220-4873-b881-836b5b75f7bb';
| created_at | updated_at | id | instance_uuid | spec |
| 2019-10-08 21:26:03 | NULL | 589 | 4856d505-c220-4873-b881-836b5b75f7bb | {"nova_object.version": "1.11", "nova_object.changes": ["requested_destination", "instance_uuid", "retry", "num_instances", "pci_requests", "limits", "availability_zone", "force_nodes", "image", "instance_group", "force_hosts", "ignore_hosts", "numa_topology", "is_bfv", "user_id", "flavor", "project_id", "security_groups", "scheduler_hints"], "nova_object.name": "RequestSpec", "nova_object.data": {"requested_destination": null, "instance_uuid": "4856d505-c220-4873-b881-836b5b75f7bb", "retry": null, "num_instances": 1, "pci_requests": {"nova_object.version": "1.1", "nova_object.changes": ["requests"], "nova_object.name": "InstancePCIRequests", "nova_object.data": {"requests": []}, "nova_object.namespace": "nova"}, "limits": {"nova_object.version": "1.0", "nova_object.changes": ["vcpu", "memory_mb", "disk_gb", "numa_topology"], "nova_object.name": "SchedulerLimits", "nova_object.data": {"vcpu": null, "memory_mb": null, "disk_gb": null, "numa_topology": null}, "nova_object.namespace": "nova"}, "availability_zone": null, "force_nodes": null, "image": {"nova_object.version": "1.8", "nova_object.changes": ["status", "name", "container_format", "created_at", "disk_format", "updated_at", "id", "min_disk", "min_ram", "checksum", "owner", "properties", "size"], "nova_object.name": "ImageMeta", "nova_object.data": {"status": "active", "created_at": "2019-10-02T01:10:04Z", "name": "QSC-P-CentOS6.6-19P1-v4", "container_format": "bare", "min_ram": 0, "disk_format": "qcow2", "updated_at": "2019-10-02T01:10:44Z", "id": "200cb134-2716-4662-8183-33642078547f", "min_disk": 0, "checksum": "94d33caafd85b45519fca331ee7ea03e", "owner": "474ae347d8ad426f8118e55eee47dcfd", "properties": {"nova_object.version": "1.20", "nova_object.name": "ImageMetaProps", "nova_object.data": {}, "nova_object.namespace": "nova"}, "size": 4935843840}, "nova_object.namespace": "nova"}, "instance_group": null, "force_hosts": null, "ignore_hosts": null, "numa_topology": null, "is_bfv": false, "user_id": "2cb6757679d54a69803a5b6e317b3a93", "flavor": {"nova_object.version": "1.2", "nova_object.name": "Flavor", "nova_object.data": {"disabled": false, "root_gb": 35, "description": null, "flavorid": "e8b42da7-d352-441e-b494-77d6a6cd7366", "deleted": false, "created_at": "2019-09-23T21:19:50Z", "ephemeral_gb": 10, "updated_at": null, "memory_mb": 4096, "vcpus": 1, "extra_specs": {}, "swap": 3072, "rxtx_factor": 1.0, "is_public": true, "deleted_at": null, "vcpu_weight": 0, "id": 2, "name": "s1.1cx4g"}, "nova_object.namespace": "nova"}, "project_id": "474ae347d8ad426f8118e55eee47dcfd", "security_groups": {"nova_object.version": "1.1", "nova_object.changes": ["objects"], "nova_object.name": "SecurityGroupList", "nova_object.data": {"objects": [{"nova_object.version": "1.2", "nova_object.changes": ["name"], "nova_object.name": "SecurityGroup", "nova_object.data": {"name": "default"}, "nova_object.namespace": "nova"}]}, "nova_object.namespace": "nova"}, "scheduler_hints": {}}, "nova_object.namespace": "nova"} |
1 row in set (0.001 sec)
MariaDB [(none)]> SELECT * FROM placement.allocations WHERE consumer_id = '4856d505-c220-4873-b881-836b5b75f7bb';
| created_at          | updated_at | id   | resource_provider_id | consumer_id                          | resource_class_id | used |
| 2019-10-08 22:03:33 | NULL       | 3073 | 1024                 | 4856d505-c220-4873-b881-836b5b75f7bb | 0                 | 1    |
| 2019-10-08 22:03:33 | NULL       | 3074 | 1024                 | 4856d505-c220-4873-b881-836b5b75f7bb | 1                 | 4096 |
| 2019-10-08 22:03:33 | NULL       | 3075 | 1024                 | 4856d505-c220-4873-b881-836b5b75f7bb | 2                 | 48   |
3 rows in set (0.001 sec)
MariaDB [(none)]> SELECT * FROM placement.consumers WHERE uuid = '4856d505-c220-4873-b881-836b5b75f7bb';
| created_at          | updated_at          | id  | uuid                                 | project_id | user_id | generation |
| 2019-10-08 22:03:33 | 2019-10-08 22:03:33 | 734 | 4856d505-c220-4873-b881-836b5b75f7bb | 1          | 1       | 1          |
1 row in set (0.000 sec)
From: Albert Braden <Albert.Braden@synopsys.com>
Sent: Thursday, October 31, 2019 10:50 AM
To: openstack-discuss@lists.openstack.org
Subject: CPU pinning blues
Post with logs got moderated so they are here:
https://paste.fedoraproject.org/paste/3bza6CJstXFPy8LatRJruA
On 11/5/2019 2:11 PM, Albert Braden wrote:
I found the offending UUID in the nova_api and placement databases. Do I need to delete these entries from the DB or is there a safer way to get rid of the “phantom” VM?
MariaDB [(none)]> select * from nova_api.instance_mappings where instance_uuid = '4856d505-c220-4873-b881-836b5b75f7bb';
| created_at | updated_at | id | instance_uuid | cell_id | project_id | queued_for_delete |
| 2019-10-08 21:26:03 | NULL | 589 | 4856d505-c220-4873-b881-836b5b75f7bb | NULL | 474ae347d8ad426f8118e55eee47dcfd | 0 |
Interesting. So there is an instance mapping but it's not pointing at any cell. I'm assuming there is no entry for this instance in the nova_api.build_requests table either?
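You could confirm that with a quick query against the API database (a hypothetical invocation; point mysql at wherever your nova_api database lives):

```shell
# An empty result confirms the build request is gone
mysql -e "SELECT id, instance_uuid FROM nova_api.build_requests \
  WHERE instance_uuid = '4856d505-c220-4873-b881-836b5b75f7bb';"
```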
A couple of related patches for that instance mapping thing:
1. I have a patch that adds a nova-manage command to clean up busted instance mappings [1]. In this case you'd just --purge that broken instance mapping.
2. mnaser has reported similar weird issues where an instance mapping exists but doesn't point at a cell and the build request is gone and the instance isn't in cell0. For that we have a sanity check patch [2] which might be helpful to you if you hit this again.
If either of those patches are helpful to you, please vote on the changes so we can draw some more eyes to the reviews.
As for the allocations, you can remove those from placement using the osc-placement CLI plugin [3]:
openstack resource provider allocation delete 4856d505-c220-4873-b881-836b5b75f7bb
[1] https://review.opendev.org/#/c/655908/
[2] https://review.opendev.org/#/c/683730/
[3] https://docs.openstack.org/osc-placement/latest/cli/index.html#resource-prov...
Thanks Matt! I saw your "any interest" email earlier and tried that procedure, and it fixed the problem.
Will these patches work on Rocky?
participants (2)
- Albert Braden
- Matt Riedemann