[nova][placement] Openstack only building one VM per machine in cluster, then runs out of resources
Hello,

I am installing OpenStack Ussuri and am running into an issue when using Availability Zones. I initially thought it was a quota issue, but that no longer seems to be the case. I started a thread on serverfault and was recommended to submit these questions here as well. Here is the original link:

https://serverfault.com/questions/1064579/openstack-only-building-one-vm-per...

The issue remains: I can successfully build VMs on every host, but only one VM per host. The size of the initial VM does not matter. Since I posted the thread above, I have redeployed the entire cluster, by hand, using the docs on openstack.org. Everything worked as it should; I created 3 test aggregates and 3 test availability zones with no issues for about a month.

All of a sudden, the system reverted to no longer allowing more than one machine to be placed per host. There have been no changes to the controller. I have enabled placement logging now so I can see more information, but I don't understand why it's happening.

Example. Start with a host that has no VMs on it:

~# openstack resource provider usage show 3f9d0deb-936c-474a-bdee-d3df049f073d
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU           | 0     |
| MEMORY_MB      | 0     |
| DISK_GB        | 0     |
+----------------+-------+

Create 1 VM with 4 cores:

~# openstack resource provider usage show 3f9d0deb-936c-474a-bdee-d3df049f073d
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU           | 4     |
| MEMORY_MB      | 0     |
| DISK_GB        | 0     |
+----------------+-------+

The inventory list for that provider is:

~# openstack resource provider inventory list 3f9d0deb-936c-474a-bdee-d3df049f073d
+----------------+------------------+----------+----------+----------+-----------+--------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total  |
+----------------+------------------+----------+----------+----------+-----------+--------+
| VCPU           | 16.0             | 1        | 64       | 0        | 1         | 64     |
| MEMORY_MB      | 1.5              | 1        | 515655   | 512      | 1         | 515655 |
| DISK_GB        | 1.0              | 1        | 7096     | 0        | 1         | 7096   |
+----------------+------------------+----------+----------+----------+-----------+--------+

Trying to start another VM on that host fails with the following log entries:

scheduler.log
"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider

conductor.log
Failed to schedule instances: nova.exception_Remote.NoValidHost_Remote: No valid host was found. There are not enough hosts available.

placement.log
Over capacity for VCPU on resource provider 3f9d0deb-936c-474a-bdee-d3df049f073d. Needed: 4, Used: 8206, Capacity: 1024.0

As you can see, the used value is suddenly 8206 after a single 4-core VM is placed on it. I don't understand what I'm missing or could be doing wrong, and I'm really unsure where this value is being calculated from. All the entries in the database and via openstack commands show the correct values except in this log entry. Has anyone experienced the same or similar behavior? I would appreciate any insight as to what the issue could be.

Thanks in advance!
-Jeff M
That is a bit strange! When you say that you only see this when using AZs, are there any issues when you don't specify the AZ and simply pick the default one? Any other logs with Unable to create allocation for 'VCPU' on resource provider?

On Tue, Jun 29, 2021 at 4:47 PM Jeffrey Mazzone <jmazzone@uchicago.edu> wrote:
Hello,
I am installing Openstack Ussuri and am running into an issue when using Availability Zones. I initially thought it was a quota issue but that no longer seems to be the case. I started a thread on serverfault and was recommended to submit these questions here as well. Here is the original link:
https://serverfault.com/questions/1064579/openstack-only-building-one-vm-per...
The issue remains: I can successfully build VMs on every host, but only one VM per host. The size of the initial VM does not matter. Since I posted the thread above, I have redeployed the entire cluster, by hand, using the docs on openstack.org. Everything worked as it should; I created 3 test aggregates and 3 test availability zones with no issues for about a month.
All of a sudden, the system reverted to no longer allowing more than one machine to be placed per host. There have been no changes to the controller. I have enabled placement logging now so I can see more information, but I don't understand why it's happening.
Example. Start with a host that has no vms on it:
~# openstack resource provider usage show 3f9d0deb-936c-474a-bdee-d3df049f073d
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU           | 0     |
| MEMORY_MB      | 0     |
| DISK_GB        | 0     |
+----------------+-------+
Create 1 vm with 4 cores
~# openstack resource provider usage show 3f9d0deb-936c-474a-bdee-d3df049f073d
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU           | 4     |
| MEMORY_MB      | 0     |
| DISK_GB        | 0     |
+----------------+-------+
The inventory list for that provider is:
~# openstack resource provider inventory list 3f9d0deb-936c-474a-bdee-d3df049f073d
+----------------+------------------+----------+----------+----------+-----------+--------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total  |
+----------------+------------------+----------+----------+----------+-----------+--------+
| VCPU           | 16.0             | 1        | 64       | 0        | 1         | 64     |
| MEMORY_MB      | 1.5              | 1        | 515655   | 512      | 1         | 515655 |
| DISK_GB        | 1.0              | 1        | 7096     | 0        | 1         | 7096   |
+----------------+------------------+----------+----------+----------+-----------+--------+
Trying to start another vm on that host fails with the following log entries:
scheduler.log
"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider
conductor.log
Failed to schedule instances: nova.exception_Remote.NoValidHost_Remote: No valid host was found. There are not enough hosts available.
placement.log
Over capacity for VCPU on resource provider 3f9d0deb-936c-474a-bdee-d3df049f073d. Needed: 4, Used: 8206, Capacity: 1024.0
As you can see, the used value is suddenly 8206 after a single 4-core VM is placed on it. I don't understand what I'm missing or could be doing wrong, and I'm really unsure where this value is being calculated from. All the entries in the database and via openstack commands show the correct values except in this log entry. Has anyone experienced the same or similar behavior? I would appreciate any insight as to what the issue could be.
Thanks in advance!
-Jeff M
On Tue, Jun 29, 2021 at 20:42, Jeffrey Mazzone <jmazzone@uchicago.edu> wrote:
Hello,
[snip]
Trying to start another vm on that host fails with the following log entries:
scheduler.log
"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider
conductor.log
Failed to schedule instances: nova.exception_Remote.NoValidHost_Remote: No valid host was found. There are not enough hosts available.
placement.log
Over capacity for VCPU on resource provider 3f9d0deb-936c-474a-bdee-d3df049f073d. Needed: 4, Used: 8206, Capacity: 1024.0
At this point if you list the resource provider usage on 3f9d0deb-936c-474a-bdee-d3df049f073d again then do you still see 4 VCPU used, or 8206 used? With the "openstack resource provider show 3f9d0deb-936c-474a-bdee-d3df049f073d --allocations" command you could print the UUIDs of the consumers that are actually consuming your VCPUs in placement. So you can try to identify where the 8206 allocation is coming from.

Cheers,
gibi
As you can see, the used value is suddenly 8206 after a single 4-core VM is placed on it. I don't understand what I'm missing or could be doing wrong, and I'm really unsure where this value is being calculated from. All the entries in the database and via openstack commands show the correct values except in this log entry. Has anyone experienced the same or similar behavior? I would appreciate any insight as to what the issue could be.
Thanks in advance!
-Jeff M
Le mer. 30 juin 2021 à 11:31, Balazs Gibizer <balazs.gibizer@est.tech> a écrit :
On Tue, Jun 29, 2021 at 20:42, Jeffrey Mazzone <jmazzone@uchicago.edu> wrote:
Hello,
[snip]
Trying to start another vm on that host fails with the following log entries:
scheduler.log
"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider
conductor.log
Failed to schedule instances: nova.exception_Remote.NoValidHost_Remote: No valid host was found. There are not enough hosts available.
placement.log
Over capacity for VCPU on resource provider 3f9d0deb-936c-474a-bdee-d3df049f073d. Needed: 4, Used: 8206, Capacity: 1024.0
At this point if you list the resource provider usage on 3f9d0deb-936c-474a-bdee-d3df049f073d again then do you still see 4 VCPU used, or 8206 used? With the "openstack resource provider show 3f9d0deb-936c-474a-bdee-d3df049f073d --allocations" command you could print the UUIDs of the consumers that are actually consuming your VCPUs in placement. So you can try to identify where the 8206 allocation is coming from.
Given you also have an Ussuri deployment, you could call the "nova-manage placement audit" command to see whether you have orphaned allocations:

nova-manage placement audit [--verbose] [--delete] [--resource_provider <uuid>]

See details in https://docs.openstack.org/nova/ussuri/cli/nova-manage.html#nova-api-databas...

-Sylvain
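For reference, a read-only first pass (without --delete) scoped to the provider from the logs could look like the sketch below. The UUID is the one already discussed in this thread and the flags are the ones listed above; treat it as an illustration and double-check the exact options against the nova-manage help output on your own deployment:

~# nova-manage placement audit --verbose --resource_provider 3f9d0deb-936c-474a-bdee-d3df049f073d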
Cheers, gibi
As you can see, the used value is suddenly 8206 after a single 4-core VM is placed on it. I don't understand what I'm missing or could be doing wrong, and I'm really unsure where this value is being calculated from. All the entries in the database and via openstack commands show the correct values except in this log entry. Has anyone experienced the same or similar behavior? I would appreciate any insight as to what the issue could be.
Thanks in advance!
-Jeff M
Any other logs with Unable to create allocation for 'VCPU' on resource provider?

No, the 3 logs listed are the only logs where it is showing this message, and VCPU is the only thing it fails for. No memory or disk allocation failures, always VCPU.

At this point if you list the resource provider usage on 3f9d0deb-936c-474a-bdee-d3df049f073d again then do you still see 4 VCPU used, or 8206 used?

The usage shows everything correctly:

~# openstack resource provider usage show 3f9d0deb-936c-474a-bdee-d3df049f073d
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU           | 4     |
| MEMORY_MB      | 8192  |
| DISK_GB        | 10    |
+----------------+-------+

Allocations show the same:

~# openstack resource provider show 3f9d0deb-936c-474a-bdee-d3df049f073d --allocations
+-------------+--------------------------------------------------------------------------------------------------------+
| Field       | Value                                                                                                  |
+-------------+--------------------------------------------------------------------------------------------------------+
| uuid        | 3f9d0deb-936c-474a-bdee-d3df049f073d                                                                   |
| name        | kh09-50                                                                                                 |
| generation  | 244                                                                                                     |
| allocations | {'4a6fe4c2-ece4-45c2-b7a2-fdfd41308988': {'resources': {'VCPU': 4, 'MEMORY_MB': 8192, 'DISK_GB': 10}}}  |
+-------------+--------------------------------------------------------------------------------------------------------+

Allocation candidate list shows all 228 servers in the cluster available:

~# openstack allocation candidate list --resource VCPU=4 -c "resource provider" -f value | wc -l
228

Starting a new VM on that host shows the following in the logs:

Placement-api.log
2021-06-30 12:27:21.335 4382 WARNING placement.objects.allocation [req-f4d74abc-7b18-407a-85e7-f1c268bd5e53 a770bde56c9d49e68facb792cf69088c 6da06417e0004cbb87c1e64fe1978de5 - default default] Over capacity for VCPU on resource provider 0e0d8ec8-bb31-4da5-a813-bd73560ff7d6. Needed: 4, Used: 8206, Capacity: 1024.0

nova-scheduler.log
2021-06-30 12:27:21.429 6895 WARNING nova.scheduler.client.report [req-3106f4da-1df9-4370-b56b-8ba6b62980dc aacc7911abf349b783eed20ad176c034 23920ecfbf294e71ad558aa49cb17de8 - default default] Failed to save allocation for a9296e22-4b50-45b7-a442-1fce0a844bcd. Got HTTP 409: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider '3f9d0deb-936c-474a-bdee-d3df049f073d'. The requested amount would exceed the capacity. ", "code": "placement.undefined_code", "request_id": "req-e9f12a3a-3136-4501-8bd6-4add31f0eb82"}]}

I really can't figure out where this seemingly last-minute calculation of used resources comes from.

Given you also have an Ussuri deployment, you could call the nova-manage placement audit command to see whether you have orphaned allocations: nova-manage placement audit [--verbose] [--delete] [--resource_provider <uuid>]

When running this command, it says the UUID does not exist.

Thank you! I truly appreciate everyone's help.

-Jeff M
In some cases, the DEBUG messages are a bit verbose but can really walk you through the allocation/scheduling process. You could increase it for nova and restart the api + scheduler on the controllers. I wonder if a desync of the DB could be the cause? Are you running an HA deployment for the mysql backend?

On Wed, Jun 30, 2021 at 1:44 PM Jeffrey Mazzone <jmazzone@uchicago.edu> wrote:
Any other logs with Unable to create allocation for 'VCPU' on resource provider?
No, the 3 logs listed are the only logs where it is showing this message and VCPU is the only thing it fails for. No memory or disk allocation failures, always VCPU.
At this point if you list the resource provider usage on 3f9d0deb-936c-474a-bdee-d3df049f073d again then do you still see 4 VCPU used, or 8206 used?
The usage shows everything correctly:
~# openstack resource provider usage show 3f9d0deb-936c-474a-bdee-d3df049f073d
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU           | 4     |
| MEMORY_MB      | 8192  |
| DISK_GB        | 10    |
+----------------+-------+
Allocations shows the same:
~# openstack resource provider show 3f9d0deb-936c-474a-bdee-d3df049f073d --allocations
+-------------+--------------------------------------------------------------------------------------------------------+
| Field       | Value                                                                                                  |
+-------------+--------------------------------------------------------------------------------------------------------+
| uuid        | 3f9d0deb-936c-474a-bdee-d3df049f073d                                                                   |
| name        | kh09-50                                                                                                 |
| generation  | 244                                                                                                     |
| allocations | {'4a6fe4c2-ece4-45c2-b7a2-fdfd41308988': {'resources': {'VCPU': 4, 'MEMORY_MB': 8192, 'DISK_GB': 10}}}  |
+-------------+--------------------------------------------------------------------------------------------------------+
Allocation candidate list shows all 228 servers in the cluster available:
~# openstack allocation candidate list --resource VCPU=4 -c "resource provider" -f value | wc -l
228
Starting a new vm on that host shows the following in the logs:
Placement-api.log
2021-06-30 12:27:21.335 4382 WARNING placement.objects.allocation [req-f4d74abc-7b18-407a-85e7-f1c268bd5e53 a770bde56c9d49e68facb792cf69088c 6da06417e0004cbb87c1e64fe1978de5 - default default] Over capacity for VCPU on resource provider 0e0d8ec8-bb31-4da5-a813-bd73560ff7d6. Needed: 4, Used: 8206, Capacity: 1024.0
nova-scheduler.log
2021-06-30 12:27:21.429 6895 WARNING nova.scheduler.client.report [req-3106f4da-1df9-4370-b56b-8ba6b62980dc aacc7911abf349b783eed20ad176c034 23920ecfbf294e71ad558aa49cb17de8 - default default] Failed to save allocation for a9296e22-4b50-45b7-a442-1fce0a844bcd. Got HTTP 409: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider '3f9d0deb-936c-474a-bdee-d3df049f073d'. The requested amount would exceed the capacity. ", "code": "placement.undefined_code", "request_id": "req-e9f12a3a-3136-4501-8bd6-4add31f0eb82"}]}
I really can't figure out where this seemingly last-minute calculation of used resources comes from.
Given you also have an Ussuri deployment, you could call the nova-manage placement audit command to see whether you have orphaned allocations: nova-manage placement audit [--verbose] [--delete] [--resource_provider <uuid>]
When running this command, it says the UUID does not exist.
Thank you! I truly appreciate everyone's help.
-Jeff M
I think you can also use "nova-manage placement audit" without the UUID of one of the computes. That should iterate over everything.

On Wed, Jun 30, 2021 at 3:40 PM Laurent Dumont <laurentfdumont@gmail.com> wrote:
In some cases, the DEBUG messages are a bit verbose but can really walk you through the allocation/scheduling process. You could increase it for nova and restart the api + scheduler on the controllers. I wonder if a desync of the DB could be the cause? Are you running an HA deployment for the mysql backend?
On Wed, Jun 30, 2021 at 1:44 PM Jeffrey Mazzone <jmazzone@uchicago.edu> wrote:
Any other logs with Unable to create allocation for 'VCPU' on resource provider?
No, the 3 logs listed are the only logs where it is showing this message and VCPU is the only thing it fails for. No memory or disk allocation failures, always VCPU.
At this point if you list the resource provider usage on 3f9d0deb-936c-474a-bdee-d3df049f073d again then do you still see 4 VCPU used, or 8206 used?
The usage shows everything correctly:
~# openstack resource provider usage show 3f9d0deb-936c-474a-bdee-d3df049f073d
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU           | 4     |
| MEMORY_MB      | 8192  |
| DISK_GB        | 10    |
+----------------+-------+
Allocations shows the same:
~# openstack resource provider show 3f9d0deb-936c-474a-bdee-d3df049f073d --allocations
+-------------+--------------------------------------------------------------------------------------------------------+
| Field       | Value                                                                                                  |
+-------------+--------------------------------------------------------------------------------------------------------+
| uuid        | 3f9d0deb-936c-474a-bdee-d3df049f073d                                                                   |
| name        | kh09-50                                                                                                 |
| generation  | 244                                                                                                     |
| allocations | {'4a6fe4c2-ece4-45c2-b7a2-fdfd41308988': {'resources': {'VCPU': 4, 'MEMORY_MB': 8192, 'DISK_GB': 10}}}  |
+-------------+--------------------------------------------------------------------------------------------------------+
Allocation candidate list shows all 228 servers in the cluster available:
~# openstack allocation candidate list --resource VCPU=4 -c "resource provider" -f value | wc -l
228
Starting a new vm on that host shows the following in the logs:
Placement-api.log
2021-06-30 12:27:21.335 4382 WARNING placement.objects.allocation [req-f4d74abc-7b18-407a-85e7-f1c268bd5e53 a770bde56c9d49e68facb792cf69088c 6da06417e0004cbb87c1e64fe1978de5 - default default] Over capacity for VCPU on resource provider 0e0d8ec8-bb31-4da5-a813-bd73560ff7d6. Needed: 4, Used: 8206, Capacity: 1024.0
nova-scheduler.log
2021-06-30 12:27:21.429 6895 WARNING nova.scheduler.client.report [req-3106f4da-1df9-4370-b56b-8ba6b62980dc aacc7911abf349b783eed20ad176c034 23920ecfbf294e71ad558aa49cb17de8 - default default] Failed to save allocation for a9296e22-4b50-45b7-a442-1fce0a844bcd. Got HTTP 409: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider '3f9d0deb-936c-474a-bdee-d3df049f073d'. The requested amount would exceed the capacity. ", "code": "placement.undefined_code", "request_id": "req-e9f12a3a-3136-4501-8bd6-4add31f0eb82"}]}
I really can't figure out where this seemingly last-minute calculation of used resources comes from.
Given you also have an Ussuri deployment, you could call the nova-manage placement audit command to see whether you have orphaned allocations: nova-manage placement audit [--verbose] [--delete] [--resource_provider <uuid>]
When running this command, it says the UUID does not exist.
Thank you! I truly appreciate everyone's help.
-Jeff M
Yes, this is almost exactly what I did. No, I am not running mysql in an HA deployment, and I have run nova-manage api_db sync several times throughout the process below.

I think I found a workaround but I'm not sure how feasible it is.

I first changed the allocation ratio to 1:1 in the nova.conf on the controller. Nova would not accept this for some reason and seemed like it needed to be changed on the compute node. So I deleted the hypervisor, resource provider, and compute service, changed the ratios on the compute node itself, and then re-added it back in. Now the capacity changed to 64, which is the number of cores on the systems. When starting a VM, it still gets the same number for "used" in the placement-api.log. See below:

New ratios

~# openstack resource provider inventory list 554f2a3b-924e-440c-9847-596064ea0f3f
+----------------+------------------+----------+----------+----------+-----------+--------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total  |
+----------------+------------------+----------+----------+----------+-----------+--------+
| VCPU           | 1.0              | 1        | 64       | 0        | 1         | 64     |
| MEMORY_MB      | 1.0              | 1        | 515655   | 512      | 1         | 515655 |
| DISK_GB        | 1.0              | 1        | 7096     | 0        | 1         | 7096   |
+----------------+------------------+----------+----------+----------+-----------+--------+

Error from placement.log

2021-06-30 13:49:24.877 4381 WARNING placement.objects.allocation [req-7dc8930f-1eac-401a-ade7-af36e64c2ba8 a770bde56c9d49e68facb792cf69088c 6da06417e0004cbb87c1e64fe1978de5 - default default] Over capacity for VCPU on resource provider c4199e84-8259-4d0e-9361-9b0d9e6e66b7. Needed: 4, Used: 8206, Capacity: 64.0

With that in mind, I did the same procedure again but set the ratio to 1024.

New ratios

~# openstack resource provider inventory list 519c1e10-3546-4e3b-a017-3e831376cde8
+----------------+------------------+----------+----------+----------+-----------+--------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total  |
+----------------+------------------+----------+----------+----------+-----------+--------+
| VCPU           | 1024.0           | 1        | 64       | 0        | 1         | 64     |
| MEMORY_MB      | 1.0              | 1        | 515655   | 512      | 1         | 515655 |
| DISK_GB        | 1.0              | 1        | 7096     | 0        | 1         | 7096   |
+----------------+------------------+----------+----------+----------+-----------+--------+

Now I can spin up VMs without issues.

I have 1 test AZ with 2 hosts inside. I have set these hosts to the ratio above. I was able to spin up approximately 45 four-core VMs without issues and no signs of it hitting an upper limit on the host.

120 | VCPU=64 | 519c1e10-3546-4e3b-a017-3e831376cde8 | VCPU=88/65536
23  | VCPU=64 | 8f97a3ba-98a0-475e-a3cf-41425569b2cb | VCPU=96/65536

I have 2 problems with this fix.

1) The overcommit is now super high and I have no way, besides quotas, to guarantee the system won't be over-provisioned.
2) I still don't know how that "used" resources value is being calculated. When this issue first started, the "used" resources were a different number. Over the past two days, the used resources for a 4-core virtual machine have remained at 8206, but I have no way to guarantee this.

My initial tests when this started were to compare the resource values when building different size VMs. Here is that list:

1 core  - 4107
2 core  - 4108
4 core  - 4110
8 core  - 4114
16 core - 4122
32 core - 8234

The number on the right is the number the "used" value used to be. Yesterday and today, it has changed to 8206 for a 4-core VM; I have not tested the rest.
Before I commit to combing through the placement api source code to figure out how the "used" value in the placement log is being calculated, I'm hoping someone knows where and how that value is being calculated. It does not seem to be a fixed value in the database and it doesn't seem to be affected by the allocation ratios.

Thank you in advance!!

-Jeff Mazzone
Senior Linux Systems Administrator
Center for Translational Data Science
University of Chicago.

On Jun 30, 2021, at 2:40 PM, Laurent Dumont <laurentfdumont@gmail.com> wrote:

In some cases, the DEBUG messages are a bit verbose but can really walk you through the allocation/scheduling process. You could increase it for nova and restart the api + scheduler on the controllers. I wonder if a desync of the DB could be the cause? Are you running an HA deployment for the mysql backend?
I unfortunately can't add much as I don't have an Ussuri cloud to test with. That said, I would be curious about the debug level outputs from both the controller scheduler/placement as well as one compute where a claim could have happened. The allocation ratio for CPU is 16 by default. That said, you could also leverage pinned CPUs to prevent any overcommit. But that's not as "simple" as telling Openstack not to oversub on CPU cores.

On Wed, Jun 30, 2021 at 5:06 PM Jeffrey Mazzone <jmazzone@uchicago.edu> wrote:
Yes, this is almost exactly what I did. No, I am not running mysql in an HA deployment, and I have run nova-manage api_db sync several times throughout the process below.
I think I found a workaround but I'm not sure how feasible it is.
I first changed the allocation ratio to 1:1 in the nova.conf on the controller. Nova would not accept this for some reason and seemed like it needed to be changed on the compute node. So I deleted the hypervisor, resource provider, and compute service, changed the ratios on the compute node itself, and then re-added it back in. Now the capacity changed to 64, which is the number of cores on the systems. When starting a VM, it still gets the same number for "used" in the placement-api.log. See below:
New ratios
~# openstack resource provider inventory list 554f2a3b-924e-440c-9847-596064ea0f3f
+----------------+------------------+----------+----------+----------+-----------+--------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total  |
+----------------+------------------+----------+----------+----------+-----------+--------+
| VCPU           | 1.0              | 1        | 64       | 0        | 1         | 64     |
| MEMORY_MB      | 1.0              | 1        | 515655   | 512      | 1         | 515655 |
| DISK_GB        | 1.0              | 1        | 7096     | 0        | 1         | 7096   |
+----------------+------------------+----------+----------+----------+-----------+--------+
Error from placement.log
2021-06-30 13:49:24.877 4381 WARNING placement.objects.allocation [req-7dc8930f-1eac-401a-ade7-af36e64c2ba8 a770bde56c9d49e68facb792cf69088c 6da06417e0004cbb87c1e64fe1978de5 - default default] Over capacity for VCPU on resource provider c4199e84-8259-4d0e-9361-9b0d9e6e66b7. Needed: 4, Used: 8206, Capacity: 64.0
With that in mind, I did the same procedure again but set the ratio to 1024
New ratios
~# openstack resource provider inventory list 519c1e10-3546-4e3b-a017-3e831376cde8
+----------------+------------------+----------+----------+----------+-----------+--------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total  |
+----------------+------------------+----------+----------+----------+-----------+--------+
| VCPU           | 1024.0           | 1        | 64       | 0        | 1         | 64     |
| MEMORY_MB      | 1.0              | 1        | 515655   | 512      | 1         | 515655 |
| DISK_GB        | 1.0              | 1        | 7096     | 0        | 1         | 7096   |
+----------------+------------------+----------+----------+----------+-----------+--------+
Now I can spin up vms without issues.
I have 1 test AZ with 2 hosts inside. I have set these hosts to the ratio above. I was able to spin up approx 45 4x core VMs without issues and no signs of it hitting an upper limit on the host.
120 | VCPU=64 | 519c1e10-3546-4e3b-a017-3e831376cde8 | VCPU=88/65536
23  | VCPU=64 | 8f97a3ba-98a0-475e-a3cf-41425569b2cb | VCPU=96/65536
I have 2 problems with this fix.
1) The overcommit is now super high and I have no way, besides quotas, to guarantee the system won't be over-provisioned.
2) I still don't know how that "used" resources value is being calculated. When this issue first started, the "used" resources were a different number. Over the past two days, the used resources for a 4-core virtual machine have remained at 8206, but I have no way to guarantee this.
My initial tests when this started were to compare the resource values when building different size VMs. Here is that list:
1 core  - 4107
2 core  - 4108
4 core  - 4110
8 core  - 4114
16 core - 4122
32 core - 8234
The number on the right is the number the "used" value used to be. Yesterday and today, it has changed to 8206 for a 4-core VM; I have not tested the rest.
Before I commit to combing through the placement api source code to figure out how the "used" value in the placement log is being calculated, I'm hoping someone knows where and how that value is being calculated. It does not seem to be a fixed value in the database and it doesn't seem to be affected by the allocation ratios.
Thank you in advance!!

-Jeff Mazzone
Senior Linux Systems Administrator
Center for Translational Data Science
University of Chicago.
On Jun 30, 2021, at 2:40 PM, Laurent Dumont <laurentfdumont@gmail.com> wrote:
In some cases, the DEBUG messages are a bit verbose but can really walk you through the allocation/scheduling process. You could increase it for nova and restart the api + scheduler on the controllers. I wonder if a desync of the DB could be the cause? Are you running an HA deployment for the mysql backend?
On 6/30/21 14:06, Jeffrey Mazzone wrote: [snip]
Before I commit to combing through the placement api source code to figure out how the "used" value in the placement log is being calculated, I'm hoping someone knows where and how that value is being calculated. It does not seem to be a fixed value in the database and it doesn't seem to be affected by the allocation ratios.
I suggest you run the 'openstack resource provider show <RP UUID> --allocations' command as Balazs mentioned earlier to show all of the allocations (used resources) on the compute node. I also suggest you run the 'nova-manage placement audit' tool [1] as Sylvain mentioned earlier to show whether there are any orphaned allocations, i.e. allocations that are for instances that no longer exist. The consumer UUID is the instance UUID.

The tl;dr on how the value is calculated is there's a table called 'allocations' in the placement database that holds all the values for resource providers and resource classes and it has a 'used' column. If you add up all of the 'used' values for a resource class (VCPU) and resource provider (compute node) then that will be the total used of that resource on that resource provider. You can see this data by 'openstack resource provider show <RP UUID> --allocations' as well.

The allocation ratio will not affect the value of 'used' but it will affect the working value of 'total' to be considered higher than it actually is in order to oversubscribe. If a compute node has 64 cores and cpu_allocation_ratio is 16 then 64 * 16 = 1024 cores will be allowed for placement on that compute node.

You likely have "orphaned" allocations for the compute node/resource provider that are not mapped to instances any more and you can use 'nova-manage placement audit' to find those and optionally delete them. Doing that will cleanup your resource provider. First, I would run it without specifying --delete just to see what it shows without modifying anything.

HTH,
-melwitt

[1] https://docs.openstack.org/nova/ussuri/cli/nova-manage.html#placement
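As a concrete, purely illustrative version of that lookup (assuming the placement database is simply named "placement" and using the allocations table columns that appear later in this thread), the per-class sums for one resource provider can be pulled straight from the DB:

~# mysql placement -e "
SELECT a.resource_class_id, SUM(a.used) AS total_used
FROM allocations a
JOIN resource_providers rp ON rp.id = a.resource_provider_id
WHERE rp.uuid = '3f9d0deb-936c-474a-bdee-d3df049f073d'
GROUP BY a.resource_class_id;"

If those sums match what 'openstack resource provider usage show' reports, then the inflated "Used" value in the warning is not coming from extra rows in the allocations table for that provider.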
On Jun 30, 2021, at 5:06 PM, melanie witt <melwittt@gmail.com> wrote:

I suggest you run the 'openstack resource provider show <RP UUID> --allocations' command as Balazs mentioned earlier to show all of the allocations (used resources) on the compute node. I also suggest you run the 'nova-manage placement audit' tool [1] as Sylvain mentioned earlier to show whether there are any orphaned allocations, i.e. allocations that are for instances that no longer exist. The consumer UUID is the instance UUID.

I did both of those suggestions. "openstack resource provider show <RP UUID> --allocations" shows what is expected: no additional orphaned VMs, and the resources used are correct. Here is an example from a different set of hosts and zones. This host had 2x 16-core VMs on it before the cluster went into this state. You can see them both below. The nova-manage audit commands do not show any orphans either.

~# openstack resource provider show 41ecee2a-ec24-48e5-8b9d-24065d67238a --allocations
+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field                | Value                                                                                                                                                                    |
+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| uuid                 | 41ecee2a-ec24-48e5-8b9d-24065d67238a                                                                                                                                     |
| name                 | kh09-56                                                                                                                                                                  |
| generation           | 55                                                                                                                                                                       |
| root_provider_uuid   | 41ecee2a-ec24-48e5-8b9d-24065d67238a                                                                                                                                     |
| parent_provider_uuid | None                                                                                                                                                                     |
| allocations          | {'d6b9d19c-1ba9-44c2-97ab-90098509b872': {'resources': {'DISK_GB': 50, 'MEMORY_MB': 16384, 'VCPU': 16}, 'consumer_generation': 1}, 'e0a8401a-0bb6-4612-a496-6a794ebe6cd0': {'resources': {'DISK_GB': 50, 'MEMORY_MB': 16384, 'VCPU': 16}, 'consumer_generation': 1}} |
+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Usage on the resource provider:

~# openstack resource provider usage show 41ecee2a-ec24-48e5-8b9d-24065d67238a
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU           | 32    |
| MEMORY_MB      | 32768 |
| DISK_GB        | 100   |
+----------------+-------+

All of that looks correct.
Requesting it to check allocations for a 4 VCPU VM also shows it as a candidate:

~# openstack allocation candidate list --resource VCPU=4 | grep 41ecee2a-ec24-48e5-8b9d-24065d67238a
| 41 | VCPU=4 | 41ecee2a-ec24-48e5-8b9d-24065d67238a | VCPU=32/1024,MEMORY_MB=32768/772714,DISK_GB=100/7096

In the placement database, the used column also shows the correct values for the information provided above with those 2 VMs on it:

+---------------------+------------+------+----------------------+--------------------------------------+-------------------+-------+
| created_at          | updated_at | id   | resource_provider_id | consumer_id                          | resource_class_id | used  |
+---------------------+------------+------+----------------------+--------------------------------------+-------------------+-------+
| 2021-06-02 18:45:05 | NULL       | 4060 | 125                  | e0a8401a-0bb6-4612-a496-6a794ebe6cd0 | 2                 | 50    |
| 2021-06-02 18:45:05 | NULL       | 4061 | 125                  | e0a8401a-0bb6-4612-a496-6a794ebe6cd0 | 1                 | 16384 |
| 2021-06-02 18:45:05 | NULL       | 4062 | 125                  | e0a8401a-0bb6-4612-a496-6a794ebe6cd0 | 0                 | 16    |
| 2021-06-04 18:39:13 | NULL       | 7654 | 125                  | d6b9d19c-1ba9-44c2-97ab-90098509b872 | 2                 | 50    |
| 2021-06-04 18:39:13 | NULL       | 7655 | 125                  | d6b9d19c-1ba9-44c2-97ab-90098509b872 | 1                 | 16384 |
| 2021-06-04 18:39:13 | NULL       | 7656 | 125                  | d6b9d19c-1ba9-44c2-97ab-90098509b872 | 0                 | 16    |

Trying to build a VM though, I get the placement error with the improperly calculated "Used" values:

2021-06-30 19:51:39.732 43832 WARNING placement.objects.allocation [req-de225c66-8297-4b34-9380-26cf9385d658 a770bde56c9d49e68facb792cf69088c 6da06417e0004cbb87c1e64fe1978de5 - default default] Over capacity for VCPU on resource provider b749130c-a368-4332-8a1f-8411851b4b2a. Needed: 4, Used: 18509, Capacity: 1024.0

Outside of changing the allocation ratio, I'm completely lost. I'm confident it has to do with that improper calculation of the used value, but how is it being calculated if it isn't being added up from fixed values in the database as has been suggested?

Thanks in advance!
-Jeff M

The tl;dr on how the value is calculated is there's a table called 'allocations' in the placement database that holds all the values for resource providers and resource classes and it has a 'used' column. If you add up all of the 'used' values for a resource class (VCPU) and resource provider (compute node) then that will be the total used of that resource on that resource provider. You can see this data by 'openstack resource provider show <RP UUID> --allocations' as well.

The allocation ratio will not affect the value of 'used' but it will affect the working value of 'total' to be considered higher than it actually is in order to oversubscribe. If a compute node has 64 cores and cpu_allocation_ratio is 16 then 64 * 16 = 1024 cores will be allowed for placement on that compute node.

You likely have "orphaned" allocations for the compute node/resource provider that are not mapped to instances any more and you can use 'nova-manage placement audit' to find those and optionally delete them. Doing that will cleanup your resource provider. First, I would run it without specifying --delete just to see what it shows without modifying anything.
I'm curious to see if I can reproduce the issue in my test-env. I never tried puppet-openstack so might as well see how it goes! The ServerFault issue mentions the puppet-openstack integration being used to deploy Ussuri? Specifically, the puppet modules being at the 17.4 version? But looking at https://docs.openstack.org/puppet-openstack-guide/latest/install/releases.ht... - the modules for Ussuri should be at 16.x? Could it be some kind of weird setup of the deployment modules for Ussuri/placement that didn't go as planned?

On Wed, Jun 30, 2021 at 9:13 PM Jeffrey Mazzone <jmazzone@uchicago.edu> wrote:
On Jun 30, 2021, at 5:06 PM, melanie witt <melwittt@gmail.com> wrote:
I suggest you run the 'openstack resource provider show <RP UUID> --allocations' command as Balazs mentioned earlier to show all of the allocations (used resources) on the compute node. I also suggest you run the 'nova-manage placement audit' tool [1] as Sylvain mentioned earlier to show whether there are any orphaned allocations, i.e. allocations that are for instances that no longer exist. The consumer UUID is the instance UUID.
I did both of those suggestions. "openstack resource provider show <RP UUID> --allocations" shows what is expected: no additional orphaned VMs, and the resources used are correct. Here is an example from a different set of hosts and zones. This host had 2x 16-core VMs on it before the cluster went into this state. You can see them both below. The nova-manage audit commands do not show any orphans either.
~# openstack resource provider show 41ecee2a-ec24-48e5-8b9d-24065d67238a --allocations
+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field                | Value                                                                                                                                                                    |
+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| uuid                 | 41ecee2a-ec24-48e5-8b9d-24065d67238a                                                                                                                                     |
| name                 | kh09-56                                                                                                                                                                  |
| generation           | 55                                                                                                                                                                       |
| root_provider_uuid   | 41ecee2a-ec24-48e5-8b9d-24065d67238a                                                                                                                                     |
| parent_provider_uuid | None                                                                                                                                                                     |
| allocations          | {'d6b9d19c-1ba9-44c2-97ab-90098509b872': {'resources': {'DISK_GB': 50, 'MEMORY_MB': 16384, 'VCPU': 16}, 'consumer_generation': 1}, 'e0a8401a-0bb6-4612-a496-6a794ebe6cd0': {'resources': {'DISK_GB': 50, 'MEMORY_MB': 16384, 'VCPU': 16}, 'consumer_generation': 1}} |
+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Usage on the resource provider:
~# openstack resource provider usage show 41ecee2a-ec24-48e5-8b9d-24065d67238a
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU           | 32    |
| MEMORY_MB      | 32768 |
| DISK_GB        | 100   |
+----------------+-------+
All of that looks correct. Requesting it to check allocations for a 4 VCPU vm also shows it as a candidate:
~# openstack allocation candidate list --resource VCPU=4 | grep 41ecee2a-ec24-48e5-8b9d-24065d67238a
| 41 | VCPU=4 | 41ecee2a-ec24-48e5-8b9d-24065d67238a | VCPU=32/1024,MEMORY_MB=32768/772714,DISK_GB=100/7096
In the placement database, the used column also shows the correct values for the information provided above with those 2 VMs on it:
+---------------------+------------+------+----------------------+--------------------------------------+-------------------+-------+
| created_at          | updated_at | id   | resource_provider_id | consumer_id                          | resource_class_id | used  |
+---------------------+------------+------+----------------------+--------------------------------------+-------------------+-------+
| 2021-06-02 18:45:05 | NULL       | 4060 | 125                  | e0a8401a-0bb6-4612-a496-6a794ebe6cd0 | 2                 | 50    |
| 2021-06-02 18:45:05 | NULL       | 4061 | 125                  | e0a8401a-0bb6-4612-a496-6a794ebe6cd0 | 1                 | 16384 |
| 2021-06-02 18:45:05 | NULL       | 4062 | 125                  | e0a8401a-0bb6-4612-a496-6a794ebe6cd0 | 0                 | 16    |
| 2021-06-04 18:39:13 | NULL       | 7654 | 125                  | d6b9d19c-1ba9-44c2-97ab-90098509b872 | 2                 | 50    |
| 2021-06-04 18:39:13 | NULL       | 7655 | 125                  | d6b9d19c-1ba9-44c2-97ab-90098509b872 | 1                 | 16384 |
| 2021-06-04 18:39:13 | NULL       | 7656 | 125                  | d6b9d19c-1ba9-44c2-97ab-90098509b872 | 0                 | 16    |
Trying to build a VM though, I get the placement error with the improperly calculated "Used" values:
2021-06-30 19:51:39.732 43832 WARNING placement.objects.allocation [req-de225c66-8297-4b34-9380-26cf9385d658 a770bde56c9d49e68facb792cf69088c 6da06417e0004cbb87c1e64fe1978de5 - default default] Over capacity for VCPU on resource provider b749130c-a368-4332-8a1f-8411851b4b2a. Needed: 4, Used: 18509, Capacity: 1024.0
Outside of changing the allocation ratio, I'm completely lost. I'm confident it has to do with that improper calculation of the used value, but how is it being calculated if it isn't being added up from fixed values in the database as has been suggested?
Thanks in advance! -Jeff M
The tl;dr on how the value is calculated is there's a table called 'allocations' in the placement database that holds all the values for resource providers and resource classes and it has a 'used' column. If you add up all of the 'used' values for a resource class (VCPU) and resource provider (compute node) then that will be the total used of that resource on that resource provider. You can see this data by 'openstack resource provider show <RP UUID> --allocations' as well.
The allocation ratio will not affect the value of 'used' but it will affect the working value of 'total' to be considered higher than it actually is in order to oversubscribe. If a compute node has 64 cores and cpu_allocation_ratio is 16 then 64 * 16 = 1024 cores will be allowed for placement on that compute node.
You likely have "orphaned" allocations for the compute node/resource provider that are not mapped to instances any more and you can use 'nova-manage placement audit' to find those and optionally delete them. Doing that will cleanup your resource provider. First, I would run it without specifying --delete just to see what it shows without modifying anything.
On Thu, Jul 1, 2021 at 01:13, Jeffrey Mazzone <jmazzone@uchicago.edu> wrote:
On Jun 30, 2021, at 5:06 PM, melanie witt <melwittt@gmail.com> wrote:
I suggest you run the 'openstack resource provider show <RP UUID> --allocations' command as Balazs mentioned earlier to show all of the allocations (used resources) on the compute node. I also suggest you run the 'nova-manage placement audit' tool [1] as Sylvain mentioned earlier to show whether there are any orphaned allocations, i.e. allocations that are for instances that no longer exist. The consumer UUID is the instance UUID.
I did both of those suggestions. "openstack resource provider show <RP UUID> --allocations" shows what is expected: no additional orphaned VMs, and the resources used are correct. Here is an example from a different set of hosts and zones. This host had 2x 16-core VMs on it before the cluster went into this state. You can see them both below. The nova-manage audit commands do not show any orphans either.
~# openstack resource provider show 41ecee2a-ec24-48e5-8b9d-24065d67238a --allocations
+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field                | Value                                                                                                                                                                    |
+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| uuid                 | 41ecee2a-ec24-48e5-8b9d-24065d67238a                                                                                                                                     |
| name                 | kh09-56                                                                                                                                                                  |
| generation           | 55                                                                                                                                                                       |
| root_provider_uuid   | 41ecee2a-ec24-48e5-8b9d-24065d67238a                                                                                                                                     |
| parent_provider_uuid | None                                                                                                                                                                     |
| allocations          | {'d6b9d19c-1ba9-44c2-97ab-90098509b872': {'resources': {'DISK_GB': 50, 'MEMORY_MB': 16384, 'VCPU': 16}, 'consumer_generation': 1}, 'e0a8401a-0bb6-4612-a496-6a794ebe6cd0': {'resources': {'DISK_GB': 50, 'MEMORY_MB': 16384, 'VCPU': 16}, 'consumer_generation': 1}} |
+----------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Usage on the resource provider:

~# openstack resource provider usage show 41ecee2a-ec24-48e5-8b9d-24065d67238a
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU           | 32    |
| MEMORY_MB      | 32768 |
| DISK_GB        | 100   |
+----------------+-------+
All of that looks correct. Requesting it to check allocations for a 4 VCPU VM also shows it as a candidate:

~# openstack allocation candidate list --resource VCPU=4 | grep 41ecee2a-ec24-48e5-8b9d-24065d67238a
| 41 | VCPU=4 | 41ecee2a-ec24-48e5-8b9d-24065d67238a | VCPU=32/1024,MEMORY_MB=32768/772714,DISK_GB=100/7096
In the placement database, the used column also shows the correct values for the information provided above with those 2 VMs on it:

+---------------------+------------+------+----------------------+--------------------------------------+-------------------+-------+
| created_at          | updated_at | id   | resource_provider_id | consumer_id                          | resource_class_id | used  |
+---------------------+------------+------+----------------------+--------------------------------------+-------------------+-------+
| 2021-06-02 18:45:05 | NULL       | 4060 | 125                  | e0a8401a-0bb6-4612-a496-6a794ebe6cd0 | 2                 | 50    |
| 2021-06-02 18:45:05 | NULL       | 4061 | 125                  | e0a8401a-0bb6-4612-a496-6a794ebe6cd0 | 1                 | 16384 |
| 2021-06-02 18:45:05 | NULL       | 4062 | 125                  | e0a8401a-0bb6-4612-a496-6a794ebe6cd0 | 0                 | 16    |
| 2021-06-04 18:39:13 | NULL       | 7654 | 125                  | d6b9d19c-1ba9-44c2-97ab-90098509b872 | 2                 | 50    |
| 2021-06-04 18:39:13 | NULL       | 7655 | 125                  | d6b9d19c-1ba9-44c2-97ab-90098509b872 | 1                 | 16384 |
| 2021-06-04 18:39:13 | NULL       | 7656 | 125                  | d6b9d19c-1ba9-44c2-97ab-90098509b872 | 0                 | 16    |
Trying to build a VM though, I get the placement error with the improperly calculated "Used" values:
2021-06-30 19:51:39.732 43832 WARNING placement.objects.allocation [req-de225c66-8297-4b34-9380-26cf9385d658 a770bde56c9d49e68facb792cf69088c 6da06417e0004cbb87c1e64fe1978de5 - default default] Over capacity for VCPU on resource provider b749130c-a368-4332-8a1f-8411851b4b2a. Needed: 4, Used: 18509, Capacity: 1024.0
Again you confirmed that the compute RP 41ecee2a-ec24-48e5-8b9d-24065d67238a has a consistent resource view, but placement warns about another compute, b749130c-a368-4332-8a1f-8411851b4b2a.

Could you try to trace through one single situation? Try to boot a VM that results in the error with the placement over capacity warning. Then collect the resource view of the compute RP the placement warning points at.

If the result of such tracing does not show the reason, then you can dig into the placement code. The placement warning comes from https://github.com/openstack/placement/blob/f77a7f9928d1156450c48045c48597b2... At the top of that function there is an SQL query you can try to apply to your DB and the resource provider placement warns about, to see where the used values are coming from.

Cheers,
gibi
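Not the exact statement from that function, but a simpler hand-rolled check along the same lines (a sketch only, assuming the placement database is named "placement") is to list the individual allocation rows for the provider the warning names and see which consumer_id values the 18509 is being summed from:

~# mysql placement -e "
SELECT a.consumer_id, a.resource_class_id, a.used
FROM allocations a
JOIN resource_providers rp ON rp.id = a.resource_provider_id
WHERE rp.uuid = 'b749130c-a368-4332-8a1f-8411851b4b2a'
ORDER BY a.consumer_id;"

Any consumer_id in that output that does not correspond to an existing instance UUID would be a candidate for cleanup with 'nova-manage placement audit'.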
Outside of changing the allocation ratio, I'm completely lost. I'm confident it has to do with that improper calculation of the used value, but how is it being calculated if it isn't being added up from fixed values in the database as has been suggested?
Thanks in advance! -Jeff M
The tl;dr on how the value is calculated is there's a table called 'allocations' in the placement database that holds all the values for resource providers and resource classes and it has a 'used' column. If you add up all of the 'used' values for a resource class (VCPU) and resource provider (compute node) then that will be the total used of that resource on that resource provider. You can see this data by 'openstack resource provider show <RP UUID> --allocations' as well.
The allocation ratio will not affect the value of 'used' but it will affect the working value of 'total' to be considered higher than it actually is in order to oversubscribe. If a compute node has 64 cores and cpu_allocation_ratio is 16 then 64 * 16 = 1024 cores will be allowed for placement on that compute node.
You likely have "orphaned" allocations for the compute node/resource provider that are not mapped to instances any more and you can use 'nova-manage placement audit' to find those and optionally delete them. Doing that will cleanup your resource provider. First, I would run it without specifying --delete just to see what it shows without modifying anything.
On Wed, Jun 30, 2021 at 21:06, Jeffrey Mazzone <jmazzone@uchicago.edu> wrote:
Yes, this is almost exactly what I did. No, I am not running mysql in an HA deployment, and I have run nova-manage api_db sync several times throughout the process below.
I think I found a workaround but I'm not sure how feasible it is.
I first changed the allocation ratio to 1:1 in the nova.conf on the controller. Nova would not accept this for some reason and seemed like it needed to be changed on the compute node. So I deleted the hypervisor, resource provider, and compute service, changed the ratios on the compute node itself, and then re-added it back in. Now the capacity changed to 64, which is the number of cores on the systems. When starting a VM, it still gets the same number for "used" in the placement-api.log. See below:
New ratios

~# openstack resource provider inventory list 554f2a3b-924e-440c-9847-596064ea0f3f
+----------------+------------------+----------+----------+----------+-----------+--------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total  |
+----------------+------------------+----------+----------+----------+-----------+--------+
| VCPU           | 1.0              | 1        | 64       | 0        | 1         | 64     |
| MEMORY_MB      | 1.0              | 1        | 515655   | 512      | 1         | 515655 |
| DISK_GB        | 1.0              | 1        | 7096     | 0        | 1         | 7096   |
+----------------+------------------+----------+----------+----------+-----------+--------+
Error from placement.log

2021-06-30 13:49:24.877 4381 WARNING placement.objects.allocation [req-7dc8930f-1eac-401a-ade7-af36e64c2ba8 a770bde56c9d49e68facb792cf69088c 6da06417e0004cbb87c1e64fe1978de5 - default default] Over capacity for VCPU on resource provider c4199e84-8259-4d0e-9361-9b0d9e6e66b7. Needed: 4, Used: 8206, Capacity: 64.0
With that in mind, I did the same procedure again but set the ratio to 1024
New ratios

~# openstack resource provider inventory list 519c1e10-3546-4e3b-a017-3e831376cde8
+----------------+------------------+----------+----------+----------+-----------+--------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total  |
+----------------+------------------+----------+----------+----------+-----------+--------+
| VCPU           | 1024.0           | 1        | 64       | 0        | 1         | 64     |
| MEMORY_MB      | 1.0              | 1        | 515655   | 512      | 1         | 515655 |
| DISK_GB        | 1.0              | 1        | 7096     | 0        | 1         | 7096   |
+----------------+------------------+----------+----------+----------+-----------+--------+
You are collecting data from the compute RP 519c1e10-3546-4e3b-a017-3e831376cde8, but placement warns about another compute RP, c4199e84-8259-4d0e-9361-9b0d9e6e66b7.
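A quick check along the lines already used in this thread would be to look at the provider the warning actually names (illustrative only, reusing the command shown earlier):

~# openstack resource provider show c4199e84-8259-4d0e-9361-9b0d9e6e66b7 --allocations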
Now I can spin up vms without issues.
I have 1 test AZ with 2 hosts inside. I have set these hosts to the ratio above. I was able to spin up approx 45 4x core VMs without issues and no signs of it hitting an upper limit on the host.
120 | VCPU=64 | 519c1e10-3546-4e3b-a017-3e831376cde8 | VCPU=88/65536
23  | VCPU=64 | 8f97a3ba-98a0-475e-a3cf-41425569b2cb | VCPU=96/65536
I have 2 problems with this fix.
1) The overcommit is now super high and I have no way, besides quotas, to guarantee the system won't be over-provisioned.
2) I still don't know how that "used" resources value is being calculated. When this issue first started, the "used" resources were a different number. Over the past two days, the used resources for a 4-core virtual machine have remained at 8206, but I have no way to guarantee this.
My initial tests when this started were to compare the resource values when building different size VMs. Here is that list:
1 core  - 4107
2 core  - 4108
4 core  - 4110
8 core  - 4114
16 core - 4122
32 core - 8234
The number on the right is the number the "used" value used to be. Yesterday and today, it has changed to 8206 for a 4-core VM; I have not tested the rest.
Before I commit to combing through the placement api source code to figure out how the "used" value in the placement log is being calculated, I'm hoping someone knows where and how that value is being calculated. It does not seem to be a fixed value in the database and it doesn't seem to be affected by the allocation ratios.
Thank you in advance!!

-Jeff Mazzone
Senior Linux Systems Administrator
Center for Translational Data Science
University of Chicago.
On Jun 30, 2021, at 2:40 PM, Laurent Dumont <laurentfdumont@gmail.com> wrote:
In some cases, the DEBUG messages are a bit verbose but can really walk you through the allocation/scheduling process. You could increase it for nova and restart the api + scheduler on the controllers. I wonder if a desync of the DB could be the cause? Are you running an HA deployment for the mysql backend?
On Wed, Jun 30, 2021 at 17:32, Jeffrey Mazzone <jmazzone@uchicago.edu> wrote:
Any other logs with Unable to create allocation for 'VCPU' on resource provider?
No, the 3 logs listed are the only logs where it is showing this message and VCPU is the only thing it fails for. No memory or disk allocation failures, always VCPU.
At this point if you list the resource provider usage on 3f9d0deb-936c-474a-bdee-d3df049f073d again then do you still see 4 VCPU used, or 8206 used?
The usage shows everything correctly:

~# openstack resource provider usage show 3f9d0deb-936c-474a-bdee-d3df049f073d
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU           | 4     |
| MEMORY_MB      | 8192  |
| DISK_GB        | 10    |
+----------------+-------+
Allocations shows the same:
~# openstack resource provider show 3f9d0deb-936c-474a-bdee-d3df049f073d --allocations
+-------------+--------------------------------------------------------------------------------------------------------+
| Field       | Value                                                                                                  |
+-------------+--------------------------------------------------------------------------------------------------------+
| uuid        | 3f9d0deb-936c-474a-bdee-d3df049f073d                                                                   |
| name        | kh09-50                                                                                                 |
| generation  | 244                                                                                                     |
| allocations | {'4a6fe4c2-ece4-45c2-b7a2-fdfd41308988': {'resources': {'VCPU': 4, 'MEMORY_MB': 8192, 'DISK_GB': 10}}}  |
+-------------+--------------------------------------------------------------------------------------------------------+
Allocation candidate list shows all 228 servers in the cluster available:
~# openstack allocation candidate list --resource VCPU=4 -c "resource provider" -f value | wc -l
228
Starting a new vm on that host shows the following in the logs:
Placement-api.log

2021-06-30 12:27:21.335 4382 WARNING placement.objects.allocation [req-f4d74abc-7b18-407a-85e7-f1c268bd5e53 a770bde56c9d49e68facb792cf69088c 6da06417e0004cbb87c1e64fe1978de5 - default default] Over capacity for VCPU on resource provider 0e0d8ec8-bb31-4da5-a813-bd73560ff7d6. Needed: 4, Used: 8206, Capacity: 1024.0
You said "Starting a new vm on that host". How do you do that? Something is strange. Now placement points to other than 3f9d0deb-936c-474a-bdee-d3df049f073d, it points to 0e0d8ec8-bb31-4da5-a813-bd73560ff7d6.
nova-scheduler.log

2021-06-30 12:27:21.429 6895 WARNING nova.scheduler.client.report [req-3106f4da-1df9-4370-b56b-8ba6b62980dc aacc7911abf349b783eed20ad176c034 23920ecfbf294e71ad558aa49cb17de8 - default default] Failed to save allocation for a9296e22-4b50-45b7-a442-1fce0a844bcd. Got HTTP 409: {"errors": [{"status": 409, "title": "Conflict", "detail": "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Unable to create allocation for 'VCPU' on resource provider '3f9d0deb-936c-474a-bdee-d3df049f073d'. The requested amount would exceed the capacity. ", "code": "placement.undefined_code", "request_id": "req-e9f12a3a-3136-4501-8bd6-4add31f0eb82"}]}
But then the nova scheduler log still complains about 3f9d0deb-936c-474a-bdee-d3df049f073d instead of 0e0d8ec8-bb31-4da5-a813-bd73560ff7d6. I think we are looking at two different requests here, as the request id in the nova-scheduler log, req-3106f4da-1df9-4370-b56b-8ba6b62980dc, does not match the request id in the placement log, req-f4d74abc-7b18-407a-85e7-f1c268bd5e53.
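One way to line the two sides up (a sketch; the log locations are an assumption and vary by distro and deployment method) is to grep both services for the same request id, so you are comparing the scheduler attempt with the placement decision it actually triggered:

~# grep req-3106f4da-1df9-4370-b56b-8ba6b62980dc /var/log/nova/nova-scheduler.log /var/log/placement/placement-api.log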
I really can't figure out where this seemingly last-minute calculation of used resources comes from.
Given you also have an Ussuri deployment, you could call the nova-manage placement audit command to see whether you have orphaned allocations: nova-manage placement audit [--verbose] [--delete] [--resource_provider <uuid>]
When running this command, it says the UUID does not exist.
Thank you! I truly appreciate everyone's help.
-Jeff M
participants (5)
- Balazs Gibizer
- Jeffrey Mazzone
- Laurent Dumont
- melanie witt
- Sylvain Bauza