I unfortunately can't add much since I don't have an Ussuri cloud to test with. That said, I'd be curious to see the debug-level output from the controller scheduler/placement as well as from one compute node where a claim could have happened.
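(For reference, and assuming a standard oslo.log setup, turning that on is just a nova.conf change followed by a restart of the nova-api and nova-scheduler services on the controllers:)

```ini
# nova.conf on the controller -- restart nova-api and nova-scheduler afterwards
[DEFAULT]
debug = True
```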

The CPU allocation ratio is 16 by default. That said, you could also leverage pinned CPUs to prevent any overcommit, but that's not as "simple" as telling OpenStack not to oversubscribe on CPU cores.
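For reference, the relevant knobs live in nova.conf on the compute node (a sketch; note that since Stein there is also initial_cpu_allocation_ratio, which only applies when the compute node's resource provider is first created):

```ini
# nova.conf on the compute node -- 1.0 disables CPU overcommit entirely
[DEFAULT]
cpu_allocation_ratio = 1.0
# Only used when the resource provider is first created in placement:
initial_cpu_allocation_ratio = 1.0
```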

On Wed, Jun 30, 2021 at 5:06 PM Jeffrey Mazzone <jmazzone@uchicago.edu> wrote:
Yes, this is almost exactly what I did. No, I am not running MySQL in an HA deployment, and I have run nova-manage api_db sync several times throughout the process below.

I think I found a workaround, but I'm not sure how feasible it is.

First, I changed the allocation ratio to 1:1 in nova.conf on the controller. Nova would not accept this for some reason; it seemed the change needed to be made on the compute node. So I deleted the hypervisor, resource provider, and compute service, changed the ratios on the compute node itself, and then re-added it. The capacity then changed to 64, which is the number of cores on the system. When starting a VM, placement-api.log still reports the same number for "used". See below:
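As a possible alternative to the delete/re-add dance (hedged: this is the osc-placement plugin, and nova-compute may overwrite the inventory on its next periodic update unless the ratio is also changed in nova.conf; note that "inventory set" replaces the provider's entire inventory, so all resource classes must be restated):

```
~# openstack resource provider inventory set 554f2a3b-924e-440c-9847-596064ea0f3f \
     --resource VCPU:total=64 --resource VCPU:allocation_ratio=1.0 \
     --resource MEMORY_MB:total=515655 --resource MEMORY_MB:reserved=512 \
     --resource DISK_GB:total=7096
```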

New ratios
~# openstack resource provider inventory list 554f2a3b-924e-440c-9847-596064ea0f3f
+----------------+------------------+----------+----------+----------+-----------+--------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total  |
+----------------+------------------+----------+----------+----------+-----------+--------+
| VCPU           |              1.0 |        1 |       64 |        0 |         1 |     64 |
| MEMORY_MB      |              1.0 |        1 |   515655 |      512 |         1 | 515655 |
| DISK_GB        |              1.0 |        1 |     7096 |        0 |         1 |   7096 |
+----------------+------------------+----------+----------+----------+-----------+--------+

Error from placement.log
2021-06-30 13:49:24.877 4381 WARNING placement.objects.allocation [req-7dc8930f-1eac-401a-ade7-af36e64c2ba8 a770bde56c9d49e68facb792cf69088c 6da06417e0004cbb87c1e64fe1978de5 - default default] Over capacity for VCPU on resource provider c4199e84-8259-4d0e-9361-9b0d9e6e66b7. Needed: 4, Used: 8206, Capacity: 64.0
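For what it's worth, the check behind that warning appears to be simple arithmetic on the fields it prints. A sketch (my reading of the log message, not the actual placement source):

```python
# Simplified model of placement's "over capacity" check, reconstructed from
# the warning's fields (Needed / Used / Capacity). Hypothetical helper name.
def over_capacity(total, reserved, allocation_ratio, used, needed):
    capacity = (total - reserved) * allocation_ratio
    return used + needed > capacity

# Values from the log line above: 8206 + 4 far exceeds 64.0, so the claim fails.
print(over_capacity(total=64, reserved=0, allocation_ratio=1.0, used=8206, needed=4))
```

This is why raising the ratio to 1024 "works": it inflates capacity past the bogus used value rather than fixing it.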

With that in mind, I ran through the same procedure again but set the ratio to 1024.

New ratios
~# openstack resource provider inventory list 519c1e10-3546-4e3b-a017-3e831376cde8
+----------------+------------------+----------+----------+----------+-----------+--------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total  |
+----------------+------------------+----------+----------+----------+-----------+--------+
| VCPU           |           1024.0 |        1 |       64 |        0 |         1 |     64 |
| MEMORY_MB      |              1.0 |        1 |   515655 |      512 |         1 | 515655 |
| DISK_GB        |              1.0 |        1 |     7096 |        0 |         1 |   7096 |
+----------------+------------------+----------+----------+----------+-----------+--------+


Now I can spin up VMs without issues.

I have one test AZ with two hosts inside, both set to the ratio above. I was able to spin up approximately 45 four-core VMs without issues and with no sign of hitting an upper limit on the hosts.

120 | VCPU=64 | 519c1e10-3546-4e3b-a017-3e831376cde8 | VCPU=88/65536
 23 | VCPU=64 | 8f97a3ba-98a0-475e-a3cf-41425569b2cb | VCPU=96/65536


I have two problems with this fix:

1) The overcommit is now extremely high, and I have no way, besides quotas, to guarantee the system won't be overprovisioned.
2) I still don't know how that "used" resource value is being calculated. When this issue first started, the "used" values were different numbers. Over the past two days, the used value for a 4-core virtual machine has held steady at 8206, but I have no way to guarantee it will stay there.

My initial tests when this started compared the "used" values when building different-size VMs. Here is that list:

 1 core  - 4107
 2 cores - 4108
 4 cores - 4110
 8 cores - 4114
16 cores - 4122
32 cores - 8234

The number on the right is what the "used" value was at the time. Yesterday and today it has been 8206 for a 4-core VM; I have not retested the other sizes.
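Purely as arithmetic on the numbers above (an observation, not an explanation of the root cause): each "used" value is the requested core count plus a constant offset of 4106 up through 16 cores, jumping to 8202 at 32 cores. Notably, 8206 - 4 = 8202 as well, so the phantom offset itself appears to have doubled between the two sets of observations:

```python
# "used" values reported by placement per flavor size (from the list above)
used_by_cores = {1: 4107, 2: 4108, 4: 4110, 8: 4114, 16: 4122, 32: 8234}

# Subtract the legitimately requested cores to expose the phantom offset
offsets = {cores: used - cores for cores, used in used_by_cores.items()}
print(offsets)  # 4106 for 1-16 cores, 8202 for 32 cores
```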

Before I commit to combing through the placement API source code to figure out how the "used" value in the placement log is being calculated, I'm hoping someone knows where and how that value is computed. It does not seem to be a fixed value in the database, and it doesn't seem to be affected by the allocation ratios.
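In case it helps while digging: the "used" figure should just be the sum of the allocation records held against the provider, which you can inspect with the osc-placement commands `openstack resource provider usage show <rp-uuid>` and `openstack resource provider show <rp-uuid> --allocations`, or straight from the placement database (a sketch; table and column names are from memory and worth double-checking against your schema):

```sql
-- Placement DB: per-consumer VCPU allocations for one resource provider.
-- The sum of "used" here should match the value in the warning.
SELECT a.consumer_id, a.used
  FROM allocations a
  JOIN resource_providers rp ON rp.id = a.resource_provider_id
  JOIN resource_classes  rc ON rc.id = a.resource_class_id
 WHERE rp.uuid = '<rp-uuid>'
   AND rc.name = 'VCPU';
```

If the rows sum to 8206 but the consumers don't correspond to real instances, that would point at orphaned allocations rather than a calculation bug.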


Thank you in advance!!
-Jeff Mazzone
Senior Linux Systems Administrator
Center for Translational Data Science
University of Chicago.



On Jun 30, 2021, at 2:40 PM, Laurent Dumont <laurentfdumont@gmail.com> wrote:

In some cases the DEBUG messages are a bit verbose, but they can really walk you through the allocation/scheduling process. You could increase the log level for nova and restart the api + scheduler services on the controllers. I also wonder if a desync of the DB could be the cause. Are you running an HA deployment for the MySQL backend?