I first changed the allocation ratio to 1:1 in nova.conf on the controller. Nova would not accept this for some reason; it seemed like the change needed to be made on the compute node instead. So I deleted the hypervisor, resource provider, and compute
service, changed the ratios on the compute node itself, and then re-added it. The capacity then changed to 64, which is the number of cores on these systems. When starting a VM, though, the “used” value in placement-api.log is still the same. See below:
2021-06-30 13:49:24.877 4381 WARNING placement.objects.allocation [req-7dc8930f-1eac-401a-ade7-af36e64c2ba8 a770bde56c9d49e68facb792cf69088c 6da06417e0004cbb87c1e64fe1978de5 - default default] Over capacity for VCPU on resource provider c4199e84-8259-4d0e-9361-9b0d9e6e66b7. Needed: 4, Used: 8206, Capacity: 64.0
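As far as I can tell (this is just my reading of the fields in that log line, not something I have confirmed in the code), placement is doing roughly this check when it logs that warning:

capacity = (total - reserved) * allocation_ratio   ->  (64 - 0) * 1.0 = 64.0
over capacity when: used + needed > capacity       ->  8206 + 4 > 64.0

so with a 1:1 ratio the inflated “used” value blows straight past the 64-core capacity.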
With that in mind, I did the same procedure again but set the ratio to 1024.
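(For reference, the only compute-side change each time was the CPU ratio option in nova.conf on the compute node, set before deleting and re-adding the service. I believe this is the relevant option, though newer releases also have initial_cpu_allocation_ratio:)

# /etc/nova/nova.conf on the compute node
[DEFAULT]
cpu_allocation_ratio = 1024.0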
New ratios
~# openstack resource provider inventory list 519c1e10-3546-4e3b-a017-3e831376cde8
+----------------+------------------+----------+----------+----------+-----------+--------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total  |
+----------------+------------------+----------+----------+----------+-----------+--------+
| VCPU           | 1024.0           | 1        | 64       | 0        | 1         | 64     |
| MEMORY_MB      | 1.0              | 1        | 515655   | 512      | 1         | 515655 |
| DISK_GB        | 1.0              | 1        | 7096     | 0        | 1         | 7096   |
+----------------+------------------+----------+----------+----------+-----------+--------+
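If the capacity formula above is right, the enforced VCPU capacity on this provider should now be:

(total - reserved) * allocation_ratio = (64 - 0) * 1024.0 = 65536

which matches the /65536 shown in the usage output further down.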
Now I can spin up VMs without issues.
I have one test AZ with two hosts inside, both set to the ratio above. I was able to spin up approximately 45 four-core VMs without issues and with no sign of hitting an upper limit on the hosts.
120 | VCPU=64 | 519c1e10-3546-4e3b-a017-3e831376cde8 | VCPU=88/65536
23 | VCPU=64 | 8f97a3ba-98a0-475e-a3cf-41425569b2cb | VCPU=96/65536
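(The per-provider usage behind those numbers can also be pulled straight from placement; roughly these commands, though the exact osc-placement syntax may differ by version:)

~# openstack resource provider usage show 519c1e10-3546-4e3b-a017-3e831376cde8
~# openstack resource provider allocation show <server-uuid>     # allocations held by one VM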
I have two problems with this fix:
1) The overcommit is now extremely high, and I have no way, besides quotas, to guarantee the system won't be over-provisioned.
2) I still don't know how that “used” resource value is being calculated. When this issue first started, the “used” values were different numbers. Over the past two days, the used value for a 4-core virtual machine has stayed at 8206, but I have no way to guarantee it will stay there.
My initial tests when this started were to compare the “used” values reported when building different sized VMs. Here is that list:
1 core - 4107
2 core - 4108
4 core - 4110
8 core - 4114
16 core - 4122
32 core - 8234
The number on the right is what the “used” value used to be. Yesterday and today it has changed to 8206 for a 4-core VM; I have not re-tested the rest.
Before I commit to combing through the placement API source code to figure out how the “used” value in the placement log is calculated, I'm hoping someone here knows where and how that value is computed. It does not appear to be a fixed
value in the database, and it doesn't seem to be affected by the allocation ratios.
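My working assumption (not yet verified in the code) is that the “used” figure is just the sum of the used amounts across all allocation records held against the resource provider, so listing the provider's allocations directly should show which consumers account for the 8206. Something like this against the placement API (admin credentials assumed):

~# TOKEN=$(openstack token issue -f value -c id)
~# PLACEMENT=$(openstack endpoint list --service placement --interface public -f value -c URL)
~# curl -s -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" \
     "$PLACEMENT/resource_providers/c4199e84-8259-4d0e-9361-9b0d9e6e66b7/allocations" | python3 -m json.tool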
Thank you in advance!!
-Jeff Mazzone
Senior Linux Systems Administrator
Center for Translational Data Science
University of Chicago.
In some cases, the DEBUG messages are a bit verbose, but they can really walk you through the allocation/scheduling process. You could increase the log level for nova and restart the api + scheduler on the controllers. I also wonder whether a desync of the DB could be the cause: are you
running an HA deployment for the MySQL backend?
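A minimal sketch of that, assuming Ubuntu/Debian-style service names (they differ on other distros):

# /etc/nova/nova.conf on the controllers
[DEFAULT]
debug = True

~# systemctl restart nova-api nova-scheduler nova-conductor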