On Thu, Jul 1, 2021 at 01:13, Jeffrey Mazzone <jmazzone@uchicago.edu> wrote:
On Jun 30, 2021, at 5:06 PM, melanie witt <melwittt@gmail.com> wrote:
I suggest you run the 'openstack resource provider show <RP UUID> --allocations' command as Balazs mentioned earlier to show all of the allocations (used resources) on the compute node. I also suggest you run the 'nova-manage placement audit' tool [1] as Sylvain mentioned earlier to show whether there are any orphaned allocations, i.e. allocations that are for instances that no longer exist. The consumer UUID is the instance UUID.
I tried both of those suggestions. "openstack resource provider show <RP UUID> --allocations" shows what is expected: no additional orphaned VMs, and the resources used are correct. Here is an example from a different set of hosts and zones. This host had 2x 16-core VMs on it before the cluster went into this state; you can see them both below. The nova-manage audit commands do not show any orphans either.
~# openstack resource provider show 41ecee2a-ec24-48e5-8b9d-24065d67238a --allocations
+----------------------+--------------------------------------------------------------------------+
| Field                | Value                                                                    |
+----------------------+--------------------------------------------------------------------------+
| uuid                 | 41ecee2a-ec24-48e5-8b9d-24065d67238a                                     |
| name                 | kh09-56                                                                  |
| generation           | 55                                                                       |
| root_provider_uuid   | 41ecee2a-ec24-48e5-8b9d-24065d67238a                                     |
| parent_provider_uuid | None                                                                     |
| allocations          | {'d6b9d19c-1ba9-44c2-97ab-90098509b872': {'resources': {'DISK_GB': 50, 'MEMORY_MB': 16384, 'VCPU': 16}, 'consumer_generation': 1}, 'e0a8401a-0bb6-4612-a496-6a794ebe6cd0': {'resources': {'DISK_GB': 50, 'MEMORY_MB': 16384, 'VCPU': 16}, 'consumer_generation': 1}} |
+----------------------+--------------------------------------------------------------------------+
Usage on the resource provider:

~# openstack resource provider usage show 41ecee2a-ec24-48e5-8b9d-24065d67238a
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU           |    32 |
| MEMORY_MB      | 32768 |
| DISK_GB        |   100 |
+----------------+-------+
All of that looks correct. Requesting allocation candidates for a 4 VCPU VM also shows this host as a candidate:

~# openstack allocation candidate list --resource VCPU=4 | grep 41ecee2a-ec24-48e5-8b9d-24065d67238a
| 41 | VCPU=4 | 41ecee2a-ec24-48e5-8b9d-24065d67238a | VCPU=32/1024,MEMORY_MB=32768/772714,DISK_GB=100/7096 |
The used column in the placement database also shows the correct values for those two VMs:

+---------------------+------------+------+----------------------+--------------------------------------+-------------------+-------+
| created_at          | updated_at | id   | resource_provider_id | consumer_id                          | resource_class_id | used  |
+---------------------+------------+------+----------------------+--------------------------------------+-------------------+-------+
| 2021-06-02 18:45:05 | NULL       | 4060 | 125                  | e0a8401a-0bb6-4612-a496-6a794ebe6cd0 | 2                 | 50    |
| 2021-06-02 18:45:05 | NULL       | 4061 | 125                  | e0a8401a-0bb6-4612-a496-6a794ebe6cd0 | 1                 | 16384 |
| 2021-06-02 18:45:05 | NULL       | 4062 | 125                  | e0a8401a-0bb6-4612-a496-6a794ebe6cd0 | 0                 | 16    |
| 2021-06-04 18:39:13 | NULL       | 7654 | 125                  | d6b9d19c-1ba9-44c2-97ab-90098509b872 | 2                 | 50    |
| 2021-06-04 18:39:13 | NULL       | 7655 | 125                  | d6b9d19c-1ba9-44c2-97ab-90098509b872 | 1                 | 16384 |
| 2021-06-04 18:39:13 | NULL       | 7656 | 125                  | d6b9d19c-1ba9-44c2-97ab-90098509b872 | 0                 | 16    |
+---------------------+------------+------+----------------------+--------------------------------------+-------------------+-------+
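For reference, summing those rows directly in the placement database reproduces the usage output above (a sketch only, using the allocations columns shown in the listing; 0, 1 and 2 are the resource_class_id values for VCPU, MEMORY_MB and DISK_GB in that listing):

SELECT resource_class_id, SUM(used) AS total_used
  FROM allocations
 WHERE resource_provider_id = 125
 GROUP BY resource_class_id;
-- Expected from the rows above: class 0 -> 32, class 1 -> 32768, class 2 -> 100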
When I try to build a VM, though, I get the placement error with the improperly calculated “Used” value:
2021-06-30 19:51:39.732 43832 WARNING placement.objects.allocation [req-de225c66-8297-4b34-9380-26cf9385d658 a770bde56c9d49e68facb792cf69088c 6da06417e0004cbb87c1e64fe1978de5 - default default] Over capacity for VCPU on resource provider b749130c-a368-4332-8a1f-8411851b4b2a. Needed: 4, Used: 18509, Capacity: 1024.0
Again, you confirmed that the compute RP 41ecee2a-ec24-48e5-8b9d-24065d67238a has a consistent resource view, but placement warns about another compute, b749130c-a368-4332-8a1f-8411851b4b2a. Could you try to trace through one single situation? Try to boot a VM that results in the error with the placement over capacity warning, then collect the resource view of the compute RP the placement warning points at. If such tracing does not show the reason, then you can dig into the placement code. The placement warning comes from https://github.com/openstack/placement/blob/f77a7f9928d1156450c48045c48597b2... At the top of that function there is an SQL command you can apply to your DB and the resource provider placement warns about, to see where the used values are coming from.

Cheers,
gibi
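A query along these lines (a sketch only, assuming the standard placement allocations/inventories schema; it is not the exact statement in allocation.py) compares the summed used value against the computed capacity for the provider named in the warning:

SELECT rp.uuid,
       SUM(a.used) AS used,
       (i.total - i.reserved) * i.allocation_ratio AS capacity
  FROM resource_providers rp
  JOIN inventories i
    ON i.resource_provider_id = rp.id
   AND i.resource_class_id = 0              -- VCPU
  LEFT JOIN allocations a
    ON a.resource_provider_id = rp.id
   AND a.resource_class_id = 0
 WHERE rp.uuid = 'b749130c-a368-4332-8a1f-8411851b4b2a'
 GROUP BY rp.uuid, i.total, i.reserved, i.allocation_ratio;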
Outside of changing the allocation ratio, I'm completely lost. I'm confident it has to do with that improper calculation of the used value, but how is it being calculated if it isn't being added up from fixed values in the database, as has been suggested?
Thanks in advance! -Jeff M
The tl;dr on how the value is calculated is there's a table called 'allocations' in the placement database that holds all the values for resource providers and resource classes and it has a 'used' column. If you add up all of the 'used' values for a resource class (VCPU) and resource provider (compute node) then that will be the total used of that resource on that resource provider. You can see this data by 'openstack resource provider show <RP UUID> --allocations' as well.
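As an illustration (a sketch against the standard placement schema; substitute the resource provider UUID from the warning), the individual rows that make up that sum can be listed per consumer like this:

SELECT a.consumer_id, a.used
  FROM allocations a
  JOIN resource_providers rp ON rp.id = a.resource_provider_id
 WHERE rp.uuid = '<RP UUID>'
   AND a.resource_class_id = 0;  -- VCPU
-- Each consumer_id is an instance UUID; the SUM of these rows is the 'used' figure placement reports.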
The allocation ratio will not affect the value of 'used', but it will raise the working value of 'total' above the physical total in order to allow oversubscription. If a compute node has 64 cores and cpu_allocation_ratio is 16, then 64 * 16 = 1024 cores will be allowed for placement on that compute node.
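To see where that figure lives (again a sketch, assuming the standard placement inventories schema), the ratio is applied to the inventory row rather than to the allocations:

SELECT rp.uuid, i.total, i.reserved, i.allocation_ratio,
       (i.total - i.reserved) * i.allocation_ratio AS effective_capacity
  FROM inventories i
  JOIN resource_providers rp ON rp.id = i.resource_provider_id
 WHERE i.resource_class_id = 0;  -- VCPU: e.g. (64 - 0) * 16.0 = 1024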
You likely have "orphaned" allocations for the compute node/resource provider that are no longer mapped to instances, and you can use 'nova-manage placement audit' to find those and optionally delete them. Doing that will clean up your resource provider. First, I would run it without specifying --delete just to see what it shows without modifying anything.