On Thu, Jul 1, 2021 at 01:13, Jeffrey Mazzone <jmazzone@uchicago.edu> wrote:
On Jun 30, 2021, at 5:06 PM, melanie witt <melwittt@gmail.com> wrote:
I suggest you run the 'openstack resource provider show <RP UUID> --allocations' command as Balazs mentioned earlier to show all of the allocations (used resources) on the compute node. I also suggest you run the 'nova-manage placement audit' tool [1] as Sylvain mentioned earlier to show whether there are any orphaned allocations, i.e. allocations that are for instances that no longer exist. The consumer UUID is the instance UUID.
I tried both of those suggestions. "openstack resource provider show <RP UUID> --allocations" shows what is expected: no additional orphaned VMs, and the resources used are correct. Here is an example from a different set of hosts and zones. This host had 2x 16-core VMs on it before the cluster went into this state; you can see them both below. The nova-manage audit commands do not show any orphans either.
~# openstack resource provider show 41ecee2a-ec24-48e5-8b9d-24065d67238a --allocations
+----------------------+--------------------------------------------------------------------------+
| Field                | Value                                                                    |
+----------------------+--------------------------------------------------------------------------+
| uuid                 | 41ecee2a-ec24-48e5-8b9d-24065d67238a                                     |
| name                 | kh09-56                                                                  |
| generation           | 55                                                                       |
| root_provider_uuid   | 41ecee2a-ec24-48e5-8b9d-24065d67238a                                     |
| parent_provider_uuid | None                                                                     |
| allocations          | {'d6b9d19c-1ba9-44c2-97ab-90098509b872': {'resources': {'DISK_GB': 50, 'MEMORY_MB': 16384, 'VCPU': 16}, 'consumer_generation': 1}, 'e0a8401a-0bb6-4612-a496-6a794ebe6cd0': {'resources': {'DISK_GB': 50, 'MEMORY_MB': 16384, 'VCPU': 16}, 'consumer_generation': 1}} |
+----------------------+--------------------------------------------------------------------------+
Usage on the resource provider:

~# openstack resource provider usage show 41ecee2a-ec24-48e5-8b9d-24065d67238a
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU           |    32 |
| MEMORY_MB      | 32768 |
| DISK_GB        |   100 |
+----------------+-------+
All of that looks correct. Requesting allocation candidates for a 4 VCPU VM also shows this host as a candidate:

~# openstack allocation candidate list --resource VCPU=4 | grep 41ecee2a-ec24-48e5-8b9d-24065d67238a
| 41 | VCPU=4 | 41ecee2a-ec24-48e5-8b9d-24065d67238a | VCPU=32/1024,MEMORY_MB=32768/772714,DISK_GB=100/7096 |
The used column in the placement database also shows the correct values for those two VMs:

+---------------------+------------+------+----------------------+--------------------------------------+-------------------+-------+
| created_at          | updated_at | id   | resource_provider_id | consumer_id                          | resource_class_id | used  |
+---------------------+------------+------+----------------------+--------------------------------------+-------------------+-------+
| 2021-06-02 18:45:05 | NULL       | 4060 | 125                  | e0a8401a-0bb6-4612-a496-6a794ebe6cd0 | 2                 | 50    |
| 2021-06-02 18:45:05 | NULL       | 4061 | 125                  | e0a8401a-0bb6-4612-a496-6a794ebe6cd0 | 1                 | 16384 |
| 2021-06-02 18:45:05 | NULL       | 4062 | 125                  | e0a8401a-0bb6-4612-a496-6a794ebe6cd0 | 0                 | 16    |
| 2021-06-04 18:39:13 | NULL       | 7654 | 125                  | d6b9d19c-1ba9-44c2-97ab-90098509b872 | 2                 | 50    |
| 2021-06-04 18:39:13 | NULL       | 7655 | 125                  | d6b9d19c-1ba9-44c2-97ab-90098509b872 | 1                 | 16384 |
| 2021-06-04 18:39:13 | NULL       | 7656 | 125                  | d6b9d19c-1ba9-44c2-97ab-90098509b872 | 0                 | 16    |
+---------------------+------------+------+----------------------+--------------------------------------+-------------------+-------+
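For reference, summing those rows directly in the placement database reproduces the usage output above (a sketch only, using the allocations columns shown in the listing; 0, 1 and 2 are the resource_class_id values for VCPU, MEMORY_MB and DISK_GB in that listing):

SELECT resource_class_id, SUM(used) AS total_used
  FROM allocations
 WHERE resource_provider_id = 125
 GROUP BY resource_class_id;
-- Expected from the rows above: class 0 -> 32, class 1 -> 32768, class 2 -> 100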
When I try to build a VM, though, I get the placement error with the improperly calculated “Used” value:
2021-06-30 19:51:39.732 43832 WARNING placement.objects.allocation [req-de225c66-8297-4b34-9380-26cf9385d658 a770bde56c9d49e68facb792cf69088c 6da06417e0004cbb87c1e64fe1978de5 - default default] Over capacity for VCPU on resource provider b749130c-a368-4332-8a1f-8411851b4b2a. Needed: 4, Used: 18509, Capacity: 1024.0
Again, you confirmed that the compute RP 41ecee2a-ec24-48e5-8b9d-24065d67238a has a consistent resource view, but placement warns about another compute, b749130c-a368-4332-8a1f-8411851b4b2a. Could you try to trace through one single situation? Try to boot a VM that results in the error with the placement over capacity warning, then collect the resource view of the compute RP the placement warning points at. If such tracing does not show the reason, then you can dig into the placement code. The placement warning comes from https://github.com/openstack/placement/blob/f77a7f9928d1156450c48045c48597b2... At the top of that function there is an SQL command you can apply to your DB and the resource provider placement warns about, to see where the used values are coming from.

Cheers,
gibi
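A query along these lines (a sketch only, assuming the standard placement allocations/inventories schema; it is not the exact statement in allocation.py) compares the summed used value against the computed capacity for the provider named in the warning:

SELECT rp.uuid,
       SUM(a.used) AS used,
       (i.total - i.reserved) * i.allocation_ratio AS capacity
  FROM resource_providers rp
  JOIN inventories i
    ON i.resource_provider_id = rp.id
   AND i.resource_class_id = 0              -- VCPU
  LEFT JOIN allocations a
    ON a.resource_provider_id = rp.id
   AND a.resource_class_id = 0
 WHERE rp.uuid = 'b749130c-a368-4332-8a1f-8411851b4b2a'
 GROUP BY rp.uuid, i.total, i.reserved, i.allocation_ratio;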
Outside of changing the allocation ratio, I'm completely lost. I'm confident it has to do with that improper calculation of the used value, but how is it being calculated if it isn't being added up from fixed values in the database, as has been suggested?
Thanks in advance! -Jeff M
The tl;dr on how the value is calculated is there's a table called 'allocations' in the placement database that holds all the values for resource providers and resource classes and it has a 'used' column. If you add up all of the 'used' values for a resource class (VCPU) and resource provider (compute node) then that will be the total used of that resource on that resource provider. You can see this data by 'openstack resource provider show <RP UUID> --allocations' as well.
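As an illustration (a sketch against the standard placement schema; substitute the resource provider UUID from the warning), the individual rows that make up that sum can be listed per consumer like this:

SELECT a.consumer_id, a.used
  FROM allocations a
  JOIN resource_providers rp ON rp.id = a.resource_provider_id
 WHERE rp.uuid = '<RP UUID>'
   AND a.resource_class_id = 0;  -- VCPU
-- Each consumer_id is an instance UUID; the SUM of these rows is the 'used' figure placement reports.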
The allocation ratio will not affect the value of 'used', but it will raise the working value of 'total' above the physical total in order to allow oversubscription. If a compute node has 64 cores and cpu_allocation_ratio is 16, then 64 * 16 = 1024 cores will be allowed for placement on that compute node.
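To see where that figure lives (again a sketch, assuming the standard placement inventories schema), the ratio is applied to the inventory row rather than to the allocations:

SELECT rp.uuid, i.total, i.reserved, i.allocation_ratio,
       (i.total - i.reserved) * i.allocation_ratio AS effective_capacity
  FROM inventories i
  JOIN resource_providers rp ON rp.id = i.resource_provider_id
 WHERE i.resource_class_id = 0;  -- VCPU: e.g. (64 - 0) * 16.0 = 1024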
You likely have "orphaned" allocations for the compute node/resource provider that are no longer mapped to instances, and you can use 'nova-manage placement audit' to find those and optionally delete them. Doing that will clean up your resource provider. First, I would run it without specifying --delete just to see what it shows without modifying anything.