[ops] [nova] [placement] Mismatch between allocations and instances
Matt Riedemann
mriedemos at gmail.com
Fri Jul 5 20:14:34 UTC 2019
On 7/5/2019 1:45 AM, Massimo Sgaravatto wrote:
> I tried to check the allocations on each compute node of a Ocata cloud,
> using the command:
>
> curl -s ${PLACEMENT_ENDPOINT}/resource_providers/${UUID}/allocations -H
> "x-auth-token: $TOKEN" | python -m json.tool
>
Just FYI, you can use osc-placement (an openstack client plugin) on the
command line instead:
https://docs.openstack.org/osc-placement/latest/index.html
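For example, something like this (from memory and untested, so
double-check against the docs for the osc-placement version you
install):

  # list all resource providers (compute nodes)
  openstack resource provider list

  # aggregate usage per resource class on one provider
  openstack resource provider usage show $PROVIDER_UUID

  # allocations for a single consumer (instance or migration)
  # across all providers
  openstack resource provider allocation show $CONSUMER_UUID

I believe newer osc-placement releases can also dump the per-consumer
allocations on a provider, like your curl call does, with
"openstack resource provider show $PROVIDER_UUID --allocations".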
> I found that, on a few compute nodes, there are some instances for which
> there is not a corresponding allocation.
The heal_allocations command [1] might be able to find and fix these up
for you. The bad news for you is that heal_allocations wasn't added
until Rocky and you're on Ocata. The good news is you should be able to
take the current version of the code from master (or stein) and run that
in a container or virtual environment against your Ocata cloud (this
would be particularly useful if you want to use the --dry-run or
--instance options added in Train). You could also potentially backport
those changes to your internal branch, or we could start a discussion
upstream about backporting that tooling to stable branches - though
going to Ocata might be a bit much at this point given Ocata and Pike
are in extended maintenance mode [2].
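Roughly what I have in mind is something like this (just a sketch,
untested, and the paths/config files will obviously differ in your
deployment):

  # on a host that can reach your database and placement endpoint
  virtualenv /tmp/nova-master
  source /tmp/nova-master/bin/activate
  pip install -e 'git+https://opendev.org/openstack/nova#egg=nova'

  # point at a nova.conf with your api/cell database connections and
  # placement credentials, and do a dry run first
  nova-manage --config-file /etc/nova/nova.conf placement \
      heal_allocations --dry-run

  # or heal one instance at a time
  nova-manage --config-file /etc/nova/nova.conf placement \
      heal_allocations --instance $INSTANCE_UUID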
As for *why* the instances on those nodes are missing allocations, it's
hard to say without debugging things. The allocation and resource
tracking code has changed quite a bit since Ocata (in Pike the scheduler
started creating the allocations but the resource tracker in the compute
service could still overwrite those allocations if you had older nodes
during a rolling upgrade). My guess would be that a migration failed or
there was just a bug in Ocata where we didn't clean up or allocate
properly. Again, heal_allocations should add the missing allocation for
you if you can set up the environment to run that command.
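Once that has run, you can sanity check a previously-missing allocation
with osc-placement, e.g.:

  # should now show allocations keyed by resource provider
  openstack resource provider allocation show $INSTANCE_UUID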
>
> On another Rocky cloud, we had the opposite problem: there were
> allocations also for some instances that didn't exist anymore.
> And this caused problems since we were not able to use all the resources
> of the relevant compute nodes: we had to manually remove the "wrong"
> allocations to fix the problem ...
Yup, this can happen for different reasons, usually known bugs for which
you don't have the fix yet, e.g. [3][4], or something failing during a
migration that we aren't cleaning up properly (an
unreported/not-yet-fixed bug).
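A crude way to spot those on a given compute node is to diff the
consumer UUIDs placement reports for that provider against the
instances nova thinks are on the host, something like this (rough
sketch, untested):

  # consumer UUIDs with allocations against the provider
  curl -s $PLACEMENT_ENDPOINT/resource_providers/$PROVIDER_UUID/allocations \
    -H "x-auth-token: $TOKEN" |
    python -c 'import sys, json; print("\n".join(sorted(json.load(sys.stdin)["allocations"])))'

  # instance UUIDs nova thinks are on that host
  openstack server list --all-projects --host $HOSTNAME -c ID -f value | sort

Anything in the first list but not the second is either a leaked
allocation or one held by a migration record (more on that below).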
>
>
> I wonder why/how this problem can happen ...
I mentioned some possibilities above - but I'm sure there are other bugs
that have been fixed which I've omitted here, or things that aren't
fixed yet, especially in failure scenarios (rollback/cleanup handling is
hard).
Note that your Ocata and Rocky cases could be different. Since Queens
(once all compute nodes are >=Queens), during resize, cold migration and
live migration the migration record in nova holds the source node
allocations for the duration of the migration, so the actual *consumer*
of the allocations on a provider in placement might not be an instance
(server) record but a migration. If you were looking up an allocation
consumer by ID in nova with something like "openstack server show
$consumer_id", it might return NotFound because the consumer is actually
a migration record and the allocation was leaked.
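If you hit that, you can usually tell by looking up the consumer UUID in
the migrations table of the nova (cell) database, something like this
(database and column names from memory and deployment-specific, so
verify against your schema):

  mysql nova -e "SELECT id, instance_uuid, migration_type, status \
      FROM migrations WHERE uuid = '$CONSUMER_UUID';"

If that returns a row for a migration that finished (or failed) long ago
and the allocation is still there, it leaked.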
>
> And how can we fix the issue ? Should we manually add the missing
> allocations / manually remove the wrong ones ?
Coincidentally a thread related to this [5] re-surfaced a couple of
weeks ago. I am not sure what Sylvain's progress is on that audit tool,
but the linked bug in that email has some other operator scripts you
could try for the case that there are leaked/orphaned allocations on
compute nodes that no longer have instances.
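If you do end up removing leaked allocations by hand, the consumer-level
delete is probably the easiest route once you are sure the consumer
really is gone, e.g.:

  # removes *all* allocations held by that consumer across providers
  openstack resource provider allocation delete $CONSUMER_UUID

  # or directly against the placement API
  curl -s -X DELETE $PLACEMENT_ENDPOINT/allocations/$CONSUMER_UUID \
    -H "x-auth-token: $TOKEN"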
>
> Thanks, Massimo
>
>
[1] https://docs.openstack.org/nova/latest/cli/nova-manage.html#placement
[2] https://docs.openstack.org/project-team-guide/stable-branches.html
[3] https://bugs.launchpad.net/nova/+bug/1825537
[4] https://bugs.launchpad.net/nova/+bug/1821594
[5]
http://lists.openstack.org/pipermail/openstack-discuss/2019-June/007241.html
--
Thanks,
Matt