<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jul 5, 2019 at 10:21 PM Matt Riedemann <<a href="mailto:mriedemos@gmail.com">mriedemos@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 7/5/2019 1:45 AM, Massimo Sgaravatto wrote:<br>
> I tried to check the allocations on each compute node of a Ocata cloud, <br>
> using the command:<br>
> <br>
> curl -s ${PLACEMENT_ENDPOINT}/resource_providers/${UUID}/allocations -H <br>
> "x-auth-token: $TOKEN" | python -m json.tool<br>
><br>
<br>
Just FYI you can use osc-placement (openstack client plugin) for command <br>
line:<br>
<br>
<a href="https://docs.openstack.org/osc-placement/latest/index.html" rel="noreferrer" target="_blank">https://docs.openstack.org/osc-placement/latest/index.html</a><br>
<br>
> I found that, on a few compute nodes, there are some instances for which <br>
> there is not a corresponding allocation.<br>
<br>
The heal_allocations command [1] might be able to find and fix these up <br>
for you. The bad news for you is that heal_allocations wasn't added <br>
until Rocky and you're on Ocata. The good news is you should be able to <br>
take the current version of the code from master (or stein) and run that <br>
in a container or virtual environment against your Ocata cloud (this <br>
would be particularly useful if you want to use the --dry-run or <br>
--instance options added in Train). You could also potentially backport <br>
those changes to your internal branch, or we could start a discussion <br>
upstream about backporting that tooling to stable branches - though <br>
going to Ocata might be a bit much at this point given Ocata and Pike <br>
are in extended maintenance mode [2].<br>
<br>
As for *why* the instances on those nodes are missing allocations, it's <br>
hard to say without debugging things. The allocation and resource <br>
tracking code has changed quite a bit since Ocata (in Pike the scheduler <br>
started creating the allocations but the resource tracker in the compute <br>
service could still overwrite those allocations if you had older nodes <br>
during a rolling upgrade). My guess would be that a migration failed or <br>
there was simply a bug in Ocata where we didn't clean up or allocate <br>
properly. Again, heal_allocations should add the missing allocations for <br>
you if you can set up the environment to run that command.<br>
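<br>
If it helps, the virtualenv approach would look roughly like this (the <br>
version pin and paths are just illustrative, and nova-manage needs a <br>
nova.conf with working API database and placement credentials for your <br>
cloud):<br>
<br>
python3 -m venv /tmp/nova-heal<br>
# Train (nova 20.x) or newer if you want the --dry-run/--instance options<br>
/tmp/nova-heal/bin/pip install 'nova>=20.0.0'<br>
# report what would be healed without actually changing anything<br>
/tmp/nova-heal/bin/nova-manage --config-file /etc/nova/nova.conf placement heal_allocations --dry-run --verbose<br>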
<br>
> <br>
> On another Rocky cloud, we had the opposite problem: there were <br>
> allocations also for some instances that didn't exist anymore.<br>
> And this caused problems since we were not able to use all the resources <br>
> of the relevant compute nodes: we had to manually remove the "wrong" <br>
> allocations to fix the problem ...<br>
<br>
Yup, this can happen for different reasons, usually due to known bugs <br>
for which you don't have the fix yet, e.g. [3][4], or because something <br>
failed during a migration and we didn't clean up properly (an <br>
unreported/not-yet-fixed bug).<br>
<br>
> <br>
> <br>
> I wonder why/how this problem can happen ...<br>
<br>
I mentioned some possibilities above - but I'm sure there are other bugs <br>
that have been fixed which I've omitted here, or things that aren't <br>
fixed yet, especially in failure scenarios (rollback/cleanup handling is <br>
hard).<br>
<br>
Note that your Ocata and Rocky cases could be different. Since Queens <br>
(once all compute nodes are >=Queens), during a resize, cold migration <br>
or live migration the migration record in nova holds the source node <br>
allocations for the duration of the migration. That means the actual <br>
*consumer* of the allocations for a provider in placement might not be <br>
an instance (server) record but a migration, so if you look up an <br>
allocation consumer by ID in nova using something like "openstack server <br>
show $consumer_id" it might return NotFound because the consumer is <br>
actually a migration record and the allocation was leaked.<br>
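<br>
In other words, to figure out what a given consumer UUID actually is, you <br>
could do something like the following (just a sketch; the last query <br>
assumes direct access to a MySQL-backed nova cell database with a Queens <br>
or newer schema, where migrations have a uuid column):<br>
<br>
# what is this consumer holding in placement?<br>
openstack resource provider allocation show $CONSUMER_ID<br>
<br>
# is it a server?<br>
openstack server show $CONSUMER_ID<br>
<br>
# if that returns NotFound, check whether it's a migration record instead<br>
mysql nova -e "SELECT id, instance_uuid, migration_type, status FROM migrations WHERE uuid='$CONSUMER_ID'"<br>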
<br>
> <br>
> And how can we fix the issue ? Should we manually add the missing <br>
> allocations / manually remove the wrong ones ?<br>
<br>
Coincidentally a thread related to this [5] re-surfaced a couple of <br>
weeks ago. I am not sure what Sylvain's progress is on that audit tool, <br>
but the linked bug in that email has some other operator scripts you <br>
could try for the case where there are leaked/orphaned allocations on <br>
compute nodes that no longer have instances.<br>
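<br>
For reference, once you've confirmed a consumer no longer exists (neither <br>
as a server nor as an in-progress migration), cleaning up its leaked <br>
allocation boils down to something like this ($CONSUMER_UUID being the <br>
orphaned consumer; obviously double-check before deleting):<br>
<br>
openstack resource provider allocation delete $CONSUMER_UUID<br>
<br>
# or with the raw API, same style as your curl example:<br>
curl -s -X DELETE ${PLACEMENT_ENDPOINT}/allocations/${CONSUMER_UUID} -H "x-auth-token: $TOKEN"<br>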
<br></blockquote><div><br></div><div>Yeah, I'm still fighting with the change due to some issues, but I'll hopefully upload it in the next few days.</div><div>-Sylvain</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> <br>
> Thanks, Massimo<br>
> <br>
> <br>
<br>
[1] <a href="https://docs.openstack.org/nova/latest/cli/nova-manage.html#placement" rel="noreferrer" target="_blank">https://docs.openstack.org/nova/latest/cli/nova-manage.html#placement</a><br>
[2] <a href="https://docs.openstack.org/project-team-guide/stable-branches.html" rel="noreferrer" target="_blank">https://docs.openstack.org/project-team-guide/stable-branches.html</a><br>
[3] <a href="https://bugs.launchpad.net/nova/+bug/1825537" rel="noreferrer" target="_blank">https://bugs.launchpad.net/nova/+bug/1825537</a><br>
[4] <a href="https://bugs.launchpad.net/nova/+bug/1821594" rel="noreferrer" target="_blank">https://bugs.launchpad.net/nova/+bug/1821594</a><br>
[5] <br>
<a href="http://lists.openstack.org/pipermail/openstack-discuss/2019-June/007241.html" rel="noreferrer" target="_blank">http://lists.openstack.org/pipermail/openstack-discuss/2019-June/007241.html</a><br>
<br>
-- <br>
<br>
Thanks,<br>
<br>
Matt<br>
<br>
</blockquote></div></div>