<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jul 5, 2019 at 10:21 PM Matt Riedemann <<a href="mailto:mriedemos@gmail.com">mriedemos@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 7/5/2019 1:45 AM, Massimo Sgaravatto wrote:<br>
> I tried to check the allocations on each compute node of a Ocata cloud, <br>
> using the command:<br>
> <br>
> curl -s ${PLACEMENT_ENDPOINT}/resource_providers/${UUID}/allocations -H <br>
> "x-auth-token: $TOKEN" | python -m json.tool<br>
><br>
<br>
Just FYI you can use osc-placement (openstack client plugin) for command <br>
line:<br>
<br>
<a href="https://docs.openstack.org/osc-placement/latest/index.html" rel="noreferrer" target="_blank">https://docs.openstack.org/osc-placement/latest/index.html</a><br>
<br>
> I found that, on a few compute nodes, there are some instances for which <br>
> there is not a corresponding allocation.<br>
<br>
The heal_allocations command [1] might be able to find and fix these up <br>
for you. The bad news for you is that heal_allocations wasn't added <br>
until Rocky and you're on Ocata. The good news is you should be able to <br>
take the current version of the code from master (or stein) and run that <br>
in a container or virtual environment against your Ocata cloud (this <br>
would be particularly useful if you want to use the --dry-run or <br>
--instance options added in Train). You could also potentially backport <br>
those changes to your internal branch, or we could start a discussion <br>
upstream about backporting that tooling to stable branches - though <br>
going to Ocata might be a bit much at this point given Ocata and Pike <br>
are in extended maintenance mode [2].<br>
<br>
As for *why* the instances on those nodes are missing allocations, it's <br>
hard to say without debugging things. The allocation and resource <br>
tracking code has changed quite a bit since Ocata (in Pike the scheduler <br>
started creating the allocations but the resource tracker in the compute <br>
service could still overwrite those allocations if you had older nodes <br>
during a rolling upgrade). My guess would be that a migration failed or <br>
there was simply a bug in Ocata where we didn't clean up or allocate <br>
properly. Again, heal_allocations should add the missing allocations for <br>
you if you can set up the environment to run that command.<br>
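<br>
If it helps, the virtualenv approach would look roughly like this (the <br>
version pin and paths are just illustrative, and nova-manage needs a <br>
nova.conf with working API database and placement credentials for your <br>
cloud):<br>
<br>
python3 -m venv /tmp/nova-heal<br>
# Train (nova 20.x) or newer if you want the --dry-run/--instance options<br>
/tmp/nova-heal/bin/pip install 'nova>=20.0.0'<br>
# report what would be healed without actually changing anything<br>
/tmp/nova-heal/bin/nova-manage --config-file /etc/nova/nova.conf placement heal_allocations --dry-run --verbose<br>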
<br>
> <br>
> On another Rocky cloud, we had the opposite problem: there were <br>
> allocations also for some instances that didn't exist anymore.<br>
> And this caused problems since we were not able to use all the resources <br>
> of the relevant compute nodes: we had to manually remove the "wrong" <br>
> allocations to fix the problem ...<br>
<br>
Yup, this can happen for different reasons, usually due to known bugs <br>
for which you don't have the fix yet, e.g. [3][4], or because something <br>
failed during a migration and we didn't clean up properly (an <br>
unreported/not-yet-fixed bug).<br>
<br>
> <br>
> <br>
> I wonder why/how this problem can happen ...<br>
<br>
I mentioned some possibilities above - but I'm sure there are other bugs <br>
that have been fixed which I've omitted here, or things that aren't <br>
fixed yet, especially in failure scenarios (rollback/cleanup handling is <br>
hard).<br>
<br>
Note that your Ocata and Rocky cases could be different. Since Queens <br>
(once all compute nodes are >=Queens), during a resize, cold migration <br>
or live migration the migration record in nova holds the source node <br>
allocations for the duration of the migration. That means the actual <br>
*consumer* of the allocations for a provider in placement might not be <br>
an instance (server) record but a migration, so if you look up an <br>
allocation consumer by ID in nova using something like "openstack server <br>
show $consumer_id" it might return NotFound because the consumer is <br>
actually a migration record and the allocation was leaked.<br>
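<br>
In other words, to figure out what a given consumer UUID actually is, you <br>
could do something like the following (just a sketch; the last query <br>
assumes direct access to a MySQL-backed nova cell database with a Queens <br>
or newer schema, where migrations have a uuid column):<br>
<br>
# what is this consumer holding in placement?<br>
openstack resource provider allocation show $CONSUMER_ID<br>
<br>
# is it a server?<br>
openstack server show $CONSUMER_ID<br>
<br>
# if that returns NotFound, check whether it's a migration record instead<br>
mysql nova -e "SELECT id, instance_uuid, migration_type, status FROM migrations WHERE uuid='$CONSUMER_ID'"<br>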
<br>
> <br>
> And how can we fix the issue ? Should we manually add the missing <br>
> allocations / manually remove the wrong ones ?<br>
<br>
Coincidentally a thread related to this [5] re-surfaced a couple of <br>
weeks ago. I am not sure what Sylvain's progress is on that audit tool, <br>
but the linked bug in that email has some other operator scripts you <br>
could try for the case where there are leaked/orphaned allocations on <br>
compute nodes that no longer have instances.<br>
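<br>
For reference, once you've confirmed a consumer no longer exists (neither <br>
as a server nor as an in-progress migration), cleaning up its leaked <br>
allocation boils down to something like this ($CONSUMER_UUID being the <br>
orphaned consumer; obviously double-check before deleting):<br>
<br>
openstack resource provider allocation delete $CONSUMER_UUID<br>
<br>
# or with the raw API, same style as your curl example:<br>
curl -s -X DELETE ${PLACEMENT_ENDPOINT}/allocations/${CONSUMER_UUID} -H "x-auth-token: $TOKEN"<br>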
<br></blockquote><div><br></div><div>Yeah, I'm still fighting with the change due to some issues, but I'll hopefully upload it in the next few days.</div><div>-Sylvain</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
> <br>
> Thanks, Massimo<br>
> <br>
> <br>
<br>
[1] <a href="https://docs.openstack.org/nova/latest/cli/nova-manage.html#placement" rel="noreferrer" target="_blank">https://docs.openstack.org/nova/latest/cli/nova-manage.html#placement</a><br>
[2] <a href="https://docs.openstack.org/project-team-guide/stable-branches.html" rel="noreferrer" target="_blank">https://docs.openstack.org/project-team-guide/stable-branches.html</a><br>
[3] <a href="https://bugs.launchpad.net/nova/+bug/1825537" rel="noreferrer" target="_blank">https://bugs.launchpad.net/nova/+bug/1825537</a><br>
[4] <a href="https://bugs.launchpad.net/nova/+bug/1821594" rel="noreferrer" target="_blank">https://bugs.launchpad.net/nova/+bug/1821594</a><br>
[5] <br>
<a href="http://lists.openstack.org/pipermail/openstack-discuss/2019-June/007241.html" rel="noreferrer" target="_blank">http://lists.openstack.org/pipermail/openstack-discuss/2019-June/007241.html</a><br>
<br>
-- <br>
<br>
Thanks,<br>
<br>
Matt<br>
<br>
</blockquote></div></div>