[openstack-dev] [nova] How to debug no valid host failures with placement

Jay Pipes jaypipes at gmail.com
Thu Aug 9 19:23:00 UTC 2018


On Wed, Aug 1, 2018 at 11:15 AM, Ben Nemec <openstack at nemebean.com> wrote:

> Hi,
>
> I'm having an issue with no valid host errors when starting instances and
> I'm struggling to figure out why.  I thought the problem was disk space,
> but I changed the disk_allocation_ratio and I'm still getting no valid
> host.  The host does have plenty of disk space free, so that shouldn't be a
> problem.
>
> However, I'm not even sure it's disk that's causing the failures because I
> can't find any information in the logs about why the no valid host is
> happening.  All I get from the scheduler is:
>
> "Got no allocation candidates from the Placement API. This may be a
> temporary occurrence as compute nodes start up and begin reporting
> inventory to the Placement service."
>
> While in placement I see:
>
> 2018-08-01 15:02:22.062 20 DEBUG nova.api.openstack.placement.requestlog
> [req-0a830ce9-e2af-413a-86cb-b47ae129b676 fc44fe5cefef43f4b921b9123c95e694
> b07e6dc2e6284b00ac7070aa3457c15e - default default] Starting request:
> 10.2.2.201 "GET /placement/allocation_candidates?limit=1000&resources=DISK_GB%3A20%2CMEMORY_MB%3A2048%2CVCPU%3A1"
> __call__ /usr/lib/python2.7/site-packages/nova/api/openstack/placement/requestlog.py:38
> 2018-08-01 15:02:22.103 20 INFO nova.api.openstack.placement.requestlog
> [req-0a830ce9-e2af-413a-86cb-b47ae129b676 fc44fe5cefef43f4b921b9123c95e694
> b07e6dc2e6284b00ac7070aa3457c15e - default default] 10.2.2.201 "GET
> /placement/allocation_candidates?limit=1000&resources=DISK_GB%3A20%2CMEMORY_MB%3A2048%2CVCPU%3A1"
> status: 200 len: 53 microversion: 1.25
>
> Basically it just seems to be logging that it got a request, but there's
> no information about what it did with that request.
>
> So where do I go from here?  Is there somewhere else I can look to see why
> placement returned no candidates?
>
>
Hi again, Ben, hope you are enjoying your well-earned time off! :)

I've created a patch that (hopefully) will address some of the difficulty
that folks have had in diagnosing which parts of a request caused all
providers to be filtered out from the return of GET /allocation_candidates:

https://review.openstack.org/#/c/590041

This patch changes two primary things:

1) Query-splitting

The patch splits the existing monster SQL query, which looked up all providers
matching all requested resources, required traits, forbidden traits and
required aggregate associations in a single statement, into multiple queries,
one for each requested resource. While this does increase the number of
database queries executed for each call to GET /allocation_candidates, the
change gives much better visibility into which part of the request exhausted
the set of matching providers. We've benchmarked the new patch and shown that
the performance impact of doing 3 queries versus 1 (for a request of 3
resources -- VCPU, RAM and disk) is minimal: a few extra milliseconds of
execution time against a DB with 1K providers having inventory in all three
resource classes.
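
To make the mechanics concrete, here is a heavily simplified, untested sketch
of the idea: plain Python over in-memory data rather than the SQL in the
patch, and all names are invented for illustration. Capacity is computed the
way placement computes it, as (total - reserved) * allocation_ratio.

    def providers_with_capacity(inventories, resource_class, amount):
        """Return the provider ids whose inventory can satisfy `amount`."""
        matches = set()
        for provider_id, inv in inventories.items():
            record = inv.get(resource_class)
            if record is None:
                continue
            capacity = (record['total'] - record['reserved']) * record['allocation_ratio']
            if record['used'] + amount <= capacity:
                matches.add(provider_id)
        return matches

    def allocation_candidates(inventories, requested):
        candidates = None
        for rc, amount in requested.items():
            found = providers_with_capacity(inventories, rc, amount)
            print('%d providers have capacity for %d %s' % (len(found), amount, rc))
            candidates = found if candidates is None else candidates & found
            print('%d candidates remain after %s' % (len(candidates), rc))
        return candidates or set()

    inventories = {
        'compute1': {
            'VCPU': {'total': 8, 'reserved': 0, 'used': 2, 'allocation_ratio': 16.0},
            'MEMORY_MB': {'total': 4096, 'reserved': 512, 'used': 2048, 'allocation_ratio': 1.5},
            'DISK_GB': {'total': 100, 'reserved': 0, 'used': 90, 'allocation_ratio': 1.0},
        },
    }
    print(allocation_candidates(inventories, {'VCPU': 1, 'MEMORY_MB': 2048, 'DISK_GB': 20}))

With that sample data the DISK_GB step is the one that empties the candidate
set, which is exactly the kind of situation the new per-resource output is
meant to surface.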

2) Diagnostic logging output

The patch adds debug log output within each loop iteration, so there is now
logging output that shows how many matching providers were found for each
resource class involved in the request. The output looks like this in the
logs:

[req-2d30faa8-4190-4490-a91e-610045530140] inside VCPU request loop. before applying trait and aggregate filters, found 12 matching providers
[req-2d30faa8-4190-4490-a91e-610045530140] found 12 providers with capacity for the requested 1 VCPU.
[req-2d30faa8-4190-4490-a91e-610045530140] inside MEMORY_MB request loop. before applying trait and aggregate filters, found 9 matching providers
[req-2d30faa8-4190-4490-a91e-610045530140] found 9 providers with capacity for the requested 64 MEMORY_MB. before loop iteration we had 12 matches.
[req-2d30faa8-4190-4490-a91e-610045530140] RequestGroup(use_same_provider=False, resources={MEMORY_MB:64, VCPU:1}, traits=[], aggregates=[]) (suffix '') returned 9 matches

If a request includes required traits, forbidden traits or required aggregate
associations, there are additional log messages showing how many matching
providers were found after applying the trait or aggregate filtering set
operation. In other words, the log output shows the impact of the trait
filter or aggregate filter in much the same way that the existing
FilterScheduler logging shows the "before and after" impact that a particular
filter had on the request.
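
Conceptually that filtering step is just a set operation on the candidate
provider set, so the before/after counts fall out naturally. A rough,
hypothetical sketch (not the real placement code, names invented):

    import logging

    LOG = logging.getLogger(__name__)

    def apply_trait_filters(candidates, having_required_traits, having_forbidden_traits):
        # candidates: provider ids that survived the capacity checks
        before = len(candidates)
        filtered = (candidates & having_required_traits) - having_forbidden_traits
        LOG.debug('%d providers remain after trait filtering (%d before)',
                  len(filtered), before)
        return filtered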

Have a look at the patch in question and please feel free to add your
feedback and comments on ways this can be improved to meet your needs.
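
In the meantime, one thing you can do today is replay the request from your
placement log by hand and compare it against what each compute node is
actually reporting. Here is a rough, untested sketch using python-requests;
the resources string and microversion come straight from your log, while the
endpoint and the PLACEMENT_ENDPOINT/OS_TOKEN environment variable names are
just placeholders you'd adjust for your deployment (the endpoint is in
`openstack endpoint list`, and a token can come from `openstack token issue`):

    import os
    import requests

    PLACEMENT = os.environ['PLACEMENT_ENDPOINT']
    HEADERS = {
        'X-Auth-Token': os.environ['OS_TOKEN'],
        'OpenStack-API-Version': 'placement 1.25',
    }

    # Same request the scheduler made, per your log.
    resp = requests.get(
        PLACEMENT + '/allocation_candidates',
        headers=HEADERS,
        params={'limit': 1000, 'resources': 'DISK_GB:20,MEMORY_MB:2048,VCPU:1'})
    print(resp.json().get('allocation_requests'))

    # If that list is empty, compare against what each provider is reporting.
    rps = requests.get(PLACEMENT + '/resource_providers', headers=HEADERS).json()
    for rp in rps['resource_providers']:
        base = '%s/resource_providers/%s' % (PLACEMENT, rp['uuid'])
        inv = requests.get(base + '/inventories', headers=HEADERS).json()
        usage = requests.get(base + '/usages', headers=HEADERS).json()
        print(rp['name'], inv['inventories'], usage['usages'])

Incidentally, the "len: 53" in your log is consistent with an empty response
body (no allocation requests and no provider summaries), so placement really
is returning nothing for that request.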

Best,
-jay