[Openstack-operators] [openstack-operators][nova] Verbosity of nova scheduler

Alec Hothan (ahothan) ahothan at cisco.com
Wed Jan 10 19:49:12 UTC 2018


Matt,

Thanks for clarifying the logs; the older release I was using did not have much information in the scheduler log. I’ll double check on the Newton release to see what they look like.
As you mention, a simple pass/fail result may not always explain why a request fails, but it is definitely good to know which filter failed.
I still think that a VM failing to launch should result in one ERROR log rather than one INFO log. In the example you provide, it is fine for race conditions to result in a rejection and an ERROR log.

The main problem is that the nova API does not return sufficient detail on the reason for a NoValidHost, and perhaps that should be fixed at that level. Extending the API to return a reason field (a JSON dict populated by the various filters with more meaningful, filter-specific info) would help tremendously: no more need to go through the log to find out why.
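
Purely as an illustration (none of these field names exist in the
current API; they are made up just to show the idea), such a reason
payload could look something like this:

    # Hypothetical example of a richer NoValidHost reason returned by the API;
    # the field names and the NUMATopologyFilter detail text are invented here.
    no_valid_host_reason = {
        "rejected_by": "NUMATopologyFilter",
        "filter_results": [
            {"filter": "RetryFilter", "hosts_in": 10, "hosts_out": 10},
            {"filter": "AvailabilityZoneFilter", "hosts_in": 10, "hosts_out": 10},
            {"filter": "NUMATopologyFilter", "hosts_in": 10, "hosts_out": 0,
             "detail": "no host can fit 4 dedicated pCPUs on one NUMA node"},
        ],
    }

Each filter would fill in its own "detail" when it rejects the last
remaining hosts, so the API client gets the why without any log access.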

Regards,

  Alec



From: Matt Riedemann <mriedemos at gmail.com>
Date: Wednesday, January 10, 2018 at 11:33 AM
To: "openstack-operators at lists.openstack.org" <openstack-operators at lists.openstack.org>
Subject: Re: [Openstack-operators] [openstack-operators][nova] Verbosity of nova scheduler

On 1/10/2018 1:15 AM, Alec Hothan (ahothan) wrote:
+1 on the “no valid host found”, this one should be at the very top of
the to-be-fixed list.
Very difficult to troubleshoot filters in lab testing (let alone in
production) as there can be many of them. This will get worse with more
NFV optimization filters: with so many combinations it gets really
complex to debug when a VM cannot be launched with NFV optimized
flavors. With the scheduler filtering engine, there should be a
systematic way to log the reason for not finding a valid host; at the
very least the error should state which filter failed, as an ERROR (and
not as DEBUG).
We really need to avoid deploying with DEBUG log level, but
unfortunately this is the only way to troubleshoot. Too many
debug-level logs are for development debugging (pretty much useless in
any other circumstance; the developer forgot to remove them before
committing the feature), and many errors that should be logged as
ERROR have been logged as DEBUG only.

The scheduler logs do print which filter returned 0 hosts for a given
request at INFO level.

For example, I was debugging this NoValidHost failure in a CI run:

http://logs.openstack.org/20/531020/3/check/tempest-full/1287dde/job-output.txt.gz#_2018-01-05_17_39_54_336999

And tracing the request ID to the scheduler logs, and filtering on just
INFO level logging to simulate what you'd have for the default in
production, I found the filter that kicked it out here:

http://logs.openstack.org/20/531020/3/check/tempest-full/1287dde/controller/logs/screen-n-sch.txt.gz?level=INFO#_Jan_05_17_00_30_564582

And there is a summary like this:

Jan 05 17:00:30.565996 ubuntu-xenial-infracloud-chocolate-0001705073
nova-scheduler[8932]: INFO nova.filters [None
req-737984ae-3ae8-4506-a5f9-6655a4ebf206
tempest-ServersAdminTestJSON-787960229
tempest-ServersAdminTestJSON-787960229] Filtering removed all hosts for
the request with instance ID '8ae8dc23-8f3b-4f0f-8775-2dcc2a5fc75b'.
Filter results: ['RetryFilter: (start: 1, end: 1)',
'AvailabilityZoneFilter: (start: 1, end: 1)', 'ComputeFilter: (start: 1,
end: 1)', 'ComputeCapabilitiesFilter: (start: 1, end: 1)',
'ImagePropertiesFilter: (start: 1, end: 1)',
'ServerGroupAntiAffinityFilter: (start: 1, end: 1)',
'ServerGroupAffinityFilter: (start: 1, end: 1)', 'SameHostFilter:
(start: 1, end: 0)']

"end: 0" means that's the filter that rejected the request. Without
digging into the actual code, the descriptions for the filters is here:

https://docs.openstack.org/nova/latest/user/filter-scheduler.html
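
If you want to pull the rejecting filter out of that summary line
programmatically, a rough sketch (assuming the "Filter results" format
shown above, which may change between releases) is:

    import re

    # A "Filter results" payload as printed by nova.filters at INFO level.
    line = ("Filter results: ['RetryFilter: (start: 1, end: 1)', "
            "'ComputeFilter: (start: 1, end: 1)', "
            "'SameHostFilter: (start: 1, end: 0)']")

    # Each entry is "<FilterName>: (start: N, end: M)"; the filter whose
    # "end" count drops to 0 is the one that rejected the request.
    for name, start, end in re.findall(r"'(\w+): \(start: (\d+), end: (\d+)\)'", line):
        if int(end) == 0:
            print("Rejected by %s (hosts before this filter: %s)" % (name, start))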

Now, just why this request failed requires a bit of understanding of
what my environment looks like (this CI run is using the
CachingScheduler), so there isn't a simple "ERROR: SameHostFilter
rejected the request because you're using the CachingScheduler, which
is racy by design, and you created the instances in separate requests".
You'd have a ton of false negative ERRORs in the logs because of valid
reasons for rejecting a request based on the current state of the
system, which would make debugging real issues that much harder.

--

Thanks,

Matt

