[Openstack-operators] [openstack-operators][nova] Verbosity of nova scheduler
Matt Riedemann
mriedemos at gmail.com
Wed Jan 10 19:30:57 UTC 2018
On 1/10/2018 1:15 AM, Alec Hothan (ahothan) wrote:
> +1 on the “no valid host found”, this one should be at the very top of
> the to-be-fixed list.
>
> Very difficult to troubleshoot filters in lab testing (let alone in
> production) as there can be many of them. This will get worst with more
> NFV optimization filters with so many combinations it gets really
> complex to debug when a VM cannot be launched with NFV optimized
> flavors. With the scheduler filtering engine, there should be a
> systematic way to log the reason for not finding a valid host - at the
> very least the error should display which filter failed as an ERROR (and
> not as DEBUG).
>
> We really need to avoid deploying with DEBUG log level but unfortunately
> this is the only way to troubleshoot. Too many debug-level logs are for
> development debug (meaning pretty much useless in any circumstances –
> developer forgot to remove before commit of the feature), many errors
> that should be logged as ERROR but have been logged as DEBUG only.
>
The scheduler logs do print which filter returned 0 hosts for a given
request at INFO level.
For example, I was debugging this NoValidHost failure in a CI run:
http://logs.openstack.org/20/531020/3/check/tempest-full/1287dde/job-output.txt.gz#_2018-01-05_17_39_54_336999
And tracing the request ID to the scheduler logs, and filtering on just
INFO level logging to simulate what you'd have for the default in
production, I found the filter that kicked it out here:
http://logs.openstack.org/20/531020/3/check/tempest-full/1287dde/controller/logs/screen-n-sch.txt.gz?level=INFO#_Jan_05_17_00_30_564582
And there is a summary like this:
Jan 05 17:00:30.565996 ubuntu-xenial-infracloud-chocolate-0001705073
nova-scheduler[8932]: INFO nova.filters [None
req-737984ae-3ae8-4506-a5f9-6655a4ebf206
tempest-ServersAdminTestJSON-787960229
tempest-ServersAdminTestJSON-787960229] Filtering removed all hosts for
the request with instance ID '8ae8dc23-8f3b-4f0f-8775-2dcc2a5fc75b'.
Filter results: ['RetryFilter: (start: 1, end: 1)',
'AvailabilityZoneFilter: (start: 1, end: 1)', 'ComputeFilter: (start: 1,
end: 1)', 'ComputeCapabilitiesFilter: (start: 1, end: 1)',
'ImagePropertiesFilter: (start: 1, end: 1)',
'ServerGroupAntiAffinityFilter: (start: 1, end: 1)',
'ServerGroupAffinityFilter: (start: 1, end: 1)', 'SameHostFilter:
(start: 1, end: 0)']
"end: 0" means that's the filter that rejected the request. Without
digging into the actual code, the descriptions for the filters is here:
https://docs.openstack.org/nova/latest/user/filter-scheduler.html
Now just why this request failed requires a bit of understanding of why
my environment looks like (this CI run is using the CachingScheduler),
so there isn't a simple "ERROR: SameHostFilter rejected request because
you're using the CachingScheduler which is racy by design and you
created the instances in separate requests". You'd have a ton of false
negative ERRORs in the logs because of valid reasons for rejected a
request based on the current state of the system, which is going to make
debugging real issues that much harder.
--
Thanks,
Matt
More information about the OpenStack-operators
mailing list