[Openstack-operators] [openstack-operators][nova] Verbosity of nova scheduler
mriedemos at gmail.com
Wed Jan 10 19:30:57 UTC 2018
On 1/10/2018 1:15 AM, Alec Hothan (ahothan) wrote:
> +1 on the “no valid host found”, this one should be at the very top of
> the to-be-fixed list.
> Very difficult to troubleshoot filters in lab testing (let alone in
> production) as there can be many of them. This will get worst with more
> NFV optimization filters with so many combinations it gets really
> complex to debug when a VM cannot be launched with NFV optimized
> flavors. With the scheduler filtering engine, there should be a
> systematic way to log the reason for not finding a valid host - at the
> very least the error should display which filter failed as an ERROR (and
> not as DEBUG).
> We really need to avoid deploying with DEBUG log level but unfortunately
> this is the only way to troubleshoot. Too many debug-level logs are for
> development debug (meaning pretty much useless in any circumstances –
> developer forgot to remove before commit of the feature), many errors
> that should be logged as ERROR but have been logged as DEBUG only.
The scheduler logs do print which filter returned 0 hosts for a given
request at INFO level.
For example, I was debugging this NoValidHost failure in a CI run:
And tracing the request ID to the scheduler logs, and filtering on just
INFO level logging to simulate what you'd have for the default in
production, I found the filter that kicked it out here:
And there is a summary like this:
Jan 05 17:00:30.565996 ubuntu-xenial-infracloud-chocolate-0001705073
nova-scheduler: INFO nova.filters [None
tempest-ServersAdminTestJSON-787960229] Filtering removed all hosts for
the request with instance ID '8ae8dc23-8f3b-4f0f-8775-2dcc2a5fc75b'.
Filter results: ['RetryFilter: (start: 1, end: 1)',
'AvailabilityZoneFilter: (start: 1, end: 1)', 'ComputeFilter: (start: 1,
end: 1)', 'ComputeCapabilitiesFilter: (start: 1, end: 1)',
'ImagePropertiesFilter: (start: 1, end: 1)',
'ServerGroupAntiAffinityFilter: (start: 1, end: 1)',
'ServerGroupAffinityFilter: (start: 1, end: 1)', 'SameHostFilter:
(start: 1, end: 0)']
"end: 0" means that's the filter that rejected the request. Without
digging into the actual code, the descriptions for the filters is here:
Now just why this request failed requires a bit of understanding of why
my environment looks like (this CI run is using the CachingScheduler),
so there isn't a simple "ERROR: SameHostFilter rejected request because
you're using the CachingScheduler which is racy by design and you
created the instances in separate requests". You'd have a ton of false
negative ERRORs in the logs because of valid reasons for rejected a
request based on the current state of the system, which is going to make
debugging real issues that much harder.
More information about the OpenStack-operators