[Openstack-operators] [openstack-operators][nova] Verbosity of nova scheduler

Matt Riedemann mriedemos at gmail.com
Wed Jan 10 19:30:57 UTC 2018


On 1/10/2018 1:15 AM, Alec Hothan (ahothan) wrote:
> +1 on the “no valid host found”, this one should be at the very top of 
> the to-be-fixed list.
> 
> Very difficult to troubleshoot filters in lab testing (let alone in 
> production) as there can be many of them. This will get worst with more 
> NFV optimization filters with so many combinations it gets really 
> complex to debug when a VM cannot be launched with NFV optimized 
> flavors. With the scheduler filtering engine, there should be a 
> systematic way to log the reason for not finding a valid host - at the 
> very least the error should display which filter failed as an ERROR (and 
> not as DEBUG).
> 
> We really need to avoid deploying with DEBUG log level but unfortunately 
> this is the only way to troubleshoot. Too many debug-level logs are for 
> development debug (meaning pretty much useless in any circumstances – 
> developer forgot to remove before commit of the feature), many errors 
> that should be logged as ERROR but have been logged as DEBUG only.
> 

The scheduler logs do print which filter returned 0 hosts for a given 
request at INFO level.

For example, I was debugging this NoValidHost failure in a CI run:

http://logs.openstack.org/20/531020/3/check/tempest-full/1287dde/job-output.txt.gz#_2018-01-05_17_39_54_336999

And tracing the request ID to the scheduler logs, and filtering on just 
INFO level logging to simulate what you'd have for the default in 
production, I found the filter that kicked it out here:

http://logs.openstack.org/20/531020/3/check/tempest-full/1287dde/controller/logs/screen-n-sch.txt.gz?level=INFO#_Jan_05_17_00_30_564582

And there is a summary like this:

Jan 05 17:00:30.565996 ubuntu-xenial-infracloud-chocolate-0001705073 
nova-scheduler[8932]: INFO nova.filters [None 
req-737984ae-3ae8-4506-a5f9-6655a4ebf206 
tempest-ServersAdminTestJSON-787960229 
tempest-ServersAdminTestJSON-787960229] Filtering removed all hosts for 
the request with instance ID '8ae8dc23-8f3b-4f0f-8775-2dcc2a5fc75b'. 
Filter results: ['RetryFilter: (start: 1, end: 1)', 
'AvailabilityZoneFilter: (start: 1, end: 1)', 'ComputeFilter: (start: 1, 
end: 1)', 'ComputeCapabilitiesFilter: (start: 1, end: 1)', 
'ImagePropertiesFilter: (start: 1, end: 1)', 
'ServerGroupAntiAffinityFilter: (start: 1, end: 1)', 
'ServerGroupAffinityFilter: (start: 1, end: 1)', 'SameHostFilter: 
(start: 1, end: 0)']

"end: 0" means that's the filter that rejected the request. Without 
digging into the actual code, the descriptions for the filters is here:

https://docs.openstack.org/nova/latest/user/filter-scheduler.html

Now just why this request failed requires a bit of understanding of why 
my environment looks like (this CI run is using the CachingScheduler), 
so there isn't a simple "ERROR: SameHostFilter rejected request because 
you're using the CachingScheduler which is racy by design and you 
created the instances in separate requests". You'd have a ton of false 
negative ERRORs in the logs because of valid reasons for rejected a 
request based on the current state of the system, which is going to make 
debugging real issues that much harder.

-- 

Thanks,

Matt



More information about the OpenStack-operators mailing list