Open Stack

Thu Aug 2 14:55:03 UTC 2018

On 08/02/2018 04:10 AM, Chris Dent wrote:

> When people ask for something like what Chris mentioned:
>
>      hosts with enough CPU: <list1>
>      hosts that also have enough disk: <list2>
>      hosts that also have enough memory: <list3>
>      hosts that also meet extra spec host aggregate keys: <list 4>
>      hosts that also meet image properties host aggregate keys: <list 5>
>      hosts that also have requested PCI devices: <list 6>
>
> What are the operational questions that people are trying to answer
> with those results? Is the idea to be able to have some insight into
> the resource usage and reporting on and from the various hosts and
> discover that things are being used differently than thought? Is
> placement a resource monitoring tool, or is it more simple and
> focused than that? Or is it that we might have flavors or other
> resource requesting constraints that have bad logic and we want to
> see at what stage the failure is?  I don't know and I haven't really
> seen it stated explicitly here, and knowing it would help.
>
> Do people want info like this for requests as they happen, or to be
> able to go back later and try the same request again with some flag
> on that says: "diagnose what happened"?
>
> Or to put it another way: Before we design something that provides
> the information above, which is a solution to an undescribed
> problem, can we describe the problem more completely first to make
> sure that what solution we get is the right one. The thing above,
> that set of information, is context free.

My organization added extra failure-case logging to the pre-placement
scheduler because we were enabling complex features (CPU pinning, hugepages,
PCI, SR-IOV, CPU model requests, NUMA topology, etc.) and running into
scheduling failures, and people were asking "why did this scheduler request
fail to find a valid host?".

There are a few reasons we might want to ask this question, including:

1) double-checking the scheduler is working properly when first using additional 
features
2) weeding out images/flavors with excessive or mutually contradictory constraints
3) determining whether the cluster needs to be reconfigured to meet user 
requirements

I suspect that something like "do the same request again with a debug flag" 
would cover many scenarios.  I suspect its main weakness would be dealing with 
contention between short-lived entities.
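
To make the "same request again with a debug flag" idea concrete, here is a minimal sketch in Python. It is illustrative only: `select_hosts`, `FILTERS`, and the host fields (`free_vcpus`, `free_disk_gb`, `free_ram_mb`) are hypothetical names, not nova or placement APIs, and a real scheduler would filter on far more than three resources. The filters run in order, and with the flag set the surviving hosts are recorded after each stage, which gives the kind of per-stage list quoted above.

```python
from typing import Callable, Dict, List, Tuple

Host = Dict[str, int]
Request = Dict[str, int]
Filter = Tuple[str, Callable[[Host, Request], bool]]

# Hypothetical filters; stage names and host fields are made up for illustration.
FILTERS: List[Filter] = [
    ("cpu", lambda h, r: h["free_vcpus"] >= r["vcpus"]),
    ("disk", lambda h, r: h["free_disk_gb"] >= r["disk_gb"]),
    ("memory", lambda h, r: h["free_ram_mb"] >= r["ram_mb"]),
]


def select_hosts(
    hosts: Dict[str, Host], request: Request, debug: bool = False
) -> Tuple[List[str], List[Tuple[str, List[str]]]]:
    """Apply each filter in order; with debug=True, record survivors per stage."""
    trace: List[Tuple[str, List[str]]] = []
    remaining = dict(hosts)
    for name, passes in FILTERS:
        remaining = {n: h for n, h in remaining.items() if passes(h, request)}
        if debug:
            trace.append((name, sorted(remaining)))
        if not remaining:
            break  # nothing left to filter; this stage is where the request failed
    return sorted(remaining), trace


hosts = {
    "host1": {"free_vcpus": 8, "free_disk_gb": 100, "free_ram_mb": 2048},
    "host2": {"free_vcpus": 2, "free_disk_gb": 500, "free_ram_mb": 16384},
    "host3": {"free_vcpus": 16, "free_disk_gb": 20, "free_ram_mb": 32768},
}
request = {"vcpus": 4, "disk_gb": 50, "ram_mb": 4096}

selected, trace = select_hosts(hosts, request, debug=True)
for stage, survivors in trace:
    print(f"hosts that also pass {stage}: {survivors}")
```

In this made-up data set no host survives the memory stage, so the trace points at memory as the place the request ran out of candidates. Note that a replay like this only sees the state of the hosts at replay time, which is why contention between short-lived entities remains the weak spot.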

Chris

[openstack-dev] [nova] How to debug no valid host failures with placement
