[openstack-dev] [nova] How to debug no valid host failures with placement
Jay Pipes, jaypipes at gmail.com
Fri Aug  3 00:27:22 UTC 2018

On 08/02/2018 06:18 PM, Michael Glasgow wrote:
> On 08/02/18 15:04, Chris Friesen wrote:
>> On 08/02/2018 01:04 PM, melanie witt wrote:
>>
>>> The problem is an infamous one, which is, your users are trying to boot
>>> instances and they get "No Valid Host" and an instance in ERROR 
>>> state. They contact support, and now support is trying to determine 
>>> why NoValidHost happened. In the past, they would turn on DEBUG log 
>>> level on the nova-scheduler, try another request, and take a look at 
>>> the scheduler logs.
>>
>> At a previous Summit[1] there were some operators that said they just 
>> always ran nova-scheduler with debug logging enabled in order to deal 
>> with this issue, but that it was a pain [...]
> 
> I would go a bit further and say it's likely to be unacceptable on a 
> large cluster.  It's expensive to deal with all those logs and to 
> manually comb through them for troubleshooting this issue type, which 
> can happen frequently with some setups.  Secondarily there are 
> performance and security concerns with leaving debug on all the time.
> 
> As to "defining the problem", I think it's what Melanie said.  It's 
> about asking for X and the system saying, "sorry, can't give you X" with 
> no further detail or even means of discovering it.
> 
> More generally, any time a service fails to deliver a resource which it 
> is primarily designed to deliver, it seems to me at this stage that 
> should probably be taken a bit more seriously than just "check the log 
> file, maybe there's something in there?"  From the user's perspective, 
> if nova fails to produce an instance, or cinder fails to produce a 
> volume, or neutron fails to build a subnet, that's kind of a big deal, 
> right?
> 
> In such cases, would it be possible to generate a detailed exception 
> object which contains all the necessary info to ascertain why that 
> specific failure occurred?

It's not an exception. It's the normal course of events. NoValidHost means
there were no compute nodes that met the requested resource amounts.

There are plenty of ways the operator can get usage and trait information
and determine whether there are providers that meet the requested amounts
and required/forbidden traits.
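
One such way is to ask the placement REST API directly. The sketch below
only builds the query URL for GET /allocation_candidates. The helper name,
the endpoint URL, the amounts and the trait names are made up for
illustration, and it assumes a placement microversion that accepts
"!"-prefixed forbidden traits in the "required" parameter (1.22, as far as
I know).

```python
from urllib.parse import urlencode


def allocation_candidates_url(base_url, resources, required=(), forbidden=()):
    """Build a GET /allocation_candidates URL (hypothetical helper).

    resources: dict of resource class -> amount, e.g. {"VCPU": 2}
    required:  traits a provider must have
    forbidden: traits a provider must not have (sent with a "!" prefix)
    """
    params = {
        # Placement expects "CLASS:AMOUNT" pairs separated by commas.
        "resources": ",".join(
            f"{rc}:{amount}" for rc, amount in sorted(resources.items())
        ),
    }
    traits = list(required) + [f"!{trait}" for trait in forbidden]
    if traits:
        params["required"] = ",".join(traits)
    # Keep ":", "," and "!" unescaped so the URL stays readable.
    query = urlencode(params, safe=":,!")
    return f"{base_url.rstrip('/')}/allocation_candidates?{query}"


# Example values only; none of these come from a real deployment.
url = allocation_candidates_url(
    "http://placement.example/",
    {"VCPU": 2, "MEMORY_MB": 2048},
    required=["HW_CPU_X86_AVX2"],
    forbidden=["CUSTOM_MAINTENANCE"],
)
print(url)
```

Sending the request still needs a keystone token and an
"OpenStack-API-Version: placement 1.22" header. An empty
"allocation_requests" list in the response means no provider satisfied the
amounts and traits as given.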

What we're talking about here is debugging information, plain and simple.

If a SELECT statement against an Oracle DB returns 0 rows, is that an
exception? No. Would an operator need to re-send the SELECT statement
with an EXPLAIN SELECT in order to get information about what indexes
were used to winnow the result set (to zero)? Yes. Either that, or the
operator would need to gradually re-execute smaller SELECT statements
containing fewer filters in order to determine which join or predicate
caused a result set to contain zero rows.
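
The second approach, re-running the query with fewer and fewer filters, can
be sketched roughly as follows. This uses an in-memory SQLite database
instead of Oracle, and the "hosts" table, its columns and the predicates
are invented for the example. It is not how nova or placement does
anything.

```python
import sqlite3


def find_zeroing_predicate(conn, table, predicates):
    """Drop trailing predicates until the query returns rows.

    Returns the last predicate that had to be dropped, or None if the full
    query already returns rows. The predicates are trusted literals here;
    do not build SQL this way from user input.
    """
    remaining = list(predicates)
    dropped = None
    while remaining:
        where = " AND ".join(remaining)
        (count,) = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE {where}"
        ).fetchone()
        if count > 0:
            return dropped
        # Still zero rows: remove the last filter and try again.
        dropped = remaining.pop()
    return dropped


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hosts (name TEXT, vcpus INTEGER, ram_mb INTEGER, has_ssd INTEGER)"
)
conn.executemany(
    "INSERT INTO hosts VALUES (?, ?, ?, ?)",
    [("c1", 8, 16384, 0), ("c2", 16, 32768, 0), ("c3", 4, 8192, 1)],
)

culprit = find_zeroing_predicate(
    conn, "hosts", ["vcpus >= 8", "ram_mb >= 16384", "has_ssd = 1"]
)
print(culprit)  # has_ssd = 1
```

Each iteration is another round trip to the database, which is the same
cost the operator pays when re-sending a request just to collect debugging
output.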

That's exactly what we're talking about here. It's not an exception.
It's debugging information.

Best,
-jay
    
    