[openstack-dev] [nova] How to debug no valid host failures with placement

Jay Pipes jaypipes at gmail.com
Mon Aug 6 13:48:34 UTC 2018


On 08/04/2018 07:35 PM, Michael Glasgow wrote:
> On 8/2/2018 7:27 PM, Jay Pipes wrote:
>> It's not an exception. It's normal course of events. NoValidHosts 
>> means there were no compute nodes that met the requested resource 
>> amounts.
> 
> To clarify, I didn't mean a python exception.

Neither did I. I was referring to exceptional behaviour, not a Python 
exception.

> I concede that I should've chosen a better word for the type of
> object I have in mind.
> 
>> If a SELECT statement against an Oracle DB returns 0 rows, is that an 
>> exception? No. Would an operator need to re-send the SELECT statement 
>> with an EXPLAIN SELECT in order to get information about what indexes 
>> were used to winnow the result set (to zero)? Yes. Either that, or the 
>> operator would need to gradually re-execute smaller SELECT statements 
>> containing fewer filters in order to determine which join or predicate 
>> caused a result set to contain zero rows.
> 
> I'm not sure if this analogy fully appreciates the perspective of the 
> operator.  You're correct of course that if you select on a db and the 
> correct answer is zero rows, then zero rows is the right answer, 100% of 
> the time.
> 
> Whereas what I thought we meant when we talk about "debugging no valid 
> host failures" is that zero rows is *not* the right answer, and yet 
> you're getting zero rows anyway.

No, "debugging no valid host failures" doesn't mean that zero rows is 
the wrong answer. It means "find out why Nova thinks there's nowhere 
that my instance will fit".

> So yes, absolutely with an Oracle DB you would get an ORA-XXXXX
> exception in that case, along with a trace file that told you where
> things went off the rails.  Which is exactly what we don't have
> here.

That is precisely the opposite of what I was saying. Again, getting no 
results is *not* an error. It's normal behaviour and indicates there 
were no compute hosts that met the requirements of the request. This is 
not an error or exceptional behaviour. It's simply the result of a query 
against the placement database.

If you get zero rows returned, that means you need to determine what 
part of your request caused the winnowed result set to go from >0 rows 
to 0 rows.

And what we've been discussing is exactly the process by which such an 
investigation could be done. There are two options: do the investigation 
*inline* as part of the original request or do it *offline* after the 
original request returns 0 rows.

Doing it inline means splitting the large query we currently construct 
into multiple queries (for each related group of requested resources 
and/or traits) and logging the number of results grabbed for each of 
those queries.
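
As a rough illustration of the inline option, here is a toy sketch in 
Python. It is not Nova or placement code: the Provider class, winnow() 
and first_empty_stage() are hypothetical names, and the in-memory lists 
stand in for what would really be separate database queries. It filters 
one requirement at a time and records how many providers survive each 
step.

```python
import logging
from dataclasses import dataclass

LOG = logging.getLogger("winnow_sketch")


@dataclass(frozen=True)
class Provider:
    # Hypothetical stand-in for one compute node's free inventory and traits.
    name: str
    free: dict
    traits: frozenset = frozenset()


def winnow(providers, resources, required_traits=()):
    """Filter providers one requirement at a time.

    Returns (candidates, stages), where stages is a list of
    (requirement description, number of providers still remaining).
    """
    candidates = list(providers)
    stages = [("all providers", len(candidates))]
    for rc, amount in resources.items():
        candidates = [p for p in candidates if p.free.get(rc, 0) >= amount]
        stages.append((f"{rc}>={amount}", len(candidates)))
    for trait in required_traits:
        candidates = [p for p in candidates if trait in p.traits]
        stages.append((f"trait:{trait}", len(candidates)))
    # One log line per stage, so the drop to zero is visible in the log.
    for desc, count in stages:
        LOG.debug("after %s: %d providers", desc, count)
    return candidates, stages


def first_empty_stage(stages):
    """Return the first requirement that left zero providers, or None."""
    for desc, count in stages:
        if count == 0:
            return desc
    return None
```

With per-stage counts in the log, the operator can see which part of 
the request took the result set to zero, as of the moment the request 
actually ran.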

Doing it offline means developing some diagnostic tool that an operator 
could run (similar to what Windriver did with [1]). The issue with that 
is that the diagnostic tool can only represent the resource usage at the 
time the diagnostic tool was run, not when the original request that 
returned 0 rows ran.

[1] 
https://github.com/starlingx-staging/stx-nova/commit/71acfeae0d1c59fdc77704527d763bd85a276f9a#diff-94f87e728df6465becce5241f3da53c8R330
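
To make the offline option concrete, here is a minimal sketch in 
Python. It is not the tool in [1] and uses no Nova or placement API: 
the snapshot dict and the function names are hypothetical. It takes a 
point-in-time snapshot of free resources and traits per host, drops one 
requirement at a time, and reports which single requirement, if 
removed, would let the request fit somewhere.

```python
def matches(free, traits, resources, required_traits):
    """True if one host satisfies every resource amount and every trait."""
    return (all(free.get(rc, 0) >= amt for rc, amt in resources.items())
            and set(required_traits) <= set(traits))


def count_hosts(snapshot, resources, required_traits):
    """Count hosts in the snapshot that satisfy the whole request."""
    return sum(1 for free, traits in snapshot.values()
               if matches(free, traits, resources, required_traits))


def blocking_requirements(snapshot, resources, required_traits=()):
    """Return requirements whose removal alone makes the result non-empty.

    snapshot maps host name -> (free resources dict, set of traits) and is
    an assumption of this sketch: it reflects usage when the tool runs,
    not when the original request ran.
    """
    if count_hosts(snapshot, resources, required_traits) > 0:
        return []
    blockers = []
    for rc in resources:
        relaxed = {k: v for k, v in resources.items() if k != rc}
        if count_hosts(snapshot, relaxed, required_traits) > 0:
            blockers.append(rc)
    for trait in required_traits:
        relaxed = [t for t in required_traits if t != trait]
        if count_hosts(snapshot, resources, relaxed) > 0:
            blockers.append(trait)
    return blockers
```

An empty list from blocking_requirements() on a failing request means 
no single requirement is responsible by itself; two or more would have 
to be relaxed together.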

> If I understand your perspective correctly, it's basically that 
> placement is working as designed, so there's nothing more to do except 
> pore over debug output.  Can we consider:
> 
>   (1) that might not always be true if there are bugs

Bugs in the placement service are an entirely separate issue. They do 
occur, of course, but we're not talking about that here.

>   (2) even when it is technically true, from the user's perspective, I'd 
> posit that it's rare that a user requests an instance with the express 
> intent of not launching an instance. (?)  If they're "debugging" this 
> issue, it means there's a misconfiguration or some unexpected state that 
> they have to go find.

Depends on what you have in mind as a "user". If I launch an instance in 
an AWS region, I'd be very surprised if the service told me there was 
nowhere to place my instance unless of course I'd asked it to launch an 
instance with requirements that exceeded AWS' ability to launch.

If you're talking about a user of a private IT cloud with a single rack 
of compute hosts, that user might very well expect to see a return of 
"sorry mate, there's nowhere to put your request right now."

There is no explicit or implicit SLA or guarantee that Nova needs to 
somehow create a place to put an instance when no such place exists to 
put the instance.

Best,
-jay


