[openstack-dev] [nova] How to debug no valid host failures with placement

Michael Glasgow michael.glasgow at oracle.com
Sat Aug 4 23:35:44 UTC 2018


On 8/2/2018 7:27 PM, Jay Pipes wrote:
> It's not an exception. It's normal course of events. NoValidHosts means 
> there were no compute nodes that met the requested resource amounts.

To clarify, I didn't mean a Python exception.  I concede that I 
should've chosen a better word for the type of object I have in mind.

> If a SELECT statement against an Oracle DB returns 0 rows, is that an 
> exception? No. Would an operator need to re-send the SELECT statement 
> with an EXPLAIN SELECT in order to get information about what indexes 
> were used to winnow the result set (to zero)? Yes. Either that, or the 
> operator would need to gradually re-execute smaller SELECT statements 
> containing fewer filters in order to determine which join or predicate 
> caused a result set to contain zero rows.
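
For concreteness, the "fewer filters" approach described in the quoted 
text might look something like the sketch below.  It builds the WHERE 
clause up one predicate at a time (equivalent to stripping filters off 
the full query) and reports the first predicate that leaves zero rows.  
The compute_nodes table, its columns, and the helper name are all 
hypothetical, made up purely for illustration; this is not placement's 
schema or how nova actually queries it.

```python
import sqlite3


def first_empty_predicate(conn, table, predicates):
    """Apply predicates cumulatively and count the surviving rows.

    Returns (culprit, counts): culprit is the first predicate that
    leaves zero rows (or None), counts is a list of (predicate, rows).
    """
    counts = []
    applied = []
    for pred in predicates:
        applied.append(pred)
        where = " AND ".join(f"({p})" for p in applied)
        # Table name and predicates are trusted literals in this sketch;
        # never build SQL this way from user input.
        sql = f"SELECT COUNT(*) FROM {table} WHERE {where}"
        rows = conn.execute(sql).fetchone()[0]
        counts.append((pred, rows))
        if rows == 0:
            return pred, counts
    return None, counts


# Hypothetical inventory table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compute_nodes ("
    "name TEXT, free_vcpus INTEGER, free_ram_mb INTEGER, free_disk_gb INTEGER)"
)
conn.executemany(
    "INSERT INTO compute_nodes VALUES (?, ?, ?, ?)",
    [
        ("node1", 8, 16384, 20),
        ("node2", 2, 65536, 500),
        ("node3", 16, 4096, 500),
    ],
)

predicates = ["free_vcpus >= 4", "free_ram_mb >= 8192", "free_disk_gb >= 100"]
culprit, counts = first_empty_predicate(conn, "compute_nodes", predicates)
for pred, rows in counts:
    print(f"{rows:3d} rows left after adding: {pred}")
print("first predicate yielding zero rows:", culprit)
```

Note that the result depends on the order in which the predicates are 
applied, which is part of why doing this by hand gets tedious.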

I'm not sure if this analogy fully appreciates the perspective of the 
operator.  You're correct of course that if you select on a db and the 
correct answer is zero rows, then zero rows is the right answer, 100% of 
the time.

What I thought we meant by "debugging no valid host failures", though, 
is the case where zero rows is *not* the right answer, and yet you're 
getting zero rows anyway.  In that case, yes, with an Oracle DB you 
would absolutely get an ORA-XXXXX exception, along with a trace file 
that told you where things went off the rails.  Which is exactly what 
we don't have here.

If I understand your perspective correctly, it's basically that 
placement is working as designed, so there's nothing more to do except 
pore over debug output.  Can we consider:

  (1) that might not always be true if there are bugs

  (2) even when it is technically true, from the user's perspective, I'd 
posit that a user rarely requests an instance with the express intent 
of not launching one.  If they're "debugging" this issue, it means 
there's a misconfiguration or some unexpected state that they have to 
go find.  So it is exceptional in that sense, and in the large majority 
of these cases either the operator or the user is going to need to know 
why the request failed.

I would love to hear from any large operators on the list whether they 
feel that "turn on debug and try again" is really acceptable here.  I'm 
not trying to be critical; I'm just convinced that once the cluster is 
of a certain size, that approach can start to become very expensive.

-- 
Michael Glasgow


