[openstack-dev] [nova] How to debug no valid host failures with placement
Michael Glasgow
michael.glasgow at oracle.com
Sat Aug 4 23:35:44 UTC 2018
On 8/2/2018 7:27 PM, Jay Pipes wrote:
> It's not an exception. It's normal course of events. NoValidHosts means
> there were no compute nodes that met the requested resource amounts.
To clarify, I didn't mean a python exception. I concede that I
should've chosen a better word for the type of object I have in mind.
> If a SELECT statement against an Oracle DB returns 0 rows, is that an
> exception? No. Would an operator need to re-send the SELECT statement
> with an EXPLAIN SELECT in order to get information about what indexes
> were used to winnow the result set (to zero)? Yes. Either that, or the
> operator would need to gradually re-execute smaller SELECT statements
> containing fewer filters in order to determine which join or predicate
> caused a result set to contain zero rows.
I'm not sure this analogy fully captures the operator's perspective.
You're correct, of course, that if you select on a db and the correct
answer is zero rows, then zero rows is the right answer, 100% of the
time.
What I thought we meant by "debugging no valid host failures" is the
case where zero rows is *not* the right answer, and yet you're getting
zero rows anyway. So yes, absolutely, with an Oracle DB you would get an
ORA-XXXXX exception in that case, along with a trace file that tells you
where things went off the rails. Which is exactly what we don't have
here.
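
To make "told you where things went off the rails" concrete, here is a
minimal sketch of the kind of per-filter accounting I have in mind. It
is purely illustrative: the host records, the filter names, and the
select_hosts/explain helpers are all hypothetical, and this is not how
nova or placement actually selects candidates. It only shows that
recording how many candidates survive each predicate is enough to name
the one that emptied the result set, without re-running the request.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration; not nova or placement objects.
Host = Dict[str, int]
Filter = Tuple[str, Callable[[Host], bool]]
Trace = List[Tuple[str, int, int]]


def select_hosts(hosts: List[Host], filters: List[Filter]) -> Tuple[List[Host], Trace]:
    """Apply filters in order; return survivors plus a per-filter trace."""
    trace: Trace = []
    candidates = list(hosts)
    for name, predicate in filters:
        before = len(candidates)
        candidates = [h for h in candidates if predicate(h)]
        # Record how many candidates went in and how many came out.
        trace.append((name, before, len(candidates)))
    return candidates, trace


def explain(trace: Trace) -> Optional[str]:
    """Name the first filter that reduced a non-empty set to zero."""
    for name, before, after in trace:
        if before > 0 and after == 0:
            return name
    return None


# Made-up inventory and request, for illustration only.
hosts = [
    {"vcpus_free": 8, "ram_mb_free": 2048},
    {"vcpus_free": 2, "ram_mb_free": 16384},
]
filters = [
    ("vcpus>=4", lambda h: h["vcpus_free"] >= 4),
    ("ram_mb>=4096", lambda h: h["ram_mb_free"] >= 4096),
]

selected, trace = select_hosts(hosts, filters)
culprit = explain(trace)
```

In this made-up case no single host is short on everything; the two
requirements together are what leave nothing, and the trace names the
predicate at which the last candidate dropped out.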
If I understand your perspective correctly, it's basically that
placement is working as designed, so there's nothing more to do except
pore over debug output. Can we consider:
(1) that this might not always be true, if there are bugs
(2) that even when it is technically true, from the user's perspective
it's rare that anyone requests an instance with the express intent of
not launching one. If they're "debugging" this issue, it means there's
a misconfiguration or some unexpected state that they have to go find.
So it is exceptional in that sense, and either the operator or the user
is going to need to know why the request failed in the large majority
of these cases.
I would love to hear from any large operators on the list whether they
feel that "turn on debug and try again" is really acceptable here. I'm
not trying to be critical; I'm just convinced that once the cluster is
of a certain size, that approach can start to become very expensive.
--
Michael Glasgow