[openstack-dev] [nova] How to debug no valid host failures with placement
Jay Pipes
jaypipes at gmail.com
Mon Aug 6 13:48:34 UTC 2018
On 08/04/2018 07:35 PM, Michael Glasgow wrote:
> On 8/2/2018 7:27 PM, Jay Pipes wrote:
>> It's not an exception. It's normal course of events. NoValidHosts
>> means there were no compute nodes that met the requested resource
>> amounts.
>
> To clarify, I didn't mean a python exception.
Neither did I. I was referring to exceptional behaviour, not a Python
exception.
> I concede that I should've chosen a better word for the type of
> object I have in mind.
>
>> If a SELECT statement against an Oracle DB returns 0 rows, is that an
>> exception? No. Would an operator need to re-send the SELECT statement
>> with an EXPLAIN SELECT in order to get information about what indexes
>> were used to winnow the result set (to zero)? Yes. Either that, or the
>> operator would need to gradually re-execute smaller SELECT statements
>> containing fewer filters in order to determine which join or predicate
>> caused a result set to contain zero rows.
>
> I'm not sure if this analogy fully appreciates the perspective of the
> operator. You're correct of course that if you select on a db and the
> correct answer is zero rows, then zero rows is the right answer, 100% of
> the time.
>
> Whereas what I thought we meant when we talk about "debugging no valid
> host failures" is that zero rows is *not* the right answer, and yet
> you're getting zero rows anyway.
No, "debugging no valid host failures" doesn't mean that zero rows is
the wrong answer. It means "find out why Nova thinks there's nowhere
that my instance will fit".
> So yes, absolutely with an Oracle DB you would get an ORA-XXXXX
> exception in that case, along with a trace file that told you where
> things went off the rails. Which is exactly what we don't have
> here.
That is precisely the opposite of what I was saying. Again, getting no
results is *not* an error. It's normal behaviour and indicates there
were no compute hosts that met the requirements of the request. This is
not an error or exceptional behaviour. It's simply the result of a query
against the placement database.

If you get zero rows returned, that means you need to determine what
part of your request caused the winnowed result set to go from >0 rows
to 0 rows.

And what we've been discussing is exactly the process by which such an
investigation could be done. There are two options: do the investigation
*inline* as part of the original request or do it *offline* after the
original request returns 0 rows.

Doing it inline means splitting the large query we currently construct
into multiple queries (for each related group of requested resources
and/or traits) and logging the number of results grabbed for each of
those queries.
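
To make the inline idea concrete, here is a rough sketch in plain Python.
It is not Nova or placement code: the Provider class, the helper names
and the in-memory list standing in for the placement database are all
made up for illustration. The only point is that each group of
requirements is applied as its own step and the surviving count is
logged after each one.

```python
import logging
from dataclasses import dataclass, field

LOG = logging.getLogger("placement_sketch")


@dataclass
class Provider:
    """Hypothetical stand-in for a compute node's inventory record."""
    name: str
    free: dict                      # resource class -> unused amount
    traits: set = field(default_factory=set)


def has_resources(amounts):
    """Build a predicate: provider has at least the requested amounts."""
    return lambda p: all(p.free.get(rc, 0) >= n for rc, n in amounts.items())


def has_traits(required):
    """Build a predicate: provider exposes every required trait."""
    return lambda p: set(required) <= p.traits


def filter_by_stage(providers, stages):
    """Apply each (label, predicate) in order, logging survivor counts."""
    remaining = list(providers)
    counts = []
    for label, predicate in stages:
        remaining = [p for p in remaining if predicate(p)]
        counts.append((label, len(remaining)))
        LOG.debug("after %s: %d candidate(s)", label, len(remaining))
        if not remaining:
            break                   # this stage took the set to zero
    return remaining, counts


providers = [
    Provider("cn1", {"VCPU": 4, "MEMORY_MB": 2048}, {"HW_CPU_X86_AVX2"}),
    Provider("cn2", {"VCPU": 1, "MEMORY_MB": 8192}),
]
stages = [
    ("VCPU>=2", has_resources({"VCPU": 2})),
    ("MEMORY_MB>=4096", has_resources({"MEMORY_MB": 4096})),
    ("trait HW_CPU_X86_AVX2", has_traits({"HW_CPU_X86_AVX2"})),
]
survivors, counts = filter_by_stage(providers, stages)
```

With this sample data the log would show one candidate left after the
VCPU stage and zero after the MEMORY_MB stage, which is the piece of
information the operator is after. The cost is the same as in the real
case: several smaller queries instead of one large one.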

Doing it offline means developing some diagnostic tool that an operator
could run (similar to what Wind River did with [1]). The issue with that
is that the diagnostic tool can only see resource usage as it is when
the tool runs, not as it was when the original request that returned
0 rows ran.

[1]
https://github.com/starlingx-staging/stx-nova/commit/71acfeae0d1c59fdc77704527d763bd85a276f9a#diff-94f87e728df6465becce5241f3da53c8R330
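
For the offline case, one possible shape for such a tool is to re-run
the request against a snapshot with one requirement dropped at a time.
The sketch below is only an illustration under that assumption; it is
not what the tool in [1] does, and the snapshot dict and function names
are hypothetical. It also inherits the limitation above: the snapshot
reflects usage at the time the tool runs.

```python
def matches(snapshot, constraints):
    """Names of providers whose free resources satisfy every constraint."""
    return [
        name
        for name, free in snapshot.items()
        if all(free.get(rc, 0) >= amt for rc, amt in constraints.items())
    ]


def blocking_constraints(snapshot, constraints):
    """Constraints whose removal alone makes the request satisfiable.

    Returns an empty list if the full request already matches, or if no
    single constraint is responsible (two or more would have to go).
    """
    if matches(snapshot, constraints):
        return []
    blockers = []
    for rc in constraints:
        relaxed = {k: v for k, v in constraints.items() if k != rc}
        if matches(snapshot, relaxed):
            blockers.append(rc)
    return blockers


# Hypothetical snapshot: provider name -> free amount per resource class.
snapshot = {
    "cn1": {"VCPU": 8, "MEMORY_MB": 16384, "DISK_GB": 20},
    "cn2": {"VCPU": 2, "MEMORY_MB": 4096, "DISK_GB": 500},
}
request = {"VCPU": 4, "MEMORY_MB": 8192, "DISK_GB": 100}
blockers = blocking_constraints(snapshot, request)
```

Here dropping DISK_GB is the only single change that lets the request
fit (on cn1), so that is the requirement the operator would look at
first.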
> If I understand your perspective correctly, it's basically that
> placement is working as designed, so there's nothing more to do except
> pore over debug output. Can we consider:
>
> (1) that might not always be true if there are bugs
Bugs in the placement service are an entirely separate issue. They do
occur, of course, but we're not talking about that here.
> (2) even when it is technically true, from the user's perspective, I'd
> posit that it's rare that a user requests an instance with the express
> intent of not launching an instance. (?) If they're "debugging" this
> issue, it means there's a misconfiguration or some unexpected state that
> they have to go find.
Depends on what you have in mind as a "user". If I launch an instance in
an AWS region, I'd be very surprised if the service told me there was
nowhere to place my instance unless of course I'd asked it to launch an
instance with requirements that exceeded AWS' ability to launch.
If you're talking about a user of a private IT cloud with a single rack
of compute hosts, that user might very well expect to see a return of
"sorry mate, there's nowhere to put your request right now.".
There is no explicit or implicit SLA or guarantee that Nova will somehow
create a place to put an instance when no such place exists.
Best,
-jay