[openstack-dev] [nova] How to debug no valid host failures with placement

Michael Glasgow michael.glasgow at oracle.com
Thu Aug 2 22:18:33 UTC 2018


On 08/02/18 15:04, Chris Friesen wrote:
> On 08/02/2018 01:04 PM, melanie witt wrote:
> 
>> The problem is an infamous one, which is, your users are trying to boot
>> instances and they get "No Valid Host" and an instance in ERROR state. 
>> They contact support, and now support is trying to determine why 
>> NoValidHost happened. In the past, they would turn on DEBUG log level 
>> on the nova-scheduler, try another request, and take a look at the 
>> scheduler logs.
> 
> At a previous Summit[1] there were some operators that said they just 
> always ran nova-scheduler with debug logging enabled in order to deal 
> with this issue, but that it was a pain [...]

I would go a bit further and say it's likely to be unacceptable on a 
large cluster.  It's expensive to store all those logs and to manually 
comb through them to troubleshoot this kind of failure, which can 
happen frequently in some deployments.  There are also performance and 
security concerns with leaving debug on all the time.
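(Just to spell out what that stopgap involves: it amounts to roughly 
the following on the scheduler host, then reproducing the request.  The 
debug option itself is real; the exact file location and the service 
restart mechanism depend on the deployment.)

    # /etc/nova/nova.conf on the host running nova-scheduler
    [DEFAULT]
    debug = True

    # then restart the scheduler service; the unit name varies by
    # distro/deployment, e.g.:
    # systemctl restart openstack-nova-scheduler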

As to "defining the problem", I think it's what Melanie said.  It's 
about asking for X and the system saying, "sorry, can't give you X" with 
no further detail or even means of discovering it.
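Concretely, what the operator typically has to work with today is just 
the fault recorded on the instance, something along these lines 
(abbreviated; exact wording varies by release):

    $ openstack server show <instance-uuid> -c status -c fault
    | status | ERROR                                                 |
    | fault  | {'code': 500, 'message': 'No valid host was found...'}|

That tells you the request failed, but nothing about which filters 
eliminated which hosts, or whether placement returned any allocation 
candidates at all.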

More generally, any time a service fails to deliver a resource it is 
primarily designed to deliver, it seems to me that at this stage it 
should probably be taken a bit more seriously than "check the log 
file, maybe there's something in there?"  From the user's perspective, 
if nova fails to produce an instance, or cinder fails to produce a 
volume, or neutron fails to build a subnet, that's kind of a big deal, 
right?

In such cases, would it be possible to generate a detailed exception 
object which contains all the information needed to determine why that 
specific failure occurred?  Ideally the operator should be able to 
correlate those exceptions with the associated objects, e.g. the 
instance in ERROR state in this case, so that, given the failed 
instance's ID, they can quickly remedy the user's problem without 
reading megabytes of log files.  If there's a way to make this error 
handling generic across services to some extent, that would be great 
for operators.
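To make that concrete, here is a very rough sketch of the kind of 
structured failure record I have in mind.  None of these names exist 
in Nova or oslo today; it's purely illustrative of the data that would 
need to be captured and keyed by the failed resource's ID:

    # Hypothetical sketch only -- not an existing Nova/oslo API.
    # Idea: when scheduling fails, persist a structured record keyed
    # by the instance UUID, capturing what was requested and why each
    # candidate (or placement itself) rejected it.

    import datetime
    import json


    class ResourceDeliveryFailure(Exception):
        """Raised when a service cannot deliver a requested resource."""

        def __init__(self, resource_type, resource_id, requested,
                     rejections):
            self.resource_type = resource_type  # e.g. "instance"
            self.resource_id = resource_id      # e.g. instance UUID
            self.requested = requested          # what the user asked for
            self.rejections = rejections        # per-candidate reasons
            self.timestamp = datetime.datetime.utcnow()
            super().__init__(
                "%s %s: no candidate could satisfy the request"
                % (resource_type, resource_id))

        def to_dict(self):
            """Serializable form an operator tool could store/query."""
            return {
                "resource_type": self.resource_type,
                "resource_id": self.resource_id,
                "requested": self.requested,
                "rejections": self.rejections,
                "timestamp": self.timestamp.isoformat(),
            }


    # What support might retrieve, given only the failed instance ID,
    # instead of grepping scheduler logs:
    failure = ResourceDeliveryFailure(
        resource_type="instance",
        resource_id="f5a8c9e2-1111-4222-8333-444455556666",
        requested={"vcpus": 4, "memory_mb": 8192, "disk_gb": 80},
        rejections=[
            {"host": "compute-01",
             "reason": "insufficient VCPU: requested 4, free 2"},
            {"host": "compute-02",
             "reason": "rejected by AggregateInstanceExtraSpecsFilter"},
        ],
    )
    print(json.dumps(failure.to_dict(), indent=2))

Support could then look the record up by the failed instance's UUID, 
via an API or a small tool, instead of correlating request IDs across 
scheduler logs by hand.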

Such a framework might eventually hook into internal ticketing systems 
and maintenance reporting, or provide a starting point for self-healing 
mechanisms, but initially the aim would just be to give the operator 
the bare minimum of information needed for more efficient break-fix.

It could be a big investment, but it also doesn't seem like "optional" 
functionality from a large operator's perspective.  "Enable debug and 
try again" is just not good enough IMHO.

-- 
Michael Glasgow


