[nova] splitting "ERROR" into 2 states

24 Nov 2019

Hi everyone,

This is a very open-ended discussion that might bring up some
interesting ideas: either I'm going about this the wrong way, or
perhaps this is something we need to think more about.

As we go about increasing the instrumentation of the clouds we run,
one of the interesting ideas was to measure the "ERROR" instance rate
to see if more VMs than usual are hitting ERROR state.
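
As a rough illustration of the measurement itself, here is a minimal
sketch that computes the fraction of instances in ERROR from a list of
status strings. The function name and the sample data are made up for
illustration; how the statuses are collected (API, database, exporter)
is left out on purpose.

```python
from collections import Counter


def error_rate(statuses):
    """Return the fraction of instances whose status is ERROR.

    `statuses` is a plain list of status strings, however they were
    collected. An empty list yields 0.0 rather than dividing by zero.
    """
    if not statuses:
        return 0.0
    counts = Counter(s.upper() for s in statuses)
    return counts["ERROR"] / len(statuses)


# Hypothetical sample: one instance out of four is in ERROR.
sample = ["ACTIVE", "ERROR", "ACTIVE", "BUILD"]
print(error_rate(sample))
```

A raw rate like this cannot tell the different kinds of ERROR apart.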

The problem right now is that an instance can hit the ERROR state for
a few reasons that are *not* a cause for concern for the operator.

For example, if you're booting an instance from a volume and you're at
your quota, the instance will fail to boot and end up in the "ERROR"
state. If we're instrumenting that, we're likely to be alerted on a
high number of ERROR-state instances, but realistically there's not
much that we can do about it.

However, if we're getting a lot of ERROR instances because of
"NoValidHost", or because of other genuine failures such as in libvirt
or RBD, then we'd probably want to be alerted on those.
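
To make the distinction concrete, here is a minimal sketch of how the
split could be approximated on the monitoring side today, by looking at
the fault message recorded for an instance. This is not something Nova
provides: the function names, the "user"/"system" labels and the
message patterns are all assumptions, and the patterns in particular
would need to be checked against the messages a given deployment
actually produces.

```python
import re
from collections import Counter

# Hypothetical patterns for failures caused by the user rather than the
# cloud. These are guesses for illustration, not strings taken from Nova.
USER_ERROR_PATTERNS = [
    re.compile(r"quota", re.IGNORECASE),
    re.compile(r"limit exceeded", re.IGNORECASE),
]


def classify_fault(fault):
    """Label a fault dict (with a "message" key) as "user" or "system".

    Anything that does not match a known user-side pattern is treated as
    a system error, so unknown failures still count towards alerting.
    """
    message = (fault or {}).get("message") or ""
    if any(p.search(message) for p in USER_ERROR_PATTERNS):
        return "user"
    return "system"


def count_by_class(faults):
    """Count how many faults fall into each class."""
    return Counter(classify_fault(f) for f in faults)


# Hypothetical fault messages, for illustration only.
faults = [
    {"message": "Quota exceeded for volumes"},
    {"message": "No valid host was found."},
    {"message": "libvirt: connection reset"},
]
print(count_by_class(faults))
```

Defaulting to "system" is a deliberate choice here: a new or unmatched
failure should show up in alerts rather than be silently filed as a
user problem. Matching on message text is fragile, though, which is
part of the reason for asking whether this belongs inside Nova.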

Does anyone have any ideas on how we can either instrument this
better, or perhaps split this up inside Nova so that we have a
"system error" and a "user error"?

Thanks :)
Mohammed

Mohammed Naser