Open Stack

Mon Nov 25 14:05:41 UTC 2019

On 11/24/2019 3:31 PM, Mohammed Naser wrote:
> For example, if you're booting an instance from a volume and you're at
> your quota, the instance will fail to boot, end up in "ERROR" state.
> If we're instrumenting that, we are likely going to be alerted on a
> high # of ERROR state instances but there's not much that we can do
> about it realistically.
> 

We could probably eliminate most of that specific scenario by checking 
volume quota in the API like we do for port quota so the user would get 
a 403 error rather than one or more instances that failed to build in 
ERROR status.
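
To make that concrete, here is a rough sketch of an up-front check. This is not Nova's actual code: the QuotaExceeded exception, the check_volume_quota function and its parameters are hypothetical names for illustration, and treating a limit of -1 as "unlimited" is an assumption about how the limit is reported.

```python
class QuotaExceeded(Exception):
    """Hypothetical exception that an API layer would map to HTTP 403."""

    status_code = 403


def check_volume_quota(limit, in_use, requested):
    """Reject a request up front if it would exceed the volume quota.

    All three arguments are plain integers here; where they come from
    (e.g. a call to the volume service) is outside this sketch.
    """
    # Assumption: a negative limit means the quota is unlimited.
    if limit < 0:
        return
    if in_use + requested > limit:
        raise QuotaExceeded(
            "volume quota exceeded: %d in use + %d requested > limit %d"
            % (in_use, requested, limit)
        )
```

The point is only that the failure surfaces as an error on the request, before any instance exists that could end up in ERROR status.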

I know this isn't the gist of your email, but my point is we have 
historically just punted and set instances to ERROR status in a lot of 
cases but that might not necessarily be correct, e.g. [1]. So drilling 
in on common cases where the operation fails and the instance is just 
put to ERROR status is worthwhile IMO. If you reduce the number of times 
an instance goes to ERROR status for predictable reasons, then your 
alerts go down and when you do get alerted you should have a smaller set 
of things that you can reliably filter into an "ignore" bucket, like 
quota-related failures.

> 
> Does anyone have any ideas on how we can either better instrument
> this, or perhaps seeing how inside Nova, we have a "system error" and
> a "user error"

I would think there are also versioned notifications involved in these 
operations with error payloads that you could be inspecting to see if 
it's really something for which you need to be paged. That might get 
pretty whack-a-mole though since lots of operations can fail in lots of 
ways and trying to whitelist that would be hard (see the conversation in 
[2]).
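
A minimal sketch of that kind of filter, assuming the notification has already been consumed and decoded into a dict. The nesting below (payload, nova_object.data, fault, exception) is my recollection of the versioned notification format and should be checked against real samples, and the exception names in the ignore set are placeholders rather than a vetted list.

```python
# Placeholder set of exception class names treated as user-caused.
# Assumption: real deployments would build this list from their own data.
USER_ERROR_EXCEPTIONS = frozenset({"OverQuota", "VolumeLimitExceeded"})


def should_page(notification):
    """Return True if an error notification looks like an operator problem."""
    event_type = notification.get("event_type", "")
    if not event_type.endswith(".error"):
        # Not an error notification, nothing to page on.
        return False
    payload = notification.get("payload", {}).get("nova_object.data", {})
    fault = (payload.get("fault") or {}).get("nova_object.data", {})
    # An error with no recognizable fault is treated as page-worthy.
    return fault.get("exception") not in USER_ERROR_EXCEPTIONS
```

Defaulting to "page" for anything unrecognized keeps the ignore list short, but it is also exactly where the whack-a-mole problem shows up.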

Every operation will have instance action events associated with it as 
well and if one of the events fails, e.g. compute_prep_resize fails due 
to a resize resource claim failure on the dest compute, the exception 
traceback will be recorded in the event and available in the 
os-instance-actions REST API for admins by default policy. So, as with 
error notifications, mining instance action events might be something to 
look into.
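
As a sketch of what that mining could look like, the helpers below work on the decoded "instanceAction" body for a single action. The field names (events, event, result, traceback) and the "Error" result value are what I remember from the API and should be verified against the API reference; the function names are made up for this example.

```python
def failed_events(instance_action):
    """Return the events of one instance action whose result is "Error"."""
    events = instance_action.get("events") or []
    return [event for event in events if event.get("result") == "Error"]


def summarize_failures(instance_action):
    """Return (action, event name, last traceback line) per failed event."""
    summary = []
    for event in failed_events(instance_action):
        traceback_text = (event.get("traceback") or "").strip()
        # The last line of a Python traceback names the exception raised.
        last_line = traceback_text.splitlines()[-1] if traceback_text else ""
        summary.append(
            (instance_action.get("action"), event.get("event"), last_line)
        )
    return summary
```

The last traceback line is usually enough to sort failures into buckets; the full traceback is still there in the event if someone needs to dig in.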

[1] https://bugs.launchpad.net/nova/+bug/1811235
[2] https://bugs.launchpad.net/nova/+bug/1742102

-- 

Thanks,

Matt

[nova] splitting "ERROR" into 2 states
