[nova] splitting "ERROR" into 2 states
Matt Riedemann
mriedemos at gmail.com
Mon Nov 25 14:05:41 UTC 2019
On 11/24/2019 3:31 PM, Mohammed Naser wrote:
> For example, if you're booting an instance from a volume and you're at
> your quota, the instance will fail to boot, end up in "ERROR" state.
> If we're instrumenting that, we are likely going to be alerted on a
> high # of ERROR state instances but there's not much that we can do
> about it realistically.
>
We could probably eliminate most of that specific scenario by checking
volume quota in the API like we do for port quota, so the user would get
a 403 error rather than one or more instances that failed to build and
ended up in ERROR status.
I know this isn't the gist of your email, but my point is that we have
historically just punted and set instances to ERROR status in a lot of
cases where that might not necessarily be correct, e.g. [1]. So drilling
in on common cases where the operation fails and the instance is just
put into ERROR status is worthwhile IMO. If you reduce the number of
times an instance goes to ERROR status for predictable reasons, then
your alerts go down, and when you do get alerted you should have a
smaller set of things that you can reliably filter into an "ignore"
bucket, like quota-related failures.
>
> Does anyone have any ideas on how we can either better instrument
> this, or perhaps seeing how inside Nova, we have a "system error" and
> a "user error"
I would think there are also versioned notifications involved in these
operations with error payloads that you could be inspecting to see if
it's really something for which you need to be paged. That might get
pretty whack-a-mole though since lots of operations can fail in lots of
ways and trying to whitelist that would be hard (see the conversation in
[2]).
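To make the notification idea a bit more concrete, here is a rough
sketch of filtering error notifications by exception name before paging
anyone. It assumes the versioned notification layout as I remember it
(event_type ending in ".error", with the fault nested under
"nova_object.data"), so check it against what your deployment actually
emits. The IGNORABLE_EXCEPTIONS set, the helper names and the sample
payload are all made up for illustration.

```python
# Assumption: exception names you are happy to treat as "user errors".
# This set is illustrative only; tune it for your own cloud.
IGNORABLE_EXCEPTIONS = {"OverQuota", "VolumeLimitExceeded", "PortLimitExceeded"}


def is_error_notification(notification):
    """True if the notification's event_type looks like '<something>.error'."""
    return notification.get("event_type", "").endswith(".error")


def fault_exception(notification):
    """Return the exception name from the fault payload, or None if absent."""
    payload = notification.get("payload", {}).get("nova_object.data", {})
    fault = payload.get("fault") or {}
    return (fault.get("nova_object.data") or {}).get("exception")


def should_page(notification):
    """Page on error notifications unless the exception is known-ignorable.

    Errors with no fault details are paged on, to stay on the safe side.
    """
    if not is_error_notification(notification):
        return False
    return fault_exception(notification) not in IGNORABLE_EXCEPTIONS


# Hypothetical, heavily trimmed notification body for illustration.
sample_notification = {
    "event_type": "instance.create.error",
    "priority": "ERROR",
    "payload": {
        "nova_object.data": {
            "uuid": "hypothetical-instance-uuid",
            "fault": {
                "nova_object.data": {
                    "exception": "VolumeLimitExceeded",
                    "exception_message": "illustrative message",
                }
            },
        }
    },
}

if __name__ == "__main__":
    print(should_page(sample_notification))
```

This is still the whack-a-mole approach: anything not in the set pages
you, so the list only ever grows as you learn which failures are noise.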
Every operation will have instance action events associated with it as
well, and if one of the events fails (e.g. compute_prep_resize fails due
to a resize resource claim failure on the destination compute), the
exception traceback will be recorded in the event and is available in
the os-instance-actions REST API, which is admin-only by default policy.
So like error notifications, mining instance action events might be
something to look into.
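As a minimal sketch of that kind of mining: the function below takes the
"instanceAction" dict you would get back from the os-instance-actions
API for a single request and pulls out the failed events. It assumes
each event carries "event", "result" and "traceback" keys with "Error"
as the failing result value. The function name and sample data are
hypothetical, and fetching the action over HTTP is left out.

```python
def failed_events(instance_action):
    """Return (event name, last traceback line) for each failed event."""
    failures = []
    for event in instance_action.get("events", []):
        if event.get("result") != "Error":
            continue
        tb = (event.get("traceback") or "").strip()
        # The last line of a Python traceback names the exception.
        last_line = tb.splitlines()[-1] if tb else ""
        failures.append((event.get("event"), last_line))
    return failures


# Hypothetical action details for illustration, not real output.
sample_action = {
    "action": "resize",
    "request_id": "req-hypothetical",
    "events": [
        {"event": "conductor_migrate_server", "result": "Success",
         "traceback": None},
        {"event": "compute_prep_resize", "result": "Error",
         "traceback": (
             "Traceback (most recent call last):\n"
             '  File "hypothetical.py", line 1, in prep_resize\n'
             "ComputeResourcesUnavailable: illustrative message"
         )},
    ],
}

if __name__ == "__main__":
    for name, reason in failed_events(sample_action):
        print(name, "->", reason)
```

The last traceback line gives you the exception class, which you could
feed into whatever "ignore" bucket logic you already use for alerts.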
[1] https://bugs.launchpad.net/nova/+bug/1811235
[2] https://bugs.launchpad.net/nova/+bug/1742102
--
Thanks,
Matt