[nova] splitting "ERROR" into 2 states
Hi everyone,

This is an open-ended discussion that might bring up some interesting ideas: either I'm going about this the wrong way, or perhaps this is something we need to think more about.

As we increase the instrumentation of the clouds we run, one interesting idea was to measure the rate of instances entering "ERROR" state, to see if more VMs than usual are hitting it. The problem is that there are currently several ways an instance can hit ERROR state which are *not* a cause for concern for the operator.

For example, if you're booting an instance from a volume and you're at your quota, the instance will fail to boot and end up in "ERROR" state. If we're instrumenting that, we are likely going to be alerted on a high number of ERROR state instances, but there's not much that we can realistically do about it.

However, if we're getting a lot of ERROR instances because of "NoValidHost" or because of other genuine failures, such as in libvirt or RBD, then we'd probably want to be alerted on those.

Does anyone have any ideas on how we can either better instrument this, or perhaps distinguish inside Nova between a "system error" and a "user error"?

Thanks :)
Mohammed
On 11/24/2019 3:31 PM, Mohammed Naser wrote:
> For example, if you're booting an instance from a volume and you're at your quota, the instance will fail to boot and end up in "ERROR" state. If we're instrumenting that, we are likely going to be alerted on a high number of ERROR state instances, but there's not much that we can realistically do about it.
We could probably eliminate most of that specific scenario by checking volume quota in the API like we do for port quota, so the user would get a 403 error rather than one or more instances that failed to build in ERROR status.

I know this isn't the gist of your email, but my point is we have historically just punted and set instances to ERROR status in a lot of cases, and that might not necessarily be correct, e.g. [1]. So drilling in on common cases where the operation fails and the instance is just put to ERROR status is worthwhile IMO. If you reduce the number of times an instance goes to ERROR status for predictable reasons, then your alerts go down, and when you do get alerted you should have a smaller set of things that you can reliably filter into an "ignore" bucket, like quota-related failures.
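As a rough illustration of the "ignore" bucket idea, the sketch below sorts ERROR instances by the text of their fault message. It assumes each server is a dict shaped like an entry from the compute API's server detail listing, with a `status` field and, for ERROR instances, a `fault` object carrying a `message`. The patterns, the function names, and the two bucket names are hypothetical; real fault messages vary by release and driver, so the patterns would need tuning against what a given cloud actually emits.

```python
import re
from collections import Counter

# Hypothetical patterns for failures an operator cannot act on.
# These are illustrative only; adjust them to the messages you really see.
IGNORABLE_PATTERNS = [
    re.compile(r"quota", re.IGNORECASE),
    re.compile(r"limit\s*exceeded", re.IGNORECASE),
]


def classify_fault(server):
    """Return 'ignore', 'alert', or None (not in ERROR) for one server dict."""
    if server.get("status") != "ERROR":
        return None
    # An ERROR instance may lack a fault record; treat that as alert-worthy.
    message = (server.get("fault") or {}).get("message", "")
    if any(pattern.search(message) for pattern in IGNORABLE_PATTERNS):
        return "ignore"
    return "alert"


def error_counts(servers):
    """Count ERROR instances per bucket so only 'alert' feeds the pager."""
    counts = Counter()
    for server in servers:
        bucket = classify_fault(server)
        if bucket is not None:
            counts[bucket] += 1
    return counts
```

Anything that does not match an ignorable pattern falls into `alert`, so an unrecognised failure is surfaced rather than silently dropped. The trade-off is the whack-a-mole maintenance of the pattern list.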
> Does anyone have any ideas on how we can either better instrument this, or perhaps distinguish inside Nova between a "system error" and a "user error"?
I would think there are also versioned notifications involved in these operations with error payloads that you could be inspecting to see if it's really something for which you need to be paged. That might get pretty whack-a-mole though, since lots of operations can fail in lots of ways and trying to whitelist that would be hard (see the conversation in [2]).

Every operation will have instance action events associated with it as well, and if one of the events fails, e.g. compute_prep_resize fails due to a resize resource claim failure on the dest compute, the exception traceback will be recorded in the event and available in the os-instance-actions REST API for admins by default policy. So like error notifications, mining instance action events might be something to look into.

[1] https://bugs.launchpad.net/nova/+bug/1811235
[2] https://bugs.launchpad.net/nova/+bug/1742102

--
Thanks,
Matt
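To make the instance-action idea a bit more concrete, here is a minimal sketch of pulling failed events out of one action record. It assumes the input is the already-parsed JSON body of an os-instance-actions detail response: an `instanceAction` object with an `events` list whose entries carry `event`, `result`, and `traceback`. Those field names should be checked against the API reference for the microversion in use. The function name and the shape of the returned summary are hypothetical, and no HTTP call is made here.

```python
def failed_events(action_body):
    """Summarise events whose result is 'Error' in one instance action.

    action_body is assumed to be the parsed JSON of an os-instance-actions
    detail response, i.e. {"instanceAction": {"action": ..., "events": [...]}}.
    """
    action = action_body.get("instanceAction", {})
    failures = []
    for event in action.get("events", []):
        if event.get("result") != "Error":
            continue
        # The traceback may be absent or null, e.g. for non-admin callers.
        traceback_text = event.get("traceback") or ""
        lines = traceback_text.strip().splitlines()
        failures.append({
            "action": action.get("action"),
            "event": event.get("event"),
            # The final traceback line usually names the exception raised.
            "last_line": lines[-1] if lines else "",
        })
    return failures
```

The `last_line` value can then be fed into whatever user-versus-system classification the operator settles on, for instance to separate a resource claim failure from a hypervisor fault.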
participants (2): Matt Riedemann, Mohammed Naser