On 11/24/2019 3:31 PM, Mohammed Naser wrote:
> For example, if you're booting an instance from a volume and you're at your quota, the instance will fail to boot, end up in "ERROR" state. If we're instrumenting that, we are likely going to be alerted on a high # of ERROR state instances but there's not much that we can do about it realistically.
We could probably eliminate most of that specific scenario by checking volume quota in the API, like we do for port quota, so the user would get a 403 error rather than one or more instances that failed to build and ended up in ERROR status.

I know this isn't the gist of your email, but my point is that we have historically just punted and set instances to ERROR status in a lot of cases, and that might not necessarily be correct, e.g. [1]. So drilling in on common cases where the operation fails and the instance is just put into ERROR status is worthwhile IMO. If you reduce the number of times an instance goes to ERROR status for predictable reasons, then your alerts go down, and when you do get alerted you should have a smaller set of things that you can reliably filter into an "ignore" bucket, like quota-related failures.
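To make the "ignore" bucket idea concrete, here is a rough sketch of a filter over server records as returned by the compute API, where a server in ERROR status carries a `fault` with a `message`. The function name, the pattern list, and the sample fault messages are all made up for illustration; the real messages depend on the release and on which service raised the error, so treat the patterns as something to tune rather than a definitive list.

```python
# Hypothetical substrings that mark a failure as user-caused (not a real Nova list).
IGNORABLE_PATTERNS = ("quota", "exceeds allowed")


def bucket_for_server(server):
    """Return 'ok', 'ignore', or 'page' for one server dict from the compute API."""
    if server.get("status") != "ERROR":
        return "ok"
    fault = server.get("fault") or {}
    message = (fault.get("message") or "").lower()
    # Predictable, user-caused failures go to the ignore bucket.
    if any(pattern in message for pattern in IGNORABLE_PATTERNS):
        return "ignore"
    # Anything else in ERROR status is worth a human look.
    return "page"


# Made-up sample records, shaped like the server detail response.
servers = [
    {"id": "a", "status": "ACTIVE"},
    {"id": "b", "status": "ERROR",
     "fault": {"code": 500, "message": "Volume quota exceeded for project"}},
    {"id": "c", "status": "ERROR",
     "fault": {"code": 500, "message": "No valid host was found."}},
]
buckets = {s["id"]: bucket_for_server(s) for s in servers}
```

Matching on message text is brittle, which is part of why moving the quota check into the API is the better long-term fix; the filter only reduces noise in the meantime.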
> Does anyone have any ideas on how we can either better instrument this, or perhaps seeing how inside Nova, we have a "system error" and a "user error"
I would think there are also versioned notifications involved in these operations with error payloads that you could be inspecting to see if it's really something for which you need to be paged. That might get pretty whack-a-mole though since lots of operations can fail in lots of ways and trying to whitelist that would be hard (see the conversation in [2]).

Every operation will have instance action events associated with it as well and if one of the events fails, e.g. compute_prep_resize fails due to a resize resource claim failure on the dest compute, the exception traceback will be recorded in the event and available in the os-instance-actions REST API for admins by default policy. So like error notifications, mining instance action events might be something to look into.

[1] https://bugs.launchpad.net/nova/+bug/1811235
[2] https://bugs.launchpad.net/nova/+bug/1742102

-- 
Thanks,
Matt
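As a rough illustration of mining instance action events, the sketch below pulls the failed events out of one already-parsed action detail from the os-instance-actions API. It assumes each event carries `event`, `result`, and `traceback` keys and that a failed event has a `result` of "Error"; the helper name and the sample payload, including the traceback text, are invented for the example and should be checked against the API reference for your release.

```python
def failed_events(instance_action):
    """Return (event name, last traceback line) for each failed event in an action."""
    failures = []
    for event in instance_action.get("events", []):
        if event.get("result") != "Error":
            continue
        traceback_text = (event.get("traceback") or "").strip()
        # The final traceback line usually names the exception that was raised.
        last_line = traceback_text.splitlines()[-1] if traceback_text else ""
        failures.append((event.get("event"), last_line))
    return failures


# Made-up payload shaped like the "instanceAction" body of an action detail.
action = {
    "action": "resize",
    "events": [
        {"event": "conductor_migrate_server", "result": "Success",
         "traceback": None},
        {"event": "compute_prep_resize", "result": "Error",
         "traceback": "Traceback (most recent call last):\n"
                      "  File \"example.py\", line 1, in prep_resize\n"
                      "ExampleClaimError: not enough free resources"},
    ],
}
failures = failed_events(action)
```

The exception name on the last traceback line could then feed the same kind of ignore-or-page decision as the fault message, with the same caveat that the list of failure modes is long.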