Hi everyone,

This is a very open-ended discussion that might bring up some interesting ideas: either I'm going about this the wrong way, or perhaps this is something we need to think more about.

As we increase the instrumentation of the clouds we run, one interesting idea was to measure the "ERROR" instance rate to see if more VMs than usual are hitting the ERROR state. The problem is that, at the moment, there are a few ways an instance can hit the ERROR state which are *not* a cause for concern for the operator.

For example, if you're booting an instance from a volume and you're at your quota, the instance will fail to boot and end up in the "ERROR" state. If we're instrumenting that, we're likely to be alerted on a high number of ERROR-state instances, but realistically there's not much we can do about it.

However, if we're getting a lot of ERROR instances because of "NoValidHost", or because of other genuine failures such as in libvirt or RBD, then we'd probably want to be alerted on those.

Does anyone have any ideas on how we can either better instrument this, or perhaps look at how, inside Nova, we could have a "system error" and a "user error"?

Thanks :)
Mohammed
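
P.S. To make the idea a bit more concrete, here is a rough sketch of the kind of split I have in mind, done on the monitoring side rather than in Nova. It assumes each instance record has already been fetched into a plain dict with a "status" and a "fault_message" key (both names are hypothetical), and the substring lists are illustrative guesses that would need tuning against the fault messages a given cloud actually produces. It is not how Nova classifies anything today.

```python
from collections import Counter

# Hypothetical substrings, not an authoritative list of Nova fault messages.
USER_ERROR_PATTERNS = ("quota", "limit exceeded")
SYSTEM_ERROR_PATTERNS = ("no valid host", "libvirt", "rbd")


def classify_fault(message):
    """Return 'user', 'system', or 'unknown' for a fault message string."""
    text = (message or "").lower()
    # User-side patterns are checked first, so a quota failure is never
    # counted as an operator problem even if it also mentions a backend.
    if any(pattern in text for pattern in USER_ERROR_PATTERNS):
        return "user"
    if any(pattern in text for pattern in SYSTEM_ERROR_PATTERNS):
        return "system"
    return "unknown"


def count_error_classes(instances):
    """Count ERROR instances per class.

    Each instance is a dict with (hypothetical) keys 'status' and
    'fault_message'. Instances not in ERROR state are ignored.
    """
    counts = Counter()
    for instance in instances:
        if instance.get("status") == "ERROR":
            counts[classify_fault(instance.get("fault_message"))] += 1
    return counts
```

With something like this, the alert would fire on the "system" count (and perhaps the "unknown" count), while the "user" count is only graphed. The obvious weakness is that it depends on matching message text, which is exactly why a proper user/system distinction inside Nova would be nicer.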