Open Stack

Thu Jul 24 12:46:48 UTC 2014

On Jul 24, 2014, at 8:23 AM, Chris Dent <chdent at redhat.com> wrote:

> On Wed, 23 Jul 2014, Doug Hellmann wrote:
>>> That's bad enough, but much worse, this will catch all sorts of
>>> exceptions, even ones that are completely unexpected and ought to
>>> cause a more drastic (and thus immediately informative) failure
>>> than 'something failed’.
>> 
>> In most cases, we chose to handle errors this way to keep the service
>> running even in the face of “bad” data, since we are trying to
>> collect an audit stream and we don’t want to miss good data if we
>> encounter bad data.
> 
> a) I acknowledge that you're actually one of the "elders" to whom I
>   referred earlier so I hesitate to disagree with you here, so feel
>   free to shoot me down, but…

I don’t claim any special status except that I was there and am trying to provide background on why things are as they are. :-)

> 
> b) "keep the service running" in the face of "bad" is exactly the
>   sort or reason why I don't like this idiom. I think those
>   exceptions which we can enumerate as causes of "bad" should be
>   explicitly caught and explicitly logged and the rest of them
>   should explicitly cause death exactly because we don't know
>   what happened and the situation is _actually_ exceptional and we
>   ought to know now, not later, that it happened, and not some
>   number of minutes or hours or even days later when we notice that
>   some process, though still running, hasn't done any real work.
> 
>   That kind of "keep it alive" rationale often leads to far more
>   complex debugging situations than otherwise.
> 
> In other words there are two kinds of "bad": The bad that we know
> and can expect (even though we don't want it) and the bad that we
> don't know and shouldn't expect. These should be handled
> differently.
> 
> A compromise position (if one is needed) would be something akin to,
> but not exactly like:
> 
>    except (TheVarious, ExceptionsIKnow) as exc:
>        LOG.warning('shame, no workie, but you know, it happens: %s', exc)
>    except Exception:
>        LOG.exception('crisis!')
> 
> This makes it easier to distinguish between the noise and the nasty,
> which I've found to be quite challenging thus far.

There are not, now, any particular guarantees about notification payloads [1], though, and so while we have collector plugins tailored to the types of notifications we see from other services today, the number, type, and content of those notifications are all subject to change. When new services come online and start sending ceilometer data (or an existing service defines a new event type), the event collection code can usually handle it but work needs to be done to convert events to samples for metering. If the payload of a known event type changes, existing code to convert the event to samples may fail to find a field. 

Having a hard-fail error handler is useful in situations where continuing operation would make the problem worse *and* the deployer can fix the problem. Being unable to connect to a database might be an example of this. However, we don’t want the ceilometer collector to shutdown if it receives a notification it doesn’t understand because discarding a malformed message isn’t going to make other ceilometer operations any worse, and seeing such a message isn’t a situation the deployer can fix. Catching KeyError, IndexError, AttributeError, etc. for those cases would be useful if we were going to treat those exception types differently somehow, but we don’t.

That said, it’s possible we could tighten up some of the error handling outside of event processing loops, so as I said, if you have a more specific proposal for places you think we can predict the exception type reliably, we should talk about those directly instead of the general case. You mention distinguishing between “the noise and the nasty” — do you have some logs you can share?

Doug

[1] http://lists.openstack.org/pipermail/openstack-dev/2014-July/039858.html

> 
> -- 
> Chris Dent tw:@anticdent freenode:cdent
> https://tank.peermore.com/tanks/cdent_______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Open Stack

[openstack-dev] [ceilometer] overuse of 'except Exception'

OpenStack

Community

Documentation

Branding & Legal