[openstack-dev] RFC - Icehouse logging harmonization

Sean Dague sean at dague.net
Wed Oct 23 19:55:36 UTC 2013


On 10/23/2013 03:35 PM, Robert Collins wrote:
> On 24 October 2013 08:28, John Griffith <john.griffith at solidfire.com> wrote:
>> So I touched on this a bit in my earlier post but want to reiterate here and
>> maybe clarify a bit.  I agree that cleaning up and standardizing the logs is
>> a good thing, and particularly removing unhandled exception messages would
>> be good.  What concerns me however is the approach being taken here of
>> saying things like "Error level messages are banned from Tempest runs".
>>
>> The case I mentioned earlier of the negative test is a perfect example.
>> There's no way for Cinder (or any other service) to know the difference
>> between the end user specifying/requesting a non-existent volume and a valid
>> volume being requested that for some reason can't be found.  I'm not quite
>> sure how you place a definitive rule like "no error messages in logs" unless
>> you make your tests such that you never run negative tests?
>
> Let me check that I understand: you want to check that when a user
> asks for a volume that doesn't exist, they don't get it, *and* that
> the reason they didn't get it was due to Cinder detecting it's
> missing, not due to e.g. cinder throwing an error and returning 500 ?
>
> If so, that seems pretty straightforward; a) check the error that is
> reported (it should be a 404 and contain an explanation which we can
> check) and b) check the logs to see that nothing was logged (because a
> server fault would be logged).
>
>> There are other cases in cinder as well that I'm concerned about.  One
>> example is iscsi target creation, there are a number of scenarios where this
>> can fail under certain conditions.  In most of these cases we now have retry
>> mechanisms or alternate implementations to complete the task.  The fact is,
>> however, that a call somewhere in the system failed, and in my opinion this
>> should be something that stands out in the logs.  Maybe this particular case
>> would be well suited to being a warning rather than an error, and that's
>> fine.  My point, though, is that I think some thought needs to go into this
>> before making blanket rules, and especially gating criteria, that say "no
>> error messages in logs".

Absolutely agreed. That's why I wanted to kick off this discussion and 
am thinking about how we get to agreement by Icehouse (giving this lots 
of time to bake and getting different perspectives in here).

As for the short-term step of failing jobs in tempest because they've got 
errors in the logs, we've got a whole whitelist mechanism right now for 
"acceptable errors". Over time I'd love to shrink that to 0. But that's 
going to be a collaboration between the QA team and the specific core 
projects to make sure that's the right call in each case. Who knows, 
maybe there are generally agreed-to ERROR conditions that we trigger, 
but we'll figure that out over time.
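
To make the whitelist idea concrete, here is a minimal sketch of that kind 
of check. It is not the actual Tempest gate code: the pattern list, the 
function name, and the log line layout are all assumptions made up for 
illustration.

```python
import re

# Hypothetical whitelist: regexes for ERROR lines that are tolerated for now.
# The goal described above is to shrink this list to zero over time.
ALLOWED_ERRORS = [
    re.compile(r"iSCSI target creation failed"),
]

# Assumes the level name appears as a standalone word in each log line.
LEVEL_RE = re.compile(r"\b(ERROR|CRITICAL)\b")


def unexpected_errors(log_lines, allowed=ALLOWED_ERRORS):
    """Return ERROR/CRITICAL lines that match no whitelist pattern."""
    flagged = []
    for line in log_lines:
        if not LEVEL_RE.search(line):
            continue  # below ERROR, not interesting for the gate
        if any(pattern.search(line) for pattern in allowed):
            continue  # known, currently accepted error
        flagged.append(line)
    return flagged
```

A job would fail only if unexpected_errors() returns a non-empty list, so 
removing a pattern from the whitelist is the point where a project and QA 
agree that the message should no longer be an ERROR.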

I think the iscsi example is a good case for WARNING, which is the same 
level we use when we fail to schedule a resource (compute / volume). 
Especially because we try to recover now. If we fail to recover, ERROR 
is probably called for. But if we actually failed to allocate a volume, 
we'd end up failing the tests anyways, which means the ERROR in the log 
wouldn't be a problem in and of itself.
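
As a rough sketch of that split (WARNING while we are still recovering, 
ERROR only once recovery has failed), something like the following would 
do. This is not Cinder's code; create_with_retry and the create_target 
callable are hypothetical names, and only the standard library logging 
module is assumed.

```python
import logging

LOG = logging.getLogger(__name__)


def create_with_retry(create_target, attempts=3):
    """Call create_target(), retrying on failure.

    Failures we will retry are logged at WARNING; only the final,
    unrecovered failure is logged at ERROR and re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return create_target()
        except Exception as exc:
            if attempt < attempts:
                LOG.warning("target creation failed (attempt %d/%d), "
                            "retrying: %s", attempt, attempts, exc)
            else:
                LOG.error("target creation failed after %d attempts: %s",
                          attempts, exc)
                raise
```

With this shape, a run where the retry succeeds leaves only WARNING lines 
in the log, and an ERROR line always coincides with an operation that 
really failed.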

> I agree thought and care is needed. As a deployer my concern is that
> the only time ERROR is logged in the logs is when something is wrong
> with the infrastructure (rather than a user asking for something
> stupid). I think my concern and yours can both be handled at the same
> time.

Right, and I think this is the perspective that I'm coming from. Our 
logs (at INFO and up) are UX to our cloud admins.

We should be pretty sure that we know something is a problem if we tag 
it as an ERROR, or CRITICAL. Because that's likely to be something that 
negatively impacts someone's day.

If we aren't completely sure your cloud is on fire, but we're pretty 
sure something is odd, WARNING is appropriate.

If it's no good, but we have no way to test if it's a problem, it's just 
INFO. I really think the "not found" case falls more into standard INFO.
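
For the "not found" case specifically, a minimal sketch of what I mean is 
below. The handler, the VolumeNotFound exception and the lookup callable 
are invented for illustration and are not the real Cinder API; the point 
is only which level each branch logs at.

```python
import logging

LOG = logging.getLogger(__name__)


class VolumeNotFound(Exception):
    """Hypothetical exception for a volume id that does not exist."""


def show_volume(lookup, volume_id):
    """Return (http_status, body) for a volume lookup."""
    try:
        return 200, lookup(volume_id)
    except VolumeNotFound:
        # The user asked for something that isn't there: normal operation.
        LOG.info("volume %s not found", volume_id)
        return 404, {"error": "volume %s could not be found" % volume_id}
    except Exception:
        # Something is actually wrong on our side: log at ERROR with traceback.
        LOG.exception("unexpected failure looking up volume %s", volume_id)
        return 500, {"error": "internal server error"}
```

A negative test can then check for the 404 and its explanation, and also 
check that nothing at ERROR was logged, since only a server fault takes 
the last branch.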

Again, more concrete instances, like the iscsi case, are probably the 
most helpful. I think in the abstract this problem is too hard to solve, 
but with examples, we can probably come to some consensus.

	-Sean

-- 
Sean Dague
http://dague.net


