[openstack-dev] [glance][all] Help with interpreting the log level guidelines
Sean Dague
sean at dague.net
Tue Sep 16 16:30:44 UTC 2014
On 09/16/2014 12:07 PM, Kuvaja, Erno wrote:
>> -----Original Message-----
>> From: Sean Dague [mailto:sean at dague.net]
>> Sent: 16 September 2014 15:56
>> To: openstack-dev at lists.openstack.org
>> Subject: Re: [openstack-dev] [glance][all] Help with interpreting the log level
>> guidelines
>>
>> On 09/16/2014 10:16 AM, Kuvaja, Erno wrote:
>>>> -----Original Message-----
>>>> From: Sean Dague [mailto:sean at dague.net]
>>>> Sent: 16 September 2014 12:40
>>>> To: openstack-dev at lists.openstack.org
>>>> Subject: Re: [openstack-dev] [glance][all] Help with interpreting the
>>>> log level guidelines
>>>>
>>>> On 09/16/2014 06:44 AM, Kuvaja, Erno wrote:
>>>>>> -----Original Message-----
>>>>>> From: Flavio Percoco [mailto:flavio at redhat.com]
>>>>>> Sent: 16 September 2014 10:08
>>>>>> To: openstack-dev at lists.openstack.org
>>>>>> Subject: Re: [openstack-dev] [glance][all] Help with interpreting
>>>>>> the log level guidelines
>>>>>>
>>>>>> On 09/16/2014 01:10 AM, Clint Byrum wrote:
>>>>>>> Excerpts from Sean Dague's message of 2014-09-15 16:02:04 -0700:
>>>>>>>> On 09/15/2014 07:00 PM, Mark Washenberger wrote:
>>>>>>>>> Hi there logging experts,
>>>>>>>>>
>>>>>>>>> We've recently had a little disagreement in the glance team
>>>>>>>>> about the appropriate log levels for http requests that end up
>>>>>>>>> failing due to user errors. An example would be a request to get
>>>>>>>>> an image that does not exist, which results in a 404 Not Found
>>>>>>>>> response.
>>>>>>>>>
>>>>>>>>> On one hand, this event is an error, so DEBUG or INFO seem a
>>>>>>>>> little too low. On the other hand, this error doesn't generally
>>>>>>>>> require any kind of operator investigation or indicate any
>>>>>>>>> actual failure of the service, so perhaps it is excessive to log
>>>>>>>>> it at WARN or ERROR.
>>>>>>>>>
>>>>>>>>> Please provide feedback to help us resolve this dispute if you
>>>>>>>>> feel you can!
>>>>>>>>
>>>>>>>> My feeling is this is an INFO level. There is really nothing the
>>>>>>>> admin should care about here.
>>>>>>>
>>>>>>> Agree with Sean. INFO are useful for investigations. WARN and
>>>>>>> ERROR are cause for alarm.
>>>>>>
>>>>>> +1 this is what we do in Zaqar as well.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> @flaper87
>>>>>> Flavio Percoco
>>>>>>
>>>>>
>>>>> I think the debate here is not limited to 404s. Per the logging
>>>>> guidelines, INFO-level messages should not contain any error-related
>>>>> messages but rather things like components starting/stopping, config
>>>>> info, etc. WARN should not be anything that gets the ops pulled out
>>>>> of bed, and so on.
>>>>>
>>>>> Also, all information that would be of interest to ops should be
>>>>> logged at INFO or above.
>>>>>
>>>>> Now, if we log user errors as WARN, the environment stays supportable
>>>>> even when logging is set as high as WARN, which cleans up the output a
>>>>> lot (since INFO shouldn't contain anything out of order anyway). The
>>>>> current situation is that logging at DEBUG level is the only way to
>>>>> get the information needed to actually run the services and to support
>>>>> them. If we log user errors at INFO we get one step higher, but we
>>>>> still have all the clutter, like every single request in the logs; if
>>>>> that's the direction we want to go, we should revisit our logging
>>>>> guidelines as well.
>>>>>
>>>>> Thus my two euro cents go towards WARN rather than DEBUG, and
>>>>> definitely not INFO.
>>>>
>>>> Part of it is how often you expect things to happen as well. Remember
>>>> that glanceclient operates in the context of "other" processes. When it
>>>> hits a 404 in Glance, it's not running in the Glance context, it's
>>>> running in the Nova context. Which means it needs to think of itself in
>>>> that context.
>>>>
>>>> In that context we got the exception back from Glance, we know the
>>>> image wasn't there. And we know whether or not that's a problem
>>>> (glanceclient actually has no idea if it's a problem or not, we might
>>>> be checking to make sure a thing isn't there, and success for us is the 404).
>>>>
>>>> So actually, I'm back to Jay on this: it should be DEBUG. Nova (or
>>>> whoever the caller is) can decide if the issue warrants something
>>>> larger than that.
>>>>
>>>> This is really the biggest issue with logging in the clients, people
>>>> don't think about the context that they are running in.
>>>>
>>>> -Sean
>>>>
>>>> --
>>>> Sean Dague
>>>> http://dague.net
>>>>
>>>
>>> Sean,
>>>
>>> I'm not sure if we were specific enough here. We're not talking about
>>> client logging but server logging: how should we log events like a
>>> client trying to change protected properties, access a non-existent
>>> image, create duplicate image IDs, etc.?
>>>
>>> So, for example, if Nova is trying to access an image that does not
>>> exist, should we ignore it on the Glance side? Likewise when the user
>>> tries to do something that does not succeed? From my point of view it
>>> makes life much easier if we have information about where the request
>>> failed, rather than just a wsgi return code or having to run the system
>>> with DEBUG logging to get that information.
>>
>> Glance client throws an ERROR on 404 from Glance server -
>> http://logs.openstack.org/81/120781/4/check/check-tempest-dsvm-full/90cb640/logs/screen-n-api.txt.gz?level=ERROR
>>
>> Glance server does not -
>> http://logs.openstack.org/81/120781/4/check/check-tempest-dsvm-full/90cb640/logs/screen-g-api.txt.gz?level=ERROR
>>
>> Which is why I assumed this was where the conversation started.
>>
>> On the server side this should be an INFO with 404 recorded as the wsgi
>> request.
>>
>> On the client this should be debug.
>>
>> Just to get a look at how completely confusing the current scenario is,
>> let's dive through the logs to see what should be changed.
>>
>> Inside glance you also need to address client vs. server. For instance -
>> http://logs.openstack.org/81/120781/4/check/check-tempest-dsvm-full/90cb640/logs/screen-g-api.txt.gz?level=INFO#_2014-09-15_19_42_25_137
>> is incredibly confusing as an INFO, because that's a client call.
>> The next line is the actual wsgi request you care about.
>>
>> swiftclient is equally confusing with -
>> http://logs.openstack.org/81/120781/4/check/check-tempest-dsvm-full/90cb640/logs/screen-g-api.txt.gz?level=INFO#_2014-09-15_19_42_27_206
>>
>> It should not be logging an ERROR there.
>>
>> Also whatever keeps dumping this:
>>
>> 127.0.0.1 - - [15/Sep/2014 19:42:38] code 400, message Bad request syntax
>> ('0')
>> 127.0.0.1 - - [15/Sep/2014 19:42:38] "0" 400 -
>>
>> Needs to be addressed.
>>
>> urllib3.connectionpool needs to be set to a WARN-level cutoff; its INFO
>> messages cloud everything.
>>
>> keystone client shouldn't log at INFO -
>> http://logs.openstack.org/81/120781/4/check/check-tempest-dsvm-full/90cb640/logs/screen-g-api.txt.gz?level=INFO#_2014-09-15_19_47_27_833
>>
>> Knowing on every keystone client request which token version is being used
>> is not INFO material.
>>
>> urllib3.connectionpool logging issue again -
>> http://logs.openstack.org/81/120781/4/check/check-tempest-dsvm-full/90cb640/logs/screen-g-api.txt.gz?level=INFO#_2014-09-15_19_47_52_000
>>
>> And if this WARNING isn't something that's addressable, it should be
>> killed off -
>> http://logs.openstack.org/81/120781/4/check/check-tempest-dsvm-full/90cb640/logs/screen-g-api.txt.gz?level=INFO#_2014-09-15_19_50_59_792
>>
>> -Sean
>>
>> --
>> Sean Dague
>> http://dague.net
>>
>
> I think the discussion started from https://review.openstack.org/#/c/116626/8/glance/api/v1/images.py L#964-
> While trying to update the image we get thrown a 404, and before
> re-raising it we used to log it at INFO; the change proposed WARN (as per
> the guidelines). Now this issue might be user-initiated (a typo in the
> image ID, etc.) or something else. My understanding is based on the
> Logging Standards https://wiki.openstack.org/wiki/LoggingStandards -
> "Info: Usually indicates successful service start/stop, versions and such
> non-error related data" (under Debug - Log Level Definitions). The
> referenced discussion
> http://stackoverflow.com/questions/2031163/when-to-use-log-level-warn-vs-error
> also gives direction I can easily relate to: the comments point out that
> INFO and WARN could be merged, but it's easier to spot issues when INFO is
> kept for basic state changes etc. and WARN is used for potential issues.
>
> As for the Glance Registry client logging the 404 as INFO, I do not see
> what makes it confusing (although, based on the above, it should not be
> INFO but rather WARN). The actual message is clear: we got a 404 from the
> registry, so we return (on the following line) a 404 to the client. With
> that line at DEBUG, we're again at the point where you must run the system
> with DEBUG-level logging or you have no idea where to look for the issue
> unless you know the internals well.
It should not be a WARN. Really, it should not be a WARN. There is
nothing WARN about a 404, it's just a fact of life.
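To make the server-side half of that concrete, here is a minimal sketch of the pattern being argued for (the logger and function names are illustrative stand-ins, not Glance's actual code): the 404 shows up exactly once, as an INFO-level access-log line, never as a WARN or ERROR.

```python
import logging

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger("glance.api")  # illustrative logger name


def handle_get_image(image_id, store):
    """Serve GET /v1/images/<id>; a missing image is a normal outcome."""
    image = store.get(image_id)
    if image is None:
        # Not a service failure: record it once, in the access log, at INFO.
        LOG.info("GET /v1/images/%s -> 404", image_id)
        return 404, None
    LOG.info("GET /v1/images/%s -> 200", image_id)
    return 200, image


# A miss produces one INFO line and a 404 status; nothing at WARN or above.
status, body = handle_get_image("deadbeef", {})
```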
The thing that makes it confusing is that unless you know all the glance
code, you think those 2 lines are different operations. But they are the
same operation: the same logical thing logged from 2 places.
"A man with a watch knows what time it is. A man with two watches is
never sure." - http://en.wikipedia.org/wiki/Segal's_law
Logging the same information twice in slightly different ways just
generates confusion about what's going on.
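The division of labor argued for earlier in the thread can be sketched like this (all names here are hypothetical stand-ins, not glanceclient's real API): the client library records the 404 once, at DEBUG, and raises; the caller, which alone knows whether the missing image is a problem, picks the real severity.

```python
import logging

logging.basicConfig(level=logging.DEBUG)


class NotFound(Exception):
    """Raised by the (hypothetical) client on a server-side 404."""


client_log = logging.getLogger("exampleclient")  # stands in for glanceclient
caller_log = logging.getLogger("nova.compute")   # stands in for the caller


def get_image(image_id, images):
    """Client call: it cannot know whether a 404 matters, so DEBUG + raise."""
    if image_id not in images:
        client_log.debug("GET /v1/images/%s -> 404", image_id)
        raise NotFound(image_id)
    return images[image_id]


def boot_instance(image_id, images):
    """Caller 1: the image must exist, so here a 404 really is an ERROR."""
    try:
        return get_image(image_id, images)
    except NotFound:
        caller_log.error("cannot boot: image %s does not exist", image_id)
        return None


def image_is_gone(image_id, images):
    """Caller 2: verifying a deletion, so here the 404 is the success case."""
    try:
        get_image(image_id, images)
        return False
    except NotFound:
        return True
```

The same 404 is an ERROR for one caller and plain success for the other, which is exactly why the client itself cannot pick the level.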
> What worries me way more is that swiftclient actually logs the traceback
> of the 404 call; not so much that there is an error, which could be
> somewhat justifiable:
> http://logs.openstack.org/81/120781/4/check/check-tempest-dsvm-full/90cb640/logs/screen-g-api.txt.gz?level=INFO#_2014-09-15_19_42_27_206
>
> Thanks for broadening the discussion a bit beyond Glance. I think this is
> a topic that should be discussed at a wider scope. If nothing else, we
> need to look into the Logging Standards regarding the points raised here.
Right, the big issue is that the logging conversation can't happen just
in the code: you need to look at real output, which is typically the
aggregate effect of dozens of modules directly writing out log messages.
It's actually a huge architectural issue with Python logging for building
server software that feels even moderately coherent.
-Sean
--
Sean Dague
http://dague.net