[openstack-dev] RFC - Icehouse logging harmonization

John Griffith john.griffith at solidfire.com
Wed Oct 23 19:28:35 UTC 2013


On Wed, Oct 23, 2013 at 1:03 PM, Clark Boylan <clark.boylan at gmail.com>wrote:

> On Wed, Oct 23, 2013 at 11:20 AM, Sean Dague <sean at dague.net> wrote:
> > One of the efforts that we're working on from the QA team is tooling that
> > ensures we aren't stack tracing into our test logs during normal tempest
> > runs. Random stack traces are scary to cloud admins consuming OpenStack
> > logs, and exceptions in the logs should really be exceptional events (and
> > indicative of a failing system), not something that we do by default. Our
> > intent is to gate code on clean logs (no stacktraces) eventually (i.e. if
> > you try to land a patch that causes stack traces in OpenStack, that
> becomes
> > a failing condition), and we've got an incremental white list based
> approach
> > that should let us make forward progress on that. But on that thread -
> >
> http://lists.openstack.org/pipermail/openstack-dev/2013-October/017012.html
> > we exposed another issue... across projects, OpenStack is very
> inconsistent
> > with logging.
> >
> > First... baseline, these are the logging levels that we have in OpenStack
> > today (with numeric values, higher = worse):
> >
> > CRITICAL = 50
> > FATAL = CRITICAL
> > ERROR = 40
> > WARNING = 30
> > WARN = WARNING
> > AUDIT = 21  # invented for oslo-logging
> > INFO = 20
> > DEBUG = 10
> > NOTSET = 0
> >
> > We also have TRACE, which isn't a level per say, it happens at another
> > level. However TRACE is typically an ERROR in the way we use it.
> >
> >
> > Some examples of oddities in the current system (all from a single
> > devstack/tempest run):
> >
> > Example 1:
> > ==============
> >
> >     n-conductor log in tempest/devstack -
> >
> http://logs.openstack.org/70/52870/3/check/check-tempest-devstack-vm-full/f46b756/logs/screen-n-cond.txt.gz
> >
> >     Total log lines: 84076
> >     Total non DEBUG lines: 61
> >
> >     Question: do we need more than 1 level of DEBUG? 3 orders of
> magnitude
> > information change between INFO -> DEBUG seems too steep a cliff.
> >
> > Example 2:
> > ==============
> >
> >     ceilometer-collector -
> >
> http://logs.openstack.org/70/52870/3/check/check-tempest-devstack-vm-full/f46b756/logs/screen-ceilometer-collector.txt.gz
> >
> >     AUDIT log level being used as DEBUG level (even though it's higher
> than
> > INFO).
> >
> > 2013-10-23 12:24:20.093 26234 AUDIT ceilometer.pipeline [-] Flush
> pipeline
> > meter_pipeline
> > 2013-10-23 12:24:20.093 26234 AUDIT ceilometer.pipeline [-] Flush
> pipeline
> > cpu_pipeline
> > 2013-10-23 12:24:20.094 26234 AUDIT ceilometer.pipeline [-] Flush
> pipeline
> > meter_pipeline
> > 2013-10-23 12:24:20.094 26234 AUDIT ceilometer.pipeline [-] Flush
> pipeline
> > cpu_pipeline
> >
> > (this is every second, for most seconds, for the entire run)
> >
> > Example 3:
> > ===============
> >
> >     cinder-api -
> >
> http://logs.openstack.org/70/52870/3/check/check-tempest-devstack-vm-full/f46b756/logs/screen-c-api.txt.gz?level=ERROR
> >     ERROR level being used for 404s of volumes
> >
> > Example 4:
> > ===============
> > glance-api -
> >
> http://logs.openstack.org/70/52870/3/check/check-tempest-devstack-vm-full/f46b756/logs/screen-g-api.txt.gz
> >
> > 2013-10-23 12:23:27.436 23731 ERROR glance.store.sheepdog [-] Error in
> store
> > configuration: Unexpected error while running command.Command: collieExit
> > code: 127Stdout: ''Stderr: '/bin/sh: 1: collie: not found\n'
> > 2013-10-23 12:23:27.436 23731 WARNING glance.store.base [-] Failed to
> > configure store correctly: Store sheepdog could not be configured
> correctly.
> > Reason: Error in store configuration: Unexpected error while running
> > command.Command: collieExit code: 127Stdout: ''Stderr: '/bin/sh: 1:
> collie:
> > not found\n' Disabling add method.
> >
> > part of every single Tempest / Devstack run, even though we aren't
> trying to
> > configure sheepdog in the gate
> >
> >
> > I think we can, and should do better, and started trying to brain dump
> into
> > this etherpad -
> > https://etherpad.openstack.org/p/icehouse-logging-harmonization(examples
> > included).
> >
> > This is one of those topics that I think our current 6 track summit model
> > doesn't make easy address, as we really need general concensus across any
> > project that's using oslo-logging, so I believe mailing list is the
> better
> > option, at least for now.
> >
> >
> > Goals - Short Term
> > ===============
> > As much feedback as possible from both core projects and openstack
> deployers
> > about the kinds of things that they believe we should be logging, and the
> > kinds of levels they think those things should land at.
> >
> > Determining how crazy it is to try to harmonize this across services.
> >
> > Figure out who else wants to help. Where help means:
> >  * helping figure out what's already concensus in services
> >  * helping figure out things that are really aberrant from that concensus
> >  * helping build concensus with various core teams on a common
> >  * helping with contributions to projects that are interested in
> > contributions to move them closer to the concensus
> >
> > Determining if everyone just hates the idea, and I should give up now. :)
> > (That is a valid response to this RFC, feel free to put that out there).
> >
> >
> > Goals - Long Term
> > ===============
> > A set of guidelines on logging standards so that OpenStack as a whole
> feels
> > more whole when it comes to dealing with the log data.
> >
> > These are going to be guidelines, not rules. Some projects are always
> going
> > to have unique needs. But I suspect a lot of developers would be happy
> with
> > a cheet sheet that helps them figure out what log level different things
> are
> > presented at.
> >
> >
> > Comments / rotten fruit are welcomed. Also welcomed are any additional
> POV
> > on this issue.
> >
> >         -Sean
> >
> This is great. In additional to all this I want to point out that the
> TRACE level is also a oslo-logging invention. In general I am not a
> fan of having additional levels as compatibility with python logging
> and associated tools would be a good thing, but I can see their
> usefulness if we need more granularity. Also, this isn't just for
> developers and QA, deployers of OpenStack would benefit tremendously
> from more consistent and useful logging. After throwing together
> logstash filters[1] for the OpenStack service logs produced by
> Devstack in the gate I have found the logs to be quite diverse which
> makes dealing with them in a centralized fashion more difficult. +1
> from me to guidelines which should drive more consistency.
>
> This got better during the Havana cycle, but we can definitely do more
> in Icehouse.
>
> [1]
> https://git.openstack.org/cgit/openstack-infra/config/tree/modules/openstack_project/templates/logstash/indexer.conf.erb#n11
>
> Clark
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
So I touched on this a bit in my earlier post but want to reiterate here
and maybe clarify a bit.  I agree that cleaning up and standardizing the
logs is a good thing, and particularly removing unhandled exception
messages would be good.  What concerns me however is the approach being
taken here of saying things like "Error level messages are banned from
Tempest runs".

The case I mentioned earlier of the negative test is a perfect example.
 There's no way for Cinder (or any other service) to know the difference
between the end user specifying/requesting a non-existent volume and a
valid volume being requested that for some reason can't be found.  I'm not
quite sure how you place a definitive rule like "no error messages in logs"
unless you make your tests such that you never run negative tests?

There are other cases in cinder as well that I'm concerned about.  One
example is iscsi target creation, there are a number of scenarios where
this can fail under certain conditions.  In most of these cases we now have
retry mechanisms or alternate implementations to complete the task.  The
fact is however that a call somewhere in the system failed, this should be
something in my opinion that stands out in the logs.  Maybe this particular
case would be well suited to being a warning other than an error, and
that's fine.  My point however though is that I think some thought needs to
go into this before making blanketing rules and especially gating criteria
that says "no error messages in logs".

John
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstack.org/pipermail/openstack-dev/attachments/20131023/c44dc952/attachment.html>


More information about the OpenStack-dev mailing list