[openstack-dev] os-loganalyze, project log parsing, or ...

Matthew Treinish mtreinish at kortar.org
Tue Sep 27 20:43:45 UTC 2016


On Tue, Sep 27, 2016 at 03:32:17PM -0400, Andrew Laski wrote:
> 
> 
> On Tue, Sep 27, 2016, at 02:40 PM, Matthew Treinish wrote:
> > On Tue, Sep 27, 2016 at 01:03:35PM -0400, Andrew Laski wrote:
> > > 
> > > 
> > > On Tue, Sep 27, 2016, at 12:39 PM, Matthew Treinish wrote:
> > > > On Tue, Sep 27, 2016 at 11:36:07AM -0400, Andrew Laski wrote:
> > > > > Hello all,
> > > > > 
> > > > > Recently I noticed that people would look at logs from a Zuul née
> > > > > Jenkins CI run and comment something like "there seem to be more
> > > > > warnings in here than usual." And so I thought it might be nice to
> > > > > quantify that sort of thing so we didn't have to rely on gut feelings.
> > > > > 
> > > > > So I threw together https://review.openstack.org/#/c/376531 which is a
> > > > > script that lives in the Nova tree, gets called from a devstack-gate
> > > > > post_test_hook, and outputs an n-stats.json file which can be seen at
> > > > > http://logs.openstack.org/06/375106/8/check/gate-tempest-dsvm-multinode-live-migration-ubuntu-xenial/e103612/logs/n-stats.json.
> > > > > This provides just a simple way to compare two runs and spot large
> > > > > changes between them. Perhaps later things could get fancy and these
> > > > > stats could be tracked over time. I am also interested in adding stats
> > > > > for things that are a bit project-specific, like how long (max, min,
> > > > > med) it took to boot an instance, or, probably better to track, how
> > > > > many operations a boot took, for some definition of an operation.
> > > > > 
> > > > > I received some initial feedback that this might be a better fit in
> > > > > the os-loganalyze project, so I cloned it to take a look and quickly
> > > > > noticed
> > > > > http://git.openstack.org/cgit/openstack-infra/os-loganalyze/tree/README.rst#n13.
> > > > > That makes me think it would not be a good fit there because what I'm
> > > > > looking to do relies on parsing the full file, or potentially multiple
> > > > > files, in order to get useful data.
> > > > > 
> > > > > So my questions: does this seem like a good fit for os-loganalyze? If
> > > > > not is there another infra/QA project that this would be a good fit for?
> > > > > Or would people be okay with a lone project like Nova implementing this
> > > > > in tree for their own use?
> > > > > 
> > > > 
> > > > I think having this in os-loganalyze makes sense since we use that for
> > > > visualizing the logs already. It also means we get it for free on all
> > > > the log files. But, if it's not a good fit for a technical reason, then
> > > > I think creating another small tool under QA or infra would be a good
> > > > path forward, since there really isn't anything Nova-specific in it.
> > > 
> > > There's nothing Nova-specific atm because I went for low-hanging fruit.
> > > But if the plan is to have Nova-specific, Cinder-specific, Glance-specific,
> > > etc. things in there, do people still feel that a QA/infra tool is the
> > > right path forward? That's my only hesitation here.
> > 
> > Well, I think that raises more questions: what do you envision the
> > Nova-specific bits would be? The only thing I could see would be something
> > that looks for specific log messages or patterns in the logs, which feels
> > like exactly what elastic-recheck does?
> 
> I'm thinking beyond single line things. An example could be a parser
> that can calculate the timing between the first log message seen for a
> request-id and the last, or could count the number of log lines
> associated with each instance boot perhaps even broken down by log
> level. Things that require both an understanding of how to correlate
> groups of log lines with specific events (instance boot), and being able
> to calculate stats for groups of log lines (debug log line count by
> request-id).
> 
> I have only a rudimentary familiarity with elastic-recheck but my
> understanding is that doing anything that looks at multiple lines like
> that is either complex or not really possible.
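The multi-line correlation described above (grouping lines by request ID, then
timing first to last and counting levels) can be sketched in plain Python. The
timestamp layout and request-id pattern below are illustrative guesses at the
oslo.log default format, not a definitive parser:

```python
# Sketch (not anyone's actual code) of the multi-line correlation described
# above: group oslo.log-style lines by request ID, then report elapsed
# wall-clock time and per-level line counts for each request.
import re
from collections import defaultdict
from datetime import datetime

# Assumed line shape: "<timestamp> <pid> <LEVEL> <module> [req-<id> ...] msg"
LINE_RE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \d+ '
    r'(?P<level>[A-Z]+) \S+ \[(?P<req>req-[0-9a-f-]+)'
)

def request_stats(lines):
    groups = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # continuation lines (tracebacks, etc.) are skipped
        ts = datetime.strptime(m.group('ts'), '%Y-%m-%d %H:%M:%S.%f')
        g = groups.setdefault(
            m.group('req'),
            {'first': ts, 'last': ts, 'levels': defaultdict(int)})
        g['first'] = min(g['first'], ts)
        g['last'] = max(g['last'], ts)
        g['levels'][m.group('level')] += 1
    return {req: {'elapsed': (g['last'] - g['first']).total_seconds(),
                  'levels': dict(g['levels'])}
            for req, g in groups.items()}

sample = [
    '2016-09-27 20:01:00.100 1234 DEBUG nova.compute.manager '
    '[req-aaaa1111 admin admin] starting instance build',
    '2016-09-27 20:01:02.600 1234 INFO nova.compute.manager '
    '[req-aaaa1111 admin admin] instance build finished',
]
print(request_stats(sample))
```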


That's a limitation in the matching for an elastic-recheck fingerprint, because
a document in elasticsearch is a single log line (with the exception of
tracebacks, which are indexed as the full traceback). But if you have multiple
matches in a single log file, there is no reason you can't use elasticsearch to
find any of that. For example, using your paste below would be something like:

https://goo.gl/mS5HyC (you'll probably want to expand the date range)

You can then visualize the different log levels using the UI. (It's on the left.)

I was using elastic-recheck as an example because it's a tool to find
fingerprints in the logs, which is essentially all you're doing (even in the
example below). But what I really meant was the techniques (mainly using
elasticsearch) we use in e-r, because the exact tooling isn't a direct fit
here. That being said, I'm not sure elasticsearch is the best tool for this:
while it's certainly possible to do most of the analysis you're looking to do,
its 10-day data retention would probably be a deal breaker. It would make more
sense if you were trying to trend things across multiple runs, but just for
printing stats about a single run it's probably not a good fit.
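For reference, the query behind a link like that is roughly the following
aggregation. The field names (build_uuid, loglevel) follow the
logstash.openstack.org indexing conventions, but treat the exact schema as an
assumption:

```python
# A rough guess at the kind of elasticsearch query behind the link above:
# count log lines per level for one job run. The field names (build_uuid,
# loglevel) are assumptions based on logstash.openstack.org conventions.
import json

def level_counts_query(build_uuid):
    return {
        'query': {'bool': {'filter': [
            {'term': {'build_uuid': build_uuid}},
        ]}},
        'size': 0,  # we only want the aggregation, not the hits themselves
        'aggs': {'levels': {'terms': {'field': 'loglevel'}}},
    }

print(json.dumps(level_counts_query('<build-uuid>'), indent=2))
```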

> 
> 
> > 
> > I definitely can see the value in having machine-parsable log stats in our
> > artifacts, but I'm not sure where the project-specific pieces would come
> > from. But, given that hypothetical, I would say as long as you made those
> > pieces configurable (like a yaml syntax to search for patterns by log file
> > or something) and kept a generic framework/tooling for parsing the log
> > stats, I think it's still a good fit for a QA or Infra project. Especially
> > if you think whatever pattern you're planning to use is something other
> > projects would want to reuse.
> 
> My concern here is that I want to go beyond simple pattern matching. I
> want to be able to maintain state while parsing to associate log lines
> with events that came before. The project specific bits I envision are
> the logic to handle that, but I don't think yaml is expressive enough
> for it. I came up with a quick example at
> http://paste.openstack.org/show/583160/ . That's Nova specific and
> beyond my capability to express in yaml or elastic-recheck.

That's pretty simple to do with yaml too, especially since it's tied to a
single regex. For example, something roughly like:

http://paste.openstack.org/show/583165/

would use yaml to make it a generic framework. 

That's what I was trying to get across before: what might seem project-specific
is often just a search pattern, or a combination of them, for a log file. These
patterns are generic enough that we can extract out the project-specific pieces
and make them configurable to enable reuse between projects. Especially since
most of the logging format should be standardized. (This would further expose
where things differ, too.)
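A minimal sketch of that yaml-driven framework, with a made-up config schema
(file name, pattern name, regex) rather than the one in the paste:

```python
# Sketch of the yaml-driven idea: each project contributes named regexes
# per log file, and one generic driver counts matches. The config schema
# and the patterns themselves are invented for illustration.
import re

import yaml  # PyYAML

CONFIG = """
n-cpu.log:
  - name: instance_boots
    pattern: 'Took \\d+\\.\\d+ seconds to build instance'
  - name: lock_waits
    pattern: 'Lock ".*" acquired'
"""

def count_patterns(config_yaml, logs):
    """logs maps a log file name to its full text."""
    stats = {}
    for logfile, patterns in yaml.safe_load(config_yaml).items():
        text = logs.get(logfile, '')
        for p in patterns:
            stats['%s:%s' % (logfile, p['name'])] = \
                len(re.findall(p['pattern'], text))
    return stats

sample = {'n-cpu.log': 'Took 12.3 seconds to build instance\n'
                       'Took 9.8 seconds to build instance\n'}
print(count_patterns(CONFIG, sample))
```

More project-specific logic (like the stateful correlation in the earlier
paste) wouldn't fit this schema, but simple per-file counts like these would.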

So yeah I still think this makes sense as a QA or Infra project. Although I'm
wondering if there is a better way to harvest this info from a run.

As an aside it does feel like we should have a better mechanism for most of
this. Like for counting booted instances I normally use the qemu log files, for
example:

http://logs.openstack.org/49/373249/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/b0159bf/logs/libvirt/qemu/

which shows how many instances we booted during that run. But I guess it
really depends on what we're looking for, so there probably isn't one easy
answer.
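For that particular count, a rough sketch is just a glob over that directory.
The instance-*.log naming is what libvirt uses for Nova guests, though a
restarted guest reuses its file, so this counts instances rather than boots:

```python
# Count guests from the libvirt qemu log directory, as described above:
# libvirt writes one instance-*.log per Nova guest. A restarted guest
# appends to its existing file, so this counts instances, not boot events.
import glob
import os

def count_booted_instances(qemu_log_dir='/var/log/libvirt/qemu'):
    return len(glob.glob(os.path.join(qemu_log_dir, 'instance-*.log')))
```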

-Matt Treinish

> > > > 
> > > > I would caution against doing it as a one-off in a project repo; that
> > > > doesn't seem like the best path forward for something like this. We
> > > > actually tried to do something similar in the past inside the tempest
> > > > repo:
> > > > 
> > > > http://git.openstack.org/cgit/openstack/tempest/tree/tools/check_logs.py
> > > > 
> > > > and
> > > > 
> > > > http://git.openstack.org/cgit/openstack/tempest/tree/tools/find_stack_traces.py
> > > > 
> > > > and all it did was cause confusion, because no one knew where the output
> > > > was coming from. Although the output from those tools was also
> > > > misleading, which was likely a bigger problem. So this probably won't be
> > > > an issue if you add a json output to the jobs.
> > > > 
> > > > I also wonder if the JSONFormatter from oslo.log:
> > > > 
> > > > http://docs.openstack.org/developer/oslo.log/api/formatters.html#oslo_log.formatters.JSONFormatter
> > > > 
> > > > would be useful here. We can probably turn that on if it makes things
> > > > easier.
> > > > 
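If JSONFormatter were turned on, the stats script would mostly become a
line-by-line json.loads instead of a pile of regexes. A quick sketch, assuming
the records carry the standard Python logging attribute names (levelname,
etc.), which is what the formatter works from; treat the exact schema as an
assumption:

```python
# Counting log lines per level from JSON-formatted service logs. Each line
# is assumed to be one JSON record with a 'levelname' key, matching the
# standard Python logging record attribute of that name.
import json
from collections import Counter

def level_counts(json_log_lines):
    counts = Counter()
    for line in json_log_lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip any non-JSON noise in the file
        counts[record.get('levelname', 'UNKNOWN')] += 1
    return counts

sample = [
    '{"levelname": "DEBUG", "message": "spawning instance"}',
    '{"levelname": "DEBUG", "message": "plugging vif"}',
    '{"levelname": "WARNING", "message": "instance is slow"}',
]
print(level_counts(sample))  # Counter({'DEBUG': 2, 'WARNING': 1})
```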