[OpenStack-Infra] Log storage/serving

Robert Collins robertc at robertcollins.net
Tue Oct 15 21:26:36 UTC 2013


FWIW, I want to put a testr front-end on all of this to do fault
correlation across test runs. This could be implemented a number of
different ways, but I think the key thing is that we may want to stage
things in a couple of different formats (raw, preprocessed for
correlation, preprocessed for humans).

We should do as little processing of logs on the workers as possible,
because log processing doesn't add value to the pass/fail nature of
the test - and the sooner we free up the node, the sooner we can
spawn another test.

-Rob

On 16 October 2013 09:46, Sean Dague <sean at dague.net> wrote:
> On 10/10/2013 01:42 PM, James E. Blair wrote:
> <snip>
>
>> Okay, let me try to summarize current thinking:
>>
>> * We want to try to avoid writing a tool that receives logs because
>>    swift provides most/all of the needed functionality.
>>    * The swift tempurl middleware will allow us to have the client
>>      directly PUT files in swift using an HMAC-signed token (a minimal
>>      signing sketch follows this list).
>>    * This means any pre-processing of logs would need to happen in the
>>      log-uploading client or via some unspecified event trigger.
>>
>> * We may or may not want a log-serving app.
>>    * We're doing neat things like filtering on level and html-ifying logs
>>      as we serve them with our current log-serving app.
>>    * We could probably do that processing pre-upload (including embedding
>>      javascript in html pages to do the visual filtering) and then we
>>      could serve static pages instead.
>>    * A log serving app may be required to provide some kinds of indexes.
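>>
>>    For reference, a minimal sketch of that tempurl signing (the key,
>>    account, and container names are made up; only the HMAC recipe is
>>    the actual tempurl algorithm):
>>
>>        import hmac
>>        from hashlib import sha1
>>        from time import time
>>
>>        key = 'tempurl-secret'        # X-Account-Meta-Temp-URL-Key (made up)
>>        method = 'PUT'
>>        expires = int(time()) + 3600  # signature valid for one hour
>>        # Hypothetical account/container; object path from the example below.
>>        path = ('/v1/AUTH_demo/logs'
>>                '/95/50795/4/check/check-grenade-devstack-vm/3c17e3c/console.html')
>>
>>        # tempurl signs "METHOD\nEXPIRES\nPATH" with HMAC-SHA1.
>>        sig = hmac.new(key, '%s\n%s\n%s' % (method, expires, path),
>>                       sha1).hexdigest()
>>        put_url = ('https://swift.example.com%s?temp_url_sig=%s'
>>                   '&temp_url_expires=%s' % (path, sig, expires))
>>        # The job can now upload with a plain HTTP PUT to put_url.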
>>
>> So to decide on the log-serving app, we need to figure out:
>>
>> 1) What do we want out of indexes?
>>
>> Let's take a current example log path:
>>
>>
>> http://logs.openstack.org/95/50795/4/check/check-grenade-devstack-vm/3c17e3c/console.html
>>
>> Ignoring the change[:-2] at the beginning since it's an implementation
>> artifact, that's basically:
>>
>>    /change/patchset/pipeline/job/run[random]/
>>
>> The upload script can easily handle creating index pages below that
>> point.  But since it runs in the context of a job run, it can't create
>> index pages above that (besides the technical difficulty, we don't want
>> to give it permission outside of its run anyway).  So I believe that
>> without a log-receiving app, our only options are:
>>
>>    a) Use the static web swift middleware to provide indexes.  Due to the
>>    intersection of this feature, CDN, and container sizes with our
>>    current providers, this is complicated and we end up at a dead end
>>    every time we talk through it.
>>
>>    b) Use a log-serving application to generate index pages where we need
>>    them.  We could do this by querying swift (a listing sketch follows
>>    this list).  If we eliminate the ability to list ridiculously large
>>    indexes (like all changes, etc.) and restrict it down to the level of,
>>    say, a single change, then this might be manageable.  However, swift
>>    may still have to perform a large query to get us down to that level.
>>
>>    c) Reduce the discoverability of test runs.  We could actually just
>>    collapse the whole path into a random string and leave that as a
>>    comment in Gerrit.  Users would effectively never be able to discover
>>    any runs other than the final ones that are reported in Gerrit, and
>>    even comparing runs for different patchsets would involve looking up
>>    the URL for each in the respective Gerrit comments.  Openstack-infra
>>    tools, such as elastic-recheck, could still discover other runs by
>>    watching for ZMQ or Gearman events.
>>
>>    This would make little difference to most end-users as well as project
>>    tooling, but it would make it a little harder to develop new project
>>    tooling without access to that event stream.
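>>
>> For a sense of the swift query in (b), a minimal sketch using
>> python-swiftclient (credentials and container name are made up; the
>> point is the prefix/delimiter listing):
>>
>>     from swiftclient import client
>>
>>     # Hypothetical connection details.
>>     conn = client.Connection(authurl='https://swift.example.com/auth/v1.0',
>>                              user='logs', key='secret')
>>     # delimiter='/' makes swift return one 'subdir' marker per run
>>     # instead of every object under the prefix, so an index for a
>>     # single change stays a small listing.
>>     headers, entries = conn.get_container('logs', prefix='95/50795/',
>>                                           delimiter='/')
>>     for entry in entries:
>>         print(entry.get('subdir') or entry['name'])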
>>
>> Honestly, option C is growing on me, but I'd like some more feedback on
>> that.
>>
>> 2) What do we want out of processing?
>>
>> Currently we HTMLify and filter logs by log level at run-time when
>> serving them.  I think our choices are:
>>
>>    a) Continue doing this -- this requires a log-serving app that will
>>    fetch logs from swift, process them, and serve them.
>>
>>    b) Pre-process logs before uploading them.  HTMLify and add
>>    client-side javascript line-level filtering.  The logstash script may
>>    need to do its own filtering since it won't be running a javascript
>>    interpreter, but it could probably still do so based on metadata
>>    encoded into the HTML by the pre-processor.  Old logs won't benefit
>>    from new features in the pre-processor though (unless we really feel
>>    like batch-reprocessing).
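>>
>>    A minimal sketch of the sort of pre-processing (b) implies, with the
>>    level encoded as a CSS class so both the client-side javascript and
>>    logstash could key off it (the regex and level names are
>>    illustrative):
>>
>>        import cgi
>>        import re
>>
>>        LEVEL_RE = re.compile(r'\b(DEBUG|INFO|AUDIT|TRACE|WARNING|ERROR)\b')
>>
>>        def htmlify(lines):
>>            # Tag each line with its level; lines with no level of
>>            # their own (tracebacks etc.) fall back to NONE here.
>>            for line in lines:
>>                m = LEVEL_RE.search(line)
>>                level = m.group(1) if m else 'NONE'
>>                yield '<span class="%s">%s</span>\n' % (
>>                    level, cgi.escape(line.rstrip()))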
>>
>> I think the choices of 1c and 2b get us out of the business of running
>> log servers altogether and move all the logic and processing to the
>> edges.  I'm leaning toward them for that reason.
>
>
> I'm completely indifferent to how storage and upload happen. Filesystem or
> swift, it's all good to me.
>
> However, my experience writing htmlify-screen-log.py and its maturation into
> openstack-infra/os-loganalyze, plus the fact that I probably spend more time
> staring at devstack/tempest logs than just about anyone, have given me a
> couple of thoughts on log-serving.
>
> Our logs are kind of interesting beasts. We have a few different formats,
> and we've got a number of different consumers. There are some real niceties
> to putting a dynamic layer between the raw logs and the consumer:
>
> 1) HTTP negotiation - thanks to both our wsgi app and mod_deflate, we are
> able to do content negotiation with the client and serve it the appropriate
> data. This means today you get the text/html version if your client supports
> it, and a text/plain version if it doesn't. The content is also compressed
> on the wire, automatically, based on your client's ability to handle the
> compression.
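>
> A simplified sketch of what that negotiation looks like (this is not
> the actual os-loganalyze code, and the log source is made up):
>
>     def application(environ, start_response):
>         # Serve html only when the client advertises it in Accept;
>         # otherwise fall back to plain text. Compression is handled
>         # separately by mod_deflate, keyed on Accept-Encoding.
>         raw = open('console.log').read()  # stand-in for the real log
>         if 'text/html' in environ.get('HTTP_ACCEPT', ''):
>             start_response('200 OK', [('Content-Type', 'text/html')])
>             return ['<pre>%s</pre>' % raw]
>         start_response('200 OK', [('Content-Type', 'text/plain')])
>         return [raw]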
>
> 2) Dynamic Filtering - we added the level= parameter to the wsgi script to
> speed up logstash indexing, as it turns out that python is far faster at
> throwing away DEBUG lines than logstash was. It turns out people love it
> too, because
> http://logs.openstack.org/70/44670/3/gate/gate-tempest-devstack-vm-neutron/02d68e3/logs/screen-q-svc.txt.gz?level=TRACE
> loads super quickly and lets you see where the top issues are.
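>
> The filtering itself is basically a one-pass generator, something like
> this sketch (the severity ordering here is one plausible choice, not
> necessarily what the wsgi script actually ships):
>
>     import re
>
>     LEVEL_RE = re.compile(r'\b(DEBUG|INFO|AUDIT|TRACE|WARNING|ERROR)\b')
>     SEVS = ['NONE', 'DEBUG', 'INFO', 'AUDIT', 'TRACE', 'WARNING', 'ERROR']
>
>     def filter_level(lines, minimum='TRACE'):
>         # Drop lines below the requested severity; continuation lines
>         # (tracebacks etc.) inherit the last level seen.
>         floor = SEVS.index(minimum)
>         current = 0
>         for line in lines:
>             m = LEVEL_RE.search(line)
>             if m:
>                 current = SEVS.index(m.group(1))
>             if current >= floor:
>                 yield line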
>
> There are a few other interesting facts that we discovered in this process -
> n-cpu on a nova-network run comes in at about 5 MB gzipped (40 MB
> uncompressed) of html once we do our filtering on it. If you are running a
> browser other than Chrome on a nice Intel chip, life isn't good. A future
> enhancement here is to be nicer to people and disable DEBUG by default if
> the file size is too big.
>
> A 40 MB html file means that client-side filtering would be problematic.
> First off, you need to take the huge network hit anyway; secondly, I expect
> DOM manipulation at that level of complexity would give even Chrome a run
> for its money.
>
> And then there is just the nice idea of keeping the raw artifact and the
> presentation layer separate. The fact that we can update our presentation
> filter and make last week's logs, which we are still using to debug issues,
> easier to read is a good thing.
>
> So regardless of the eventual solution here, I *really* want the ability to
> have a presentation-layer filter between the raw logs and the clients. HTTP
> has so many nice negotiation features worked into the spec, which we're
> actually using today and which make life easier for folks. And I'd really
> hate to lose that.
>
> So 2a has a strong vote from me.
>
>         -Sean
>
> --
> Sean Dague
> http://dague.net
>
>



-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud


