[OpenStack-Infra] Log storage/serving

jeblair at openstack.org
Tue Sep 10 21:54:46 UTC 2013


Hi,

We've had a few conversations recently in various fora about log
storage, so I thought it'd be a good idea to write down some ideas.

The current state is that Jenkins uses SCP to copy files to
static.openstack.org, which has an Apache vhost for logs.openstack.org.
There's a really big filesystem, and we use Apache mod_autoindex to
automatically serve directory indexes.  The destination log paths are
calculated in advance by Zuul (actually in a custom parameter function
defined by our configuration -- Zuul itself knows nothing about this),
they are passed to Jenkins as a parameter, and the same paths are used
to build the URL left in the review text in Gerrit.
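
To make that concrete, the path calculation is roughly along these lines
(this is only an illustration -- the names and exact layout here are
invented; the real logic lives in a custom parameter function in our
configuration):

  # Illustrative sketch only; the real parameter function and path
  # layout live in our config, and the names here are invented.
  def log_path(change, patchset, pipeline, job, build):
      # e.g. change 12345, patchset 2, check pipeline ->
      #      45/12345/2/check/gate-tempest/<build>
      return '%s/%s/%s/%s/%s/%s' % (
          change[-2:], change, patchset, pipeline, job, build)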

This means we have to maintain a very large filesystem (we use Cinder
volumes with LVM, so it's not so bad), but it's still not very cloudy
and requires occasional manual work.  Swift is an obvious candidate for
storing this sort of thing.

The reason it was built this way instead of using Swift is simply time:
SCP and mod_autoindex already existed.  Swift (at least the two
implementations we have access to) is not great at calculating and
serving indexes of stored objects -- so _something_ needs to be written
in order to use Swift (either index pages that we generate for the log
files, or an application that stores logs in Swift, retrieves them, and
serves them over the web).

I like the approach of having an application store and retrieve log
data.  It would accomplish a number of goals:

* By using something other than SCP, we can reduce the access needed by
  the worker.  Currently Jenkins can write to anywhere in the log
  filesystem, and we just count on the integrity of the Jenkins master
  to prevent abuse of that privilege.

* A log-receiving mechanism with tighter access controls means that we
  could use a different kind of worker (something without the
  master/slave separation that Jenkins has) so that the job itself could
  upload its own logs.

* A log-receiver could pre-process logs (compression, highlighting,
  shipping to logstash, etc).

* The log-receiving and log-serving application(s) would be horizontally
  scalable (static.o.o has been and could again be a bottleneck).

* The log-serving application could also do any processing before
  serving.

* Finally, all of this is actually fairly generalizable to artifact
  processing, such as tarballs, so we should probably switch to calling
  it artifact storage and retrieval.

Sean Dague recently wrote a mod_python script that turns some OpenStack
log files into HTML with syntax highlighting and links:

  http://git.openstack.org/cgit/openstack-infra/config/tree/modules/openstack_project/files/logs/htmlify-screen-log.py

This seems like it could be a good starting point, as it actually
addresses one of the points in the above list.

Here's how I think we could get from where we are to where we want to be:

1) Have Zuul generate a token (suggestion: an HMAC signature using a
shared secret) that can later be used to determine what kinds of
artifacts a job should be permitted to store, and where they can be
stored.  E.g., a token might say that this run of gate-tempest can store
artifacts to the logs container at '.../gate-tempest/1234' for the next
6 hours.  Another job might get a token (or multiple tokens) saying it
can store logs as well as a tarball.

This way even a completely untrusted worker can store artifacts because
the token (which is effectively public) is scoped to only what the job
needs.  This could be done entirely with a custom parameter function
(just as the log paths are currently calculated) without any changes to
Zuul itself, or we could extend Zuul to natively support this concept.
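
As a rough sketch of what generating such a token could look like (the
format, field names, and secret handling here are all assumptions, not a
worked-out design):

  import hashlib
  import hmac
  import time

  SECRET = 'shared-between-zuul-and-the-receiver'  # hypothetical secret

  def make_token(scope, path, ttl=6 * 3600):
      # scope: e.g. 'logs'; path: e.g. 'gate-tempest/1234'
      expiry = int(time.time()) + ttl
      msg = '%s|%s|%d' % (scope, path, expiry)
      sig = hmac.new(SECRET.encode(), msg.encode(),
                     hashlib.sha256).hexdigest()
      return '%s|%s' % (msg, sig)

The token itself is just the signed message, so it can be handed to the
job as an ordinary parameter.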

2) Write a program (or extend the mod_python script) that accepts
artifacts over HTTP along with a token.  It would then write them to the
filesystem as we do now.  It can validate the token offline using the
secret shared with Zuul, and it could also invalidate the token after
use.
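
The corresponding offline validation on the receiver could look roughly
like this (continuing the made-up token format from the sketch above):

  import hashlib
  import hmac
  import time

  SECRET = 'shared-between-zuul-and-the-receiver'  # same hypothetical secret

  def validate_token(token):
      # Returns (scope, path) if the signature checks out and the token
      # has not expired; None otherwise.
      try:
          scope, path, expiry, sig = token.split('|')
      except ValueError:
          return None
      msg = '%s|%s|%s' % (scope, path, expiry)
      expected = hmac.new(SECRET.encode(), msg.encode(),
                          hashlib.sha256).hexdigest()
      if not hmac.compare_digest(expected, sig):
          return None
      if int(expiry) < time.time():
          return None
      return (scope, path)

Invalidating a token after use would need a small amount of shared state
(e.g. a table of consumed tokens), which the offline check alone doesn't
give us.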

3) Write a script that we can invoke from within our Jenkins jobs to use
the token to upload artifacts.  Other non-Jenkins workers can use the
same protocol to upload their artifacts.
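
For instance, the upload helper might not need to be much more than this
(the URL scheme and header name are assumptions, not a defined protocol;
from a shell job it could just as easily be a curl invocation):

  import requests

  def upload_artifact(receiver_url, token, local_path, remote_name):
      # PUT the file to the receiver, presenting the job's token.
      with open(local_path, 'rb') as f:
          resp = requests.put(
              '%s/%s' % (receiver_url.rstrip('/'), remote_name),
              data=f,
              headers={'X-Artifact-Token': token})
      resp.raise_for_status()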

4) Write a program (or extend the mod_python script) that accepts
requests (using the same URL format) and reads the files from the
filesystem and serves them.
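
A minimal first cut at the serving side could be a small WSGI app that
just maps the request path onto the existing log filesystem (the root
path is a placeholder, and a real version would need to guard against
path traversal, set content types, handle directories, and so on):

  import os
  from wsgiref.simple_server import make_server

  LOG_ROOT = '/srv/static/logs'  # placeholder

  def app(environ, start_response):
      # Map the URL path onto the log filesystem and serve the file.
      path = os.path.normpath(environ['PATH_INFO']).lstrip('/')
      full = os.path.join(LOG_ROOT, path)
      if not os.path.isfile(full):
          start_response('404 Not Found', [('Content-Type', 'text/plain')])
          return [b'Not found\n']
      start_response('200 OK', [('Content-Type', 'text/plain')])
      with open(full, 'rb') as f:
          return [f.read()]

  if __name__ == '__main__':
      make_server('', 8000, app).serve_forever()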

5) Extend the artifact-serving program in #4 so that it first checks a
MySQL database (we can use Trove to provide the DB) for each request; if
it finds the item, it serves it from Swift.  If the request is for a
directory instead of a file, it uses the database to calculate the
index, generates an index page, and serves it.  If the item is not found
in the DB, it fetches it from disk.  If it's a directory that isn't in
the DB, it generates an index based on the filesystem directory
contents.
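
The lookup order could be expressed as something like the following,
where 'db' stands in for the MySQL index (the structure of that index,
and everything else here, is a sketch rather than a schema):

  import os

  LOG_ROOT = '/srv/static/logs'  # placeholder

  def locate(path, db):
      # db: maps artifact paths to (container, object_name) tuples for
      # files, and to lists of child names for directories.
      if path in db:
          entry = db[path]
          if isinstance(entry, list):
              return ('index', entry)    # build the index page from the DB
          return ('swift', entry)        # fetch the object from Swift
      full = os.path.join(LOG_ROOT, path)
      if os.path.isdir(full):
          return ('index', sorted(os.listdir(full)))
      return ('disk', full)              # fall back to the filesystem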

6) Extend the artifact-storing program in #2 to optionally store the
artifacts in Swift instead of on the filesystem.
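
On the storage side, python-swiftclient should cover the basics; a rough
sketch (the credentials, container, and naming here are placeholders):

  from swiftclient.client import Connection

  def store_in_swift(container, name, local_path):
      # Placeholder credentials; a real deployment would get these from
      # configuration, not code.
      conn = Connection(
          authurl='https://identity.example.com/v2.0',
          user='artifact-receiver',
          key='secret',
          tenant_name='infra',
          auth_version='2.0')
      with open(local_path, 'rb') as f:
          conn.put_object(container, name, contents=f,
                          content_type='text/plain')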

I think that approach gives us a reasonably secure system, and the
stepwise nature means that we can test each component in turn, and
provide a smooth transition.

Some variants to consider:

  * The token system doesn't have to be HMAC-based; there's lots of
    stuff out there.  We could do online validation with Zuul instead of
    a shared secret, for instance.

  * Skipping the phased implementation and just doing a cutover with
    downtime (and bulk-importing the old data).

  * Also, it would be nice to make pre and post processing easily
    pluggable and configurable early on; there's no telling what we may
    want to do in the future.

I think that about encompasses the ideas and conversations I've had
around the subject.  Any thoughts?

-Jim


