[OpenStack-Infra] Moving logs into swift (redux)

James E. Blair corvus at inaugust.com
Mon Jul 16 22:27:10 UTC 2018


Hi,

As you may know, all of the logs from Zuul builds are currently uploaded
to a single static fileserver with about 14TB of storage available in
one large filesystem.  This was easy to set up, but scales poorly, and
we live in constant fear of filesystem corruption necessitating a
lengthy outage for repair or loss of data (an event which happens, on
average, once or twice a year and takes several days to resolve).

Our most promising approaches to solving this involve moving log storage
to swift.  We (mostly Joshua) have done considerable work in the past
but kept hitting blockers.  I think the situation has changed enough
that the issues we hit before won't be a problem now.  I believe we can
use this work as a foundation to, relatively quickly, move our log
storage into swift.  Once there, there are a number of possibilities
to improve the experience around logs and artifacts in Zuul in
general.

This email is going to focus mostly on how OpenStack Infra can move our
current log storage and hosting to swift.  I will follow it up with an
email to the zuul-discuss list about further work that we can do that's
more generally applicable to all Zuul users.

This email is the result of a number of previous discussions, especially
with Monty, and many of the ideas here are his.  It also draws very
heavily on Joshua's previous work.  Here's the general idea:

Pre-generate any content for which we currently rely on middleware
running on logs.openstack.org.  Then upload all of that to swift.
Return a direct link to swift for serving the content.

In more detail:

In addition to using swift as the storage backend, we would also like to
avoid running a server as an intermediary.  This is one of the obstacles
we hit last time.  We started to make os-loganalyze (OSLA) a smart proxy
which could serve files from disk and swift.  It threatened to become
very complicated and tax the patience of OSLA's reviewers.  OSLA's
primary author and reviewer isn't really around anymore, so I suspect
the appetite for major changes to OSLA is even less than it may have
been in the past (we have merged 2 changes this year so far).

There are three kinds of automatically generated content on logs.o.o:

* Directory indexes
* OSLA HTMLification of logs
* ARA

If we pre-generate all of those, we don't need any of the live services
on logs.o.o.  Joshua's zuul_swift_upload script already generates
indexes for us.  OSLA can already be used to HTMLify files statically.
And ARA has a mode to pre-generate its output as well (which we used
previously until we ran out of inodes).  So today, we basically have
what we need to pre-generate this data and store it in swift.
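As an illustration of the directory index piece, the core of what
such pre-generation does could be sketched like this (a minimal
stand-in, not the actual zuul_swift_upload code):

  import os

  def write_index(path):
      # Emit a static index.html listing the directory contents, so
      # no server-side autoindex is needed once the tree is in swift.
      entries = sorted(os.listdir(path))
      with open(os.path.join(path, 'index.html'), 'w') as f:
          f.write('<html><body><ul>\n')
          for name in entries:
              if name == 'index.html':
                  continue
              f.write('<li><a href="%s">%s</a></li>\n' % (name, name))
          f.write('</ul></body></html>\n')

The real script also records file sizes and dates and recurses into
subdirectories, but the principle is the same.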

Another issue we ran into previously was the transition from filesystem
storage to swift.  This was because in Zuul v2, we could not dynamically
change the log reporting URL.  However, in Zuul v3, since the job itself
reports the final log URL, we can handle the transition by creating new
roles to perform the swift upload and return the swift URL.  We can
begin by using these roles in a new base job so that we can verify
correct operation.  Then, when we're ready, we can switch the default
base job.  All jobs which upload logs to swift will report the new swift
URL; the existing logs.o.o URLs will continue to function until they age
out.

The Zuul dashboard makes finding the location of logs for jobs
(especially post jobs) simpler.  So we no longer need logs.o.o to find
the storage location (files or swift) for post jobs -- a user can just
follow the link from the build history in the dashboard.

Finally, the Apache config (and to some degree, OSLA middleware) handles
compression.  Ultimately, we don't actually care if the files are
compressed in storage.  That's an implementation detail (which we care
about now because we operate the storage).  But it's not a user
requirement.  In fact, what we want is for jobs to produce logs in
whatever format they want (plain text, journal, etc).  We want to store
those.  And we want to present them to the user in the original format.
Right now we compress many of them before we upload them to the log
server because, lacking a dedicated upload handler on the log server,
there's no other way to cause them to be stored compressed.

If we're relieved of that burden, then the only thing we really care
about is transfer efficiency.  We should be able to upload files to
swift with Content-Encoding: gzip, and, likewise, users should be able
to download files with Accept-Encoding: gzip.  We should be able to have
efficient transfer without having to explicitly compress and rename
files.  Our first usability win.
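As a rough sketch of what that upload step could look like with
openstacksdk (assuming the object_store proxy accepts content_type
and content_encoding and passes them through as object headers; the
cloud name and function here are illustrative):

  import gzip
  import openstack

  conn = openstack.connect(cloud='logs')  # hypothetical clouds.yaml entry

  def upload_log(container, name, path):
      # Compress only for transfer/storage efficiency; with
      # Content-Encoding: gzip set, the object is still presented to
      # the user as plain text in its original format.
      with open(path, 'rb') as f:
          data = gzip.compress(f.read())
      conn.object_store.upload_object(
          container=container,
          name=name,
          data=data,
          content_type='text/plain',
          content_encoding='gzip',
      )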

The latest version of the zuul_swift_upload script uses the swift
tempurl functionality to upload logs.  This is because it was designed
to run on untrusted nodes.  A closer analog to our current Zuul v3 log
upload system would be to run the uploader on the executor, giving it a
real swift credential.  It can then upload logs to swift in the normal
manner, rather than via tempurl.  It can also create containers as
needed -- another consideration from our earlier work.  By default, it
could avoid creating containers, but we could configure it to create,
say, a container for each first-level directory of our sharding
scheme.  This could be a general feature of the role that allows for
per-site customization.
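For example, the per-site container handling might look something
like this (reusing the conn from the sketch above; the sharding and
naming here are assumptions, not a settled design):

  def container_for(build_uuid, create=False):
      # Our current on-disk sharding uses the leading characters of
      # the build UUID; map that first level to a container name.
      container = 'logs_%s' % build_uuid[:2]
      if create:
          # A PUT on an existing container is a no-op in swift, so
          # this is safe to call unconditionally when enabled.
          conn.object_store.create_container(name=container)
      return container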

I think that's the approach we should start with, because it will be the
easiest transition from our current scheme.  However, in the future, we
can move to having the uploads occur from the test nodes themselves
(rather than, or in addition to, the executor), by having a two-part
system.  The first part runs on the executor in a trusted context and
creates any containers needed, then generates a tempurl, and uses that
to have the worker nodes upload to the container directly.  I only
mention this to show that we're not backing ourselves permanently into
executor-only uploads.  But we shouldn't consider this part of the first
phase.
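The trusted half of such a system stays small, because swift's
tempurl signature is just an HMAC-SHA1 over the method, expiry, and
object path.  A sketch using only the standard library, following
swift's documented tempurl scheme:

  import hmac
  import time
  from hashlib import sha1

  def make_temp_url(key, method, path, ttl):
      # path is the object path, e.g. '/v1/AUTH_acct/container/obj'.
      expires = int(time.time()) + ttl
      body = '%s\n%s\n%s' % (method, expires, path)
      sig = hmac.new(key.encode(), body.encode(), sha1).hexdigest()
      return '%s?temp_url_sig=%s&temp_url_expires=%s' % (
          path, sig, expires)

The executor would generate one of these with method PUT and hand it
to the worker node, which can then upload without holding any
credential.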

We have also discussed using multiple swifts.  It may be easiest to
start with one, but in a future where we have executor affinity in Zuul,
we may want to upload to the nearest swift.  In that case, we can
modify the role to support multiple swifts rather than a single one,
and use the executor affinity information to determine whether there
is a swift colocated in the executor's cloud, falling back to a
default if not.  This way we can use multiple swifts as they become
available, but not require them.
but not require them.
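In code, that selection could be as simple as a mapping with a
fallback; a sketch (the cloud and swift names are made up):

  # Hypothetical per-site configuration: cloud region -> swift to use.
  SWIFT_FOR_CLOUD = {
      'cloud-a': 'swift-a',
      'cloud-b': 'swift-b',
  }
  DEFAULT_SWIFT = 'swift-a'

  def pick_swift(executor_cloud):
      # Prefer a swift colocated with the executor; otherwise fall
      # back to the default.
      return SWIFT_FOR_CLOUD.get(executor_cloud, DEFAULT_SWIFT)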

To summarize: static generation combined with a new role to upload to
swift using openstacksdk should allow us to migrate to swift fairly
quickly.  Once there, we can work on a number of enhancements which I
will describe in a followup post to zuul-discuss.

-Jim


