[OpenStack-Infra] Moving logs into swift (redux)

Clark Boylan cboylan at sapwetik.org
Mon Jul 16 22:50:13 UTC 2018


On Mon, Jul 16, 2018, at 3:27 PM, James E. Blair wrote:
> Hi,
> 
> As you may know, all of the logs from Zuul builds are currently uploaded
> to a single static fileserver with about 14TB of storage available in
> one large filesystem.  This was easy to set up, but scales poorly, and
> we live in constant fear of filesystem corruption necessitating a
> lengthy outage for repair or loss of data (an event which happens, on
> average, once or twice a year and takes several days to resolve).
> 
> Our most promising approaches to solving this involve moving log storage
> to swift.  We (mostly Joshua) have done considerable work in the past
> but kept hitting blockers.  I think the situation has changed enough
> that the issues we hit before won't be a problem now.  I believe we can
> use this work as a foundation to, relatively quickly, move our log
> storage into swift.  Once there, there's a number of possibilities to
> improve the experience around logs and artifacts in Zuul and in general.
> 
> This email is going to focus mostly on how OpenStack Infra can move our
> current log storage and hosting to swift.  I will follow it up with an
> email to the zuul-discuss list about further work that we can do that's
> more generally applicable to all Zuul users.
> 
> This email is the result of a number of previous discussions, especially
> with Monty, and many of the ideas here are his.  It also draws very
> heavily on Joshua's previous work.  Here's the general idea:
> 
> Pre-generate any content for which we currently rely on middleware
> running on logs.openstack.org.  Then upload all of that to swift.
> Return a direct link to swift for serving the content.
> 
> In more detail:
> 
> In addition to using swift as the storage backend, we would also like to
> avoid running a server as an intermediary.  This is one of the obstacles
> we hit last time.  We started to make os-loganalyze (OSLA) a smart proxy
> which could serve files from disk and swift.  It threatened to become
> very complicated and tax the patience of OSLA's reviewers.  OSLA's
> primary author and reviewer isn't really around anymore, so I suspect
> the appetite for major changes to OSLA is even less than it may have
> been in the past (we have merged 2 changes this year so far).
> 
> There are three kinds of automatically generated content on logs.o.o:
> 
> * Directory indexes
> * OSLA HTMLification of logs
> * ARA
> 
> If we pre-generate all of those, we don't need any of the live services
> on logs.o.o.  Joshua's zuul_swift_upload script already generates
> indexes for us.  OSLA can already be used to HTMLify files statically.
> And ARA has a mode to pre-generate its output as well (which we used
> previously until we ran out of inodes).  So today, we basically have what
> we need to pre-generate this data and store it in swift.
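
For anyone who hasn't looked at that code, the index pre-generation amounts to something like the following. This is only a sketch with a hypothetical local path, not the actual zuul_swift_upload implementation:

    import html
    import os

    def write_indexes(root):
        # Walk the job's log tree and emit a static index.html per
        # directory so no server-side autoindex is needed after upload.
        for dirpath, dirnames, filenames in os.walk(root):
            entries = sorted(d + '/' for d in dirnames) + \
                sorted(f for f in filenames if f != 'index.html')
            items = '\n'.join(
                '<li><a href="%s">%s</a></li>' % (html.escape(e),
                                                  html.escape(e))
                for e in entries)
            with open(os.path.join(dirpath, 'index.html'), 'w') as out:
                out.write('<html><body><ul>\n%s\n</ul></body></html>\n'
                          % items)

    write_indexes('/tmp/logs')  # hypothetical local log directory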

Couple of thoughts about this and ARA specifically. ARA static generation easily produces tens of thousands of files. Copying that many small files to the log server with rsync was often quite slow; on the order of 10 minutes for some jobs, if my fuzzy memory serves. I am concerned that HTTP uploads to $swift service will have similar problems with many small files. This is something we should test.
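
As a starting point for that testing, something along these lines would give us a rough number for many small objects. The cloud name and container are placeholders (and the container is assumed to already exist); this assumes openstacksdk with a matching clouds.yaml entry:

    import concurrent.futures
    import time

    import openstack

    # One connection shared across threads keeps the sketch short; a
    # real test may want one connection per worker.
    conn = openstack.connect(cloud='logcloud')   # placeholder cloud name

    def upload(i):
        conn.object_store.upload_object(
            container='logs-test',               # placeholder container
            name='ara-report/result-%05d.html' % i,
            data=b'x' * 2048)                    # ~2KB, like small ARA files

    start = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(upload, range(10000)))
    print('10000 objects uploaded in %.1fs' % (time.monotonic() - start))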

Also, while swift doesn't have inode problems for the end user to worry about, it apparently does have practical limits on the number of objects per container. One of the issues we had in the past, particularly with the swift we had access to, was that containers were not directly accessible by default; you had to configure CDN distribution of each container to make it publicly visible. This made creating many containers to shard the objects more complicated than we had hoped. All this to say we may still have to solve the "inode" problem, just within the context of swift: sharding across containers, creating them, and making them visible.
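
For reference, on a swift that honors the stock container ACLs (unlike the CDN-only public access we hit before), making a container publicly readable is a single header. The endpoint and token below are placeholders; a real uploader would get them from keystone auth:

    import requests

    storage_url = 'https://swift.example.com/v1/AUTH_project'  # placeholder
    token = 'TOKEN'                                            # placeholder

    # Create the container and mark it world readable (with listings)
    # in one request.
    requests.put(
        storage_url + '/logs_ab',
        headers={'X-Auth-Token': token,
                 'X-Container-Read': '.r:*,.rlistings'})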

We should do our best to test both of these items and/or follow up with whichever cloud hosts the containers to make sure we aren't missing anything else (possible object creation rate limits for example).

> 
> Another issue we ran into previously was the transition from filesystem
> storage to swift.  This was because in Zuul v2, we could not dynamically
> change the log reporting URL.  However, in Zuul v3, since the job itself
> reports the final log URL, we can handle the transition by creating new
> roles to perform the swift upload and return the swift URL.  We can
> begin by using these roles in a new base job so that we can verify
> correct operation.  Then, when we're ready, we can switch the default
> base job.  All jobs which upload logs to swift will report the new swift
> URL; the existing logs.o.o URLs will continue to function until they age
> out.
> 
> The Zuul dashboard makes finding the location of logs for jobs
> (especially post jobs) simpler.  So we no longer need logs.o.o to find
> the storage location (files or swift) for post jobs -- a user can just
> follow the link from the build history in the dashboard.
> 
> Finally, the apache config (and to some degree, OSLA middleware) handles
> compression.  Ultimately, we don't actually care if the files are
> compressed in storage.  That's an implementation detail (which we care
> about now because we operate the storage).  But it's not a user
> requirement.  In fact, what we want is for jobs to produce logs in
> whatever format they want (plain text, journal, etc).  We want to store
> those.  And we want to present them to the user in the original format.
> Right now we compress many of them before we upload them to the log
> server because, lacking a dedicated upload handler on the log server,
> there's no other way to cause them to be stored compressed.
> 
> If we're relieved of that burden, then the only thing we really care
> about is transfer efficiency.  We should be able to upload files to
> swift with Content-Encoding: gzip, and, likewise, users should be able
> to download files with Accept-Encoding: gzip.  We should be able to have
> efficient transfer without having to explicitly compress and rename
> files.  Our first usability win.
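
At the raw swift API level that upload would look something like the sketch below. The endpoint, token, and paths are placeholders, and the real role would authenticate properly rather than hard code a token:

    import gzip

    import requests

    storage_url = 'https://swift.example.com/v1/AUTH_project'  # placeholder
    token = 'TOKEN'                                            # placeholder

    # Store the log gzip-compressed but label it with its real
    # Content-Type plus Content-Encoding: gzip, so the user sees
    # "console.log" rather than "console.log.gz".
    with open('console.log', 'rb') as f:
        body = gzip.compress(f.read())

    requests.put(
        storage_url + '/logs_ab/12/345/console.log',
        data=body,
        headers={
            'X-Auth-Token': token,
            'Content-Type': 'text/plain; charset=utf-8',
            'Content-Encoding': 'gzip',
        })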
> 
> The latest version of the zuul_swift_upload script uses the swift
> tempurl functionality to upload logs.  This is because it was designed
> to run on untrusted nodes.  A closer analog to our current Zuul v3 log
> upload system would be to run the uploader on the executor, giving it a
> real swift credential.  It can then upload logs to swift in the normal
> manner, rather than via tempurl.  It can also create containers as
> needed -- another consideration from our earlier work.  By default, it
> could avoid creating containers, but we could configure it to create,
> say, containers for each first-level of our sharding scheme.  This could
> be a general feature of the role that allows for per-site customization.
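
A rough sketch of what the core of that role could look like with openstacksdk follows; the cloud name, container shard, and paths are all made up for illustration:

    import os

    import openstack

    def upload_logs(local_root, container, prefix,
                    cloud='logcloud', create_container=False):
        # Authenticate with a real credential held by the executor.
        conn = openstack.connect(cloud=cloud)
        if create_container:
            # e.g. one container per first-level shard ("logs_ab");
            # creating an existing container is harmless.
            conn.object_store.create_container(name=container)
        # Upload everything under the build's local log directory,
        # preserving relative paths beneath the build's prefix.
        for dirpath, _, filenames in os.walk(local_root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                rel = os.path.relpath(path, local_root)
                with open(path, 'rb') as f:
                    conn.object_store.upload_object(
                        container=container,
                        name='%s/%s' % (prefix, rel),
                        data=f.read())

    # Hypothetical invocation for one build:
    # upload_logs('/var/lib/zuul/builds/UUID/logs', 'logs_ab', 'ab/1234')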

Just a side note that creating containers doesn't necessarily make them publicly available in all deployments. This was an issue we ran into in the past. Rax containers could only be accessed publicly if distributed through their CDN.

> 
> I think that's the approach we should start with, because it will be the
> easiest transition from our current scheme.  However, in the future, we
> can move to having the uploads occur from the test nodes themselves
> (rather than, or in addition to, the executor), by having a two-part
> system.  The first part runs on the executor in a trusted context and
> creates any containers needed, then generates a tempurl, and uses that
> to have the worker nodes upload to the container directly.  I only
> mention this to show that we're not backing ourselves permanently into
> executor-only uploads.  But we shouldn't consider this part of the first
> phase.
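
For completeness, generating such a tempurl is just an HMAC over the method, expiry time, and object path using the account (or container) temp-url key. Here is a sketch with placeholder values; the key must already be set as X-Account-Meta-Temp-URL-Key (or the container equivalent):

    import hmac
    import time
    from hashlib import sha1

    def make_temp_url(host, path, key, method='PUT', ttl=3600):
        # Signed, time-limited URL for a single object; the worker node
        # needs no credential, only this URL.
        expires = int(time.time()) + ttl
        body = '%s\n%d\n%s' % (method, expires, path)
        sig = hmac.new(key.encode(), body.encode(), sha1).hexdigest()
        return 'https://%s%s?temp_url_sig=%s&temp_url_expires=%d' % (
            host, path, sig, expires)

    url = make_temp_url('swift.example.com',
                        '/v1/AUTH_project/logs_ab/12/345/console.log',
                        'secret-temp-url-key')
    # The worker node can then upload with: requests.put(url, data=...)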
> 
> We have also discussed using multiple swifts.  It may be easiest to
> start with one, but in a future where we have executor affinity in Zuul,
> we may want to upload to the nearest swift.  In that case, we can modify
> the role to, rather than be configured with a single swift, support
> multiple swifts, and use the executor affinity information to determine
> if there is a swift colocated in the executor's cloud, and if not, use a
> fallback.  This way we can use multiple swifts as they are available,
> but not require them.
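
The selection logic in the role could stay pretty small; a sketch with made-up cloud names:

    # Pick a swift colocated with the executor's cloud if one is
    # configured, otherwise fall back to a default.
    SWIFT_CLOUDS = {
        'cloud-a': 'swift-cloud-a',
        'cloud-b': 'swift-cloud-b',
    }
    DEFAULT_SWIFT = 'swift-default'

    def pick_swift(executor_cloud):
        return SWIFT_CLOUDS.get(executor_cloud, DEFAULT_SWIFT)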
> 
> To summarize: static generation combined with a new role to upload to
> swift using openstacksdk should allow us to migrate to swift fairly
> quickly.  Once there, we can work on a number of enhancements which I
> will describe in a followup post to zuul-discuss.

This is exciting. I think that zuulv3 puts us in a much better position overall to make use of swift. Job secrets make managing credentials simpler, and the dashboard gives us historical browsing of logs. In return, we should be able to worry less about rotating logs (swift can automatically expire objects), available disk, available inodes, and the general reliability of the backing storage.
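
On the expiry point, that is just one extra header at upload time; another sketch with placeholder values:

    import requests

    storage_url = 'https://swift.example.com/v1/AUTH_project'  # placeholder
    token = 'TOKEN'                                            # placeholder

    # Ask swift to expire the object itself after ~30 days, replacing
    # our current log rotation cron jobs.
    requests.put(
        storage_url + '/logs_ab/12/345/console.log',
        data=b'log contents here',
        headers={'X-Auth-Token': token,
                 'X-Delete-After': str(30 * 24 * 3600)})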

Finally, we will probably need to make changes to the logstash processing of logs so that it fetches the non-HTMLified log contents, since they will be stored separately now. Easy enough; it will just need to be done.

> 
> -Jim



