[OpenStack-Infra] Zuul roadmap

James E. Blair corvus at inaugust.com
Wed Dec 6 15:34:06 UTC 2017


Clint Byrum <clint at fewbar.com> writes:

> I know a bunch of this stuff is janky as all get out, because as much of the
> jankiness is my own fault as anybody else's. But so much work has gone into
> zuulv3 beyond what OpenStack needs, I am still not convinced we need to wait
> for any of this. Maybe the zuul-web stuff, since changing URLs after a release
> is going to be a bear.
>
> I'm confident we'll get some of these done soon, and I may even get a chance to
> contribute directly. But we all know that complexity creeps into engineering in
> the most frustrating ways. I'd prefer that this list gets pared down, and that
> the release comes basically at or right before PTG, even if this list doesn't
> all happen.

I understand where you're coming from, but please also understand that
folks are already starting to show up in #zuul asking us the same
questions repeatedly because of how janky it is.  We're getting really
close to a point where we're spending too much time talking about how
things are going to get a lot easier for folks in just a few weeks
instead of actually doing those things.  So let me go through the list
and either expand on why I think a thing is important for the release,
or move it down in priority.

>> * granular quota support in nodepool (tobias)
>> * zuul-web dashboard (tristanC)
>> * update private key api for zuul-web (jeblair)

These things are basically done.  I agree they don't have to block the
release, but they are so likely to land very soon, we should just plan
for that.  If they don't, we won't wait for them.

>> * github event ingestion via zuul-web (jlk)

Zuul currently has two web servers, and telling people how to set both
of them up is complicated.  This is the sort of thing that will cause
people to think either that this software is not ready to use or that
it's too complicated.

>> * abstract flag (do not run this job) (jeblair)

(I have a WIP patch for this)

We can move this to v3.1.

>> * zuul_json fixes (dmsimard)

This is a known bug that causes Zuul to fail with certain perfectly
valid uses of Ansible.  It's easy for users to hit, but it should also
be easy to fix.

>> * refactor config loading (jeblair)

Originally this task was mostly about solving the forward inheritance
problem, which is done.  At this point, I consider the task to be more
akin to double-checking that the job language isn't missing anything
major that we couldn't fix in the future.

>> * protected flag (inherit only within this project) (jeblair)

(Tobias has a patch for this)

We can move this to v3.1.

>> * refactor zuul_stream and add testing (mordred)

This is important because there are still a number of cases where errors
in Ansible are not reported in the streaming log.  We need to handle
those cases, but this code has evolved quite a bit from its original
implementation to the point where it is difficult to understand, and it
has very limited testing.  This module is nearly frozen until this
refactor happens.  This means that new users (the most likely to hit the
bugs currently masked by this) are going to have a frustrating time --
they'll have to go look at executor logs to identify job failures.

Having said that, if the release came down to this alone, we could
probably delay it.  I'd like to keep this on the list and prioritize
work on it so we can get it into v3.0, but I'm okay deferring it if it's
the last thing standing.

>> * getting-started documentation (leifmadsen)

This is also really important to have for folks -- when we release 3.0
and say "okay, we've spent 2 years telling you not to use it, go use it
now" we should have some instructions to help people do that.  It's a
complicated system, and I don't want folks bouncing off of it the first
time they try it.

However, I'll repeat the caveat from the last item here -- if it's the
last thing standing, we don't have to wait for it.

>> * demonstrate openstack-infra reporting on github
(pabelanger has since volunteered for this and begun work)

This item is about ensuring that the GitHub support works at scale.
We've had a number of folks using the GitHub driver, but as soon as we
started having the openstack-infra instance of Zuul watch some busy
GitHub repos, we started seeing errors in the log.  This is an important
new feature, and I want to make sure it's ready.  This item will either
be easy because we've already fixed the major issues, or we're going to
discover serious bugs that we would deal with soon after release anyway.

>> * cross-source dependencies

This is part of the work of adding a second source.  One of Zuul's major
features is cross-repo dependencies, and when we release GitHub support
for Zuul, I think it's important that we're able to tell a story about
how Zuul can integrate all of the repositories it works with.  It is so
much more compelling.  I understand that some folks don't need this, but
it is important to a lot of the folks interested in Zuul and awaiting
the 3.0 release.

>> * add command socket to scheduler and merger for consistent start/stop
(pabelanger has since volunteered for this and begun work)

The fact that the different Zuul services behave differently is awkward
for new and old users alike; we constantly struggle with this in
openstack-infra operations.  This is a big usability improvement we can
make quickly.  It's already in progress and needn't hold us up.

>> * finish git driver
(fbo has since volunteered for this and begun work)

This is important for the zuul-jobs repo.  We want to advocate for a
standard library of jobs that everyone can use.  We're not going to get
everything about that right in 3.0 for sure.  However, we want people to
at least be able to start participating and understand the mechanism.
If we don't do this, we would have to instruct every Zuul user to either
fork the repo or set up a Gerrit connection to OpenStack's Gerrit
server.  Neither is how we expect this to be used in the long run.

Substantial work already exists here, so finishing it seems prudent so
that we don't tell folks to set things up one way and then immediately
try to convince them to change it.

>> * standardize javascript tooling

This is about making the web portion of Zuul actually deployable.  It
would be great to have it in 3.0.  If it slips and we're still just
stuck with a "wget" script, it's probably not the end of the world.
However, some work has been done on this already, so I'm optimistic we
can finish it in time.

----------------------

Having gone through that, there are some things we need to add to the
list:

* Static driver in nodepool

As discussed previously on the list, we should merge this before 3.0:
not only is it an important feature, it also radically simplifies the
process of experimenting with Zuul for new users.

Tristan has patches ready for this.

* Add finger gateway

The fact that the executor must be started as root in order to listen on
port 79 is awkward for new users.  It can be avoided by configuring it
to listen on a different port, but that's also awkward.  In either case,
it's a significant hurdle, and it's one of the most frequently asked
questions in IRC.  The plan to deal with this is to create a new service
solely to multiplex the finger streams.  This is very similar to the
zuul-web service for multiplexing the console streams, so all the
infrastructure is in place.  And of course, running this service will be
optional, so new users won't have to deal with finger at all to get up
and running, as they do now.  Adding a new service to the 3.0
list should not be done lightly, but the improvement in experience for
new users will be significant.

David Shrewsbury has started on this.  I don't think it is out of reach.
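
To make the idea concrete, here is a rough sketch of what such a gateway
could look like.  It is not the planned implementation: the routing
table, the class and function names, and port 7979 are all made up for
illustration, and a real gateway would have to ask Zuul where a build is
running.  It only shows the shape: listen on an unprivileged port, read
the one-line finger query, and relay the matching executor's stream.

```python
import socket
import socketserver

# Hypothetical routing table: build UUID -> (executor host, port).
# A real gateway would look this up from Zuul; this dict is illustrative only.
BUILD_LOCATIONS = {}


def parse_finger_request(raw):
    """Extract the query (assumed here to be a build UUID) from one
    finger request line, which the client terminates with CRLF."""
    return raw.decode("utf-8", errors="replace").strip()


class FingerGatewayHandler(socketserver.StreamRequestHandler):
    def handle(self):
        build_id = parse_finger_request(self.rfile.readline())
        target = BUILD_LOCATIONS.get(build_id)
        if target is None:
            self.wfile.write(b"Unknown build\n")
            return
        # Forward the query to the executor and copy its stream back
        # to the client until the executor closes the connection.
        with socket.create_connection(target) as upstream:
            upstream.sendall(build_id.encode("utf-8") + b"\r\n")
            while True:
                chunk = upstream.recv(4096)
                if not chunk:
                    break
                self.wfile.write(chunk)


class FingerGateway(socketserver.ThreadingTCPServer):
    allow_reuse_address = True
    daemon_threads = True


def make_gateway(host="127.0.0.1", port=7979):
    """Create the gateway on an unprivileged port; binding port 79
    would require root (or an equivalent capability)."""
    return FingerGateway((host, port), FingerGatewayHandler)
```

With something like this in front, only the gateway would ever need
port 79, and the executors could stay on unprivileged ports.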

* Github3.py needs to be released

This one is fairly non-negotiable -- I don't think we can release our
software until our dependencies have also been released.  We currently
require the development branch of github3.py.

Jesse has volunteered to inquire about the feasibility of a release.

* Resolve issues with gitpython

We also currently depend on a private branch of gitpython which contains
an old release and a necessary patch.  Earlier, a then-current release
included a major performance regression for us.  I submitted a PR to
address that, and I believe it has made it into a release.  We should
verify that, then try the latest release and determine whether there are
further regressions.  Previously, it was
sufficient to just run the unit tests with different gitpython versions
and observe the resulting run times to see the regression.

This one is also necessary.
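
The run-time comparison described above could be scripted roughly like
this.  It is only a sketch: the helper names and the 1.5x tolerance are
assumptions for illustration, and the command used to run the unit tests
in each environment is left to the caller.

```python
import subprocess
import time


def time_command(cmd):
    """Run a command (e.g. the unit test suite) and return the elapsed
    wall-clock time in seconds.  Raises if the command fails."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def flag_regressions(run_times, baseline, tolerance=1.5):
    """Given {version label: seconds}, return the labels whose run time
    exceeds the baseline's by more than the (assumed) tolerance factor."""
    limit = run_times[baseline] * tolerance
    return sorted(
        label
        for label, seconds in run_times.items()
        if label != baseline and seconds > limit
    )
```

The idea would be to call time_command once per environment, each with a
different gitpython version installed, collect the results in a dict
keyed by version, and pass that to flag_regressions with the currently
pinned version as the baseline.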

I think all of this is achievable before the PTG.  I hope you and others
do too and are willing to jump in and help -- there are still a few
unclaimed items.

-Jim


