[OpenStack-Infra] Zuul info for PTG

James E. Blair corvus at inaugust.com
Sat Feb 11 01:20:56 UTC 2017


Hi!

We're meeting in person at the Project Team Gathering in Atlanta soon,
so I wanted to share the current state of Zuul and how we're thinking
about setting up jobs so we're all able to contribute to our goal of
actually running Zuul v3 when we are there.

Also, we might borrow some of these words for the user manual later.

There's a lot of info in the spec about how jobs are defined in Zuul
v3, some of it general, and some of it directed at how we will do things
specifically for OpenStack.  But there are also areas where it is
intentionally vague as it was not clear at the time the best way to
actually implement some ideas.  We're much closer now, and can fill in
some of the gaps.  I'd like to do that and present this information in
the context of how I expect us to use it in the OpenStack project.

How Configuration Works
-----------------------

The first significant change is that Zuul's configuration is contained
almost entirely in git repositories that Zuul directly manages.  We will
have a small snippet of YAML that will live in project-config which our
system-config driven ansible-puppet system will install on the main Zuul
server.  That YAML file is the tenant config file for Zuul, which lists
the tenants we will have (we will probably start with just one,
"OpenStack", but we might further divide later), and the repos that are
associated with that tenant (pretty much everything in Gerrit).
Eventually we may have Zuul automatically query these, but for now, we
will have to list them.  However, unlike our current 19,000 line
zuul.yaml file, all we need to do is list the names of the repos.  So
it'll only be 1,700 lines.  :)

For the PTG, we'll only be working with a small number of repos, so we
can just list project-config, zuul, and nodepool.  We can probably put
this in project-config/zuul/main.yaml.

When listing each of the projects in main.yaml, we will specify whether
it is a "config repo" or a "project repo".  The distinction is explained
below in _Security Contexts_; for the moment, just know that this is
where we will configure that.
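
To make that concrete, here is a rough sketch of what such a main.yaml
could look like.  The exact key names are illustrative assumptions on my
part, not the final syntax:

  # project-config/zuul/main.yaml (hypothetical sketch)
  - tenant:
      name: openstack
      config-repos:
        - openstack-infra/project-config
      project-repos:
        - openstack-infra/zuul
        - openstack-infra/nodepool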

When Zuul reads this config file from the local disk, it then performs a
bunch of git operations to read the rest of its configuration from those
repos.  This is how the live dynamic reconfiguration is possible.

We will *also* have a significant chunk of Zuul configuration inside of
project-config that will be read this way.  It's worth noting that this
is distinct from the tenant config.  These things don't have to be in
the same repo; I just think that makes the most sense for the way we are
organized in OpenStack.  The main part of the dynamic config will be in
project-config/zuul.yaml.  That is where we will define pipelines and
any centralized jobs.

Security Contexts
-----------------

Whenever a playbook runs, it runs with a security context which is
determined by the repo in which the playbook resides.

Repositories which are designated as "project repos" are considered
untrusted.  This means that any playbook defined in that repo is not
permitted to access anything on the Zuul launcher.  It can use any
Ansible module on the remote worker nodes, but it can not, for example,
access local storage on the launcher itself, nor can it use non-standard
Ansible plugins.  Because of the relative safety that this affords,
project repos are permitted to change their configuration dynamically --
that is, a proposed change to a .zuul.yaml file within a project repo
will cause Zuul to run *with the proposed configuration change* when
evaluating it.

Repositories which are "config repos" are trusted and can contain jobs
and configuration which could potentially be used to compromise the Zuul
system.  Pre and Post Playbooks (more on this later) defined here *are*
permitted to access local storage on the Zuul launcher.  They can also
use ansible plugins which may allow even further local or network
actions.  Because of this extra level of access, proposed changes to
config repos are not run with the new configuration.  They retain their
current configuration until the proposed change is approved and actually
lands.  However, it's worth noting that since Zuul can detect that
event, it will be able to immediately reconfigure itself after a change
to a config repo lands.

Obviously, in OpenStack, people are pretty excited about being able to
see job configuration changes run before they land.  Any jobs which are
configured in project-config won't be able to take advantage of that
(though, because of the immediate reconfiguration, iteration will be
much faster than before).  To deal with that, we can restrict ourselves
to defining in project-config only the jobs or job components which
require extra care or extra levels of access.  The remainder of the
configuration can either be placed in the projects themselves, or
perhaps in a newly created central repo similar to project-config, but
without the same level of access.

As an example, the configuration of the basic devstack jobs could be
placed in devstack-gate, which could be configured as a project repo.
Projects will still be able to inherit jobs defined there and create
their own variants within their own repos.  All of this would be fully
subject to dynamic reconfiguration.
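
For instance, a sketch of what a project's own .zuul.yaml might contain
(the job name and attribute values here are hypothetical):

  # .zuul.yaml in a project repo, subject to dynamic reconfiguration
  - job:
      name: devstack-myservice
      parent: devstack        # inherits from the job in devstack-gate
      timeout: 90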

Thoughts on Job Organization
----------------------------

Zuul scans all of the repos it knows about in order to construct its
full running configuration.  In essence, OpenStack's Zuul configuration
will be spread across nearly 2000 repositories.  We can make this as
centralized or as distributed as we choose.

A job can only be defined in a single repository.  Zuul loads its
configuration in the order specified in the tenant configuration file.
This means we should *not* list projects in alphabetical order (sorry).
Instead, we should list project-config, devstack-gate, and other
"central" repositories first so that they have precedence (we can list
other repositories in alphabetical order after these, if we want).  If a
project adds a job definition that shadows an already existing
definition from a previous project, Zuul will raise an error and prevent
that configuration from being loaded.  The necessity of this should be
obvious once you consider the idea that otherwise, any project could
alter the definition of a job defined by another project.

Within a single repository, a job can be defined multiple times.  Each
of these definitions is a "variant" of that job.  The reason to use
variants is generally to deal with branch differences.  For example, we
might define the python27 job to run on Xenial, but we might define a
variant which runs on Precise for changes on a stable branch.  The upshot
is a configuration which is very readable and should be clear to most
casual users:

  - job:
      name: python27
      node: xenial

  - job:
      name: python27
      branch: stable/juno
      node: precise

Hopefully that reads as "run python27 on xenial, except use precise for
the stable/juno branch".

While all of these variants must be defined in the same repo, any job
may be used in any other repo.  That allows us to define "python27" in
project-config, and "devstack" in "devstack-gate", and let any project
make use of it.  It also lets us define "zuul-nodepool-integration" in
the zuul project but also have nodepool make use of it.

Job Definition
--------------

The way a job is attached to a project is very similar to v2.  It
happens in a "project-pipeline".  A pipeline is a series of conditions,
triggers, and reporters which define the workflow.  A project-pipeline
is the application of that pipeline to a project.  For example:

  - pipeline:           }
      name: gate        }  This is a pipeline.
      trigger: gerrit   }  Obviously.
      reporter: gerrit  }

  - project:            }  This is a project definition.
      name: zuul        }
      gate:             }  } This is a project-pipeline.
        jobs:           }  }
          - pep8        }  }
          - python27    }  }
       
The project-pipeline takes on an important new role in v3.  Not only
does it indicate what job to run, but it may now modify the job.  The
job entry in the project-pipeline is actually another job variant.  It
can look like this:

  - project:
      name: zuul
      gate:
        jobs:
          - python27:
              timeout: 60

That means run the python27 job, but set the timeout to 60 minutes.  You
can even list a job multiple times with different branch specifiers to
modify the run characteristics for each branch.
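
For example (purely illustrative values), the same project-pipeline
could carry two entries for python27 with different branch specifiers:

  - project:
      name: zuul
      gate:
        jobs:
          - python27:
              branch: master
              timeout: 60
          - python27:
              branch: stable/juno
              timeout: 120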

Many of the 2300 lines of job regexes we currently have in zuul.yaml are
focused on changing aspects of a job related to an individual
repository.  This lets us drop most of that and makes for a far more
comprehensible system.

It might be dangerous to allow modification to some jobs outside of the
repo in which they are defined.  For example, we would not want anyone
to change what is run in the "pypi-upload" job since it will run with
credentials which can access pypi.  I expect us to define it in
project-config, which runs in a secure context, and of course that means
that no other repo can modify it with a variant.  However, other repos
might modify it using a project-pipeline variant.  To handle this, we
have added the concept of a "final" job.  We will expose that as a job
attribute (so you can mark a job as final explicitly).  But also, any
job with authentication information which is not inheritable will
automatically be made final.

A final job may have non-executable attributes (such as branch selectors
and the voting flag) configured in a project-pipeline, but may not alter
the playbook, nodes, variables, or other executable attributes.
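
A sketch of what that might look like (the "final" attribute name and
the release pipeline are assumptions based on the description above):

  - job:
      name: pypi-upload
      final: true

  - project:
      name: zuul
      release:
        jobs:
          - pypi-upload:
              branch: master  # allowed: a branch selector
              # changing nodes, variables, or the playbook here would
              # be rejected because pypi-upload is final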

Job Construction
----------------

When Zuul decides to run a job, it has to consider all of these variants
and figure out what it should actually do.  The process it uses is
simple: it considers each variant in the order it was defined, and
applies each matching one in order.  Then the "implied variant" that
appears in the project-pipeline config is applied last.

Internally, this process is referred to as "freezing" a job.  It happens
right before a job is executed, and is performed to ensure that a job
runs with just the right configuration, and that configuration does not
change while the job is running.

Inheritance
-----------

Jobs also can inherit from each other.  Unlike variants, jobs may
inherit from jobs in other repositories (so nodepool can define a
devstack job that inherits from the main devstack job defined in
devstack-gate).  Also unlike variants, inheritance is not dynamic at run
time.  Instead, when a job inherits from another, it only inherits from
the first variant of that other job.  That means the best practice for
this will be to define the most general variant first, as that is what
will be used by any further jobs which inherit from it.

It's possible we may want to make inheritance a bit more dynamic.  We
might be able to in the future, but for now, this is how it's
implemented.

The way I expect this to be used is for us to define a *very general*
base job, and start inheriting from there.

For example:

  - job:
      name: base
      timeout: 30
      node: xenial
      pre-run: copy-repos
      post-run: copy-logs

  - job:
      name: python27
      parent: base

Playbooks
---------

Every job in Zuul v3 is associated with a playbook.  The metadata about
a job (its name, what nodes it runs on, etc) are all defined in Zuul
configuration as described above.  The actual execution content is in a
playbook.  Unless otherwise specified (with the "run" attribute), Zuul
will look for a playbook with the same name as the job in the
"playbooks/" directory of the repo where the job is defined.  Zuul will
look for such a playbook at each stage of the inheritance hierarchy and
run the first one it finds.  So in our example above, it would look for
'project-config/playbooks/python27.yaml'.  I expect us to define that
playbook, so it will run that.  However, if we do not, it will try to
run 'project-config/playbooks/base.yaml'.  I don't expect us to define
that, meaning that any job that inherits from base without a playbook is
an error.
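
A hypothetical minimal playbooks/python27.yaml might be nothing more
than this (the working directory is a placeholder, since it depends on
how the workspace is laid out):

  # project-config/playbooks/python27.yaml (illustrative sketch)
  - hosts: all
    tasks:
      - name: Run the python27 unit tests
        command: tox -e py27
        args:
          # placeholder path; depends on where the repos are copied
          chdir: "~/workspace/zuul"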

Jobs may also define pre-playbooks and post-playbooks using the
'pre-run' and 'post-run' attributes.  Zuul will again look for those in
the playbooks directory where the job is defined.  They will run with
the security context of the repo in which they are defined as well.

I expect us to define the 'copy-repos' pre-playbook in project-config.
In Zuul v3, the repos for a job are set up on the zuul launcher node,
and then pushed onto the worker nodes (rather than pulled from zuul
mergers as in v2).  That mechanism is not built into Zuul itself --
because it doesn't have to be.  With security contexts, we can perform
any local action we need to on the zuul launcher by writing ansible
playbooks.  So rather than hard-coding the idea of pushing repos onto
the node, we will just use ansible to do it.  This allows a lot of
flexibility and customization in the sort of things that are run before
and after jobs, and reduces the amount of logic that is hardcoded in
Zuul itself.
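
As a sketch of the idea (the variable name here is a placeholder, not a
real Zuul-provided one), copy-repos might be little more than an rsync
push from the launcher to each node:

  # hypothetical copy-repos pre-playbook in project-config (a config
  # repo, so it may act on the launcher's local filesystem)
  - hosts: all
    tasks:
      - name: Push the prepared git repos from the launcher to the node
        synchronize:
          src: "{{ zuul_git_root }}/"   # placeholder variable
          dest: "~/workspace/"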

However, this is obviously such a fundamental thing that we don't want
every Zuul user to have to implement it from scratch.  So once we have a
working version of 'copy-repos', we will add it to a "standard library"
of playbooks that we will ship with Zuul.  Other playbooks we might add
for the convenience of others are "run a shell script", "run tox", etc.

We might make that a separate repo in the future.

While a job will only run a single playbook as its main content, each
time a job inherits from another (or a variant is applied) with a pre-
or post- run attribute, those playbooks are nested.  So in our case, the
'copy-repos' playbook defined in the base job will always be the first
to run; if python27 added its own pre-playbook, it would run after
copy-repos; similarly, a python27 post-playbook would run immediately
after the main playbook, and the 'copy-logs' playbook from base would
run last.  This is so that later jobs can rely on the fundamental
actions like setting up a repo or copying logs to happen first and last
respectively.

Roles
-----

Playbooks feature prominently in the configuration, but it's important
to remember that the re-usable unit of executable content in Ansible is
the role.  So as we write jobs, we should focus on writing roles which
are usable by multiple jobs (in multiple playbooks).  We've already
started this process in the devstack-gate job.  Playbooks should often
be simple affairs that just list a series of roles, much in the same way
that portions of our JJB config list a series of 'builder' macros.
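
In other words, a playbook might end up looking like little more than
this (the role names are hypothetical):

  - hosts: all
    roles:
      - prepare-workspace
      - run-tox
      - collect-logs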

Jobs in Zuul v3 will be able to specify that they require certain roles
to operate and Zuul will make sure they are installed before running the
job.

Nodepool
--------

Zuul v3 expects a very different version of Nodepool to be running than
that used by v2.  It explicitly requests nodes from Nodepool when
needed, rather than having them assigned based on estimates and
deficits.  A job can be very specific about what kinds of nodes it
needs, specifying how many of each image type and what the name for each
node should be.  Those will be placed in the ansible inventory for the
job.

Additionally, we can define "Nodesets" which are standardized
definitions of nodes which can be referenced by name.  So we might
define a "multinode" nodeset as a controller and a compute node, both
running xenial.  Then jobs may simply refer to "multinode" and get a
grouping matching that description.
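
The syntax for this is still settling, but a sketch might look roughly
like:

  # hypothetical nodeset definition and a job that uses it
  - nodeset:
      name: multinode
      nodes:
        - name: controller
          image: xenial
        - name: compute
          image: xenial

  - job:
      name: devstack-multinode
      nodes: multinode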

What to Expect at the PTG
-------------------------

By the time we get to the PTG, we think we should have the following
ready to go:

* Nodepool image building.  This is already in production.  We can use
  the running production nodepool images with our new version of
  nodepool.

* Nodepool-launcher able to launch nodes and satisfy requests from zuul.

* Zuul-scheduler and zuul-launcher able to run.

* Servers to run them on.

* Most of the job configuration features.

* Ability to run ansible playbooks (including pre and post), in both
  secure and insecure contexts.

* Dynamic reconfiguration.

* Simple console streaming.

Notable things that may not, or will not, be ready by the PTG:

* We probably won't have support for job secrets.  That shouldn't be an
  impediment, except that we will probably want to copy log files
  somewhere which will require access.  Fortunately, we can implement
  log file copying the way we do today: use an SSH key on the zuul
  launcher and write a post-playbook to use that to copy logs.  That
  should work as long as the post-playbook is defined in a secure
  context.

* Nodepool does not yet delete nodes.  We might get that done by the
  PTG.  If not, that's why we have for loops in bash.

* We may not have support for roles or for specifying that zuul should
  set up repositories for a job other than the one being tested.
  Neither of these should be an impediment for a simple job, though both
  are required for more complex jobs like devstack-gate.

* We hope to have some limited console streaming (such as from an
  individual node), however, we may not have a system to multiplex
  streams from multiple nodes yet.

I believe we are in a good position to make some headway in actually
setting up our nascent Zuul v3 installation.  I think we can expect to:

* Get Zuul and Nodepool daemons running.

* Create a basic Zuul v3 configuration which we can build on as we
  progress.

* Write the fundamental pre and post playbooks which we want to run for
  every job.

* Write a hello world job and playbook, and run it on the zuul repo.

* Continue to work on other jobs until we hit the limits of what has
  been implemented to date in Zuul.

I think this will be a great opportunity for all of us to work together
on this and it will provide a solid foundation for us to roll out Zuul
v3 as it's ready.  I'm looking forward to working on this with everyone
soon!

-Jim


