[OpenStack-Infra] Thoughts on evolving Zuul

James E. Blair corvus at inaugust.com
Thu Feb 26 16:59:08 UTC 2015


Hi,

I've been wanting to make some structural changes to Zuul to round it
out into a coherent system.  I don't want to change it too much, but I'd
also like a clean break with some of the baggage we've been carrying
around from earlier decisions, and I want it to be able to continue to
scale up (the config in particular is getting hard to manage with >500
projects).

I've batted a few ideas around with Monty, and I've written up my
thoughts below.  This is mostly a narrative exploration of what I think
it should look like.  This is not exhaustive, but I think it explores
most of the major ideas.  The next step is to turn this into a spec and
start iterating on it and getting more detailed.

I'm posting this here first for discussion to see if there are any
major conceptual things that we should address before we get into more
detailed spec review.  Please let me know what you think.

-Jim

=======
 Goals
=======

Make Zuul scale to thousands of projects.
Make Zuul more multi-tenant friendly.
Make it easier to express complex scenarios in layout.
Make nodepool more useful for non-virtual nodes.
Make nodepool more efficient for multi-node tests.
Remove need for long-running slaves.
Make it easier to use Zuul for continuous deployment.

To accomplish this, changes to Zuul's configuration syntax are
proposed, making it simpler to manage large numbers of jobs and
projects, along with a new method of describing and running jobs, and
a new system for node distribution with Nodepool.

=====================
 Changes To Nodepool
=====================

Nodepool should be made to support explicit node requests and
releases.  That is to say, it should act more like its name -- a node
pool.

Rather than having servers add themselves to the pool by registering
with gearman (directly, or via Jenkins on their behalf), nodepool
should instead define functions that supply nodes on demand.  For
example it might define the gearman functions "get-nodes" and
"put-nodes".  Zuul might request a node for a job by submitting a
"get-nodes" job with the node type (eg "precise") as an argument.  It
could request two nodes together (in the same AZ) by supplying more
than one node type in the same call.  When complete, it could call
"put-nodes" with the node identifiers to instruct nodepool to return
them (nodepool might then delete, rebuild, etc).
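
As a concrete sketch (the function names above and the payload fields
below are assumptions, not a settled interface), a two-node request
and its reply might look something like::

  # Hypothetical argument to a "get-nodes" gearman job, asking for two
  # nodes of the same type in the same AZ:
  get-nodes:
    node-types:
      - devstack-precise
      - devstack-precise
    requestor: zuul

  # Hypothetical reply from nodepool; the same identifiers would later
  # be passed back in a "put-nodes" call to return the nodes:
  nodes:
    - id: 1234
      address: 198.51.100.10
    - id: 1235
      address: 198.51.100.11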

This model is much more efficient for multi-node tests, where we will
no longer need to have special multinode labels.  Instead the
multinode configuration can be much more ad-hoc and vary per job.

The testenv broker used by tripleo behaves somewhat in this manner
(though it only supports static sets of resources).  It also has logic
to deal with the situation where Zuul might exit unexpectedly and not
return nodes (though it should strive to do so).  This feature in the
broker should be added to nodepool.  Additionally, nodepool should
support fully static resources (they should become just another node
type) so that it can handle the use case of the test broker.
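
For illustration only (the syntax here is an assumption to be worked
out in the spec), such static resources might be declared as just
another label in the nodepool configuration::

  ### nodepool.yaml (sketch; the "static-nodes" key is hypothetical)
  labels:
    - name: tripleo-testenv
      static-nodes:
        - te-broker-env01.example.org
        - te-broker-env02.example.org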

=================
 Changes To Zuul
=================

Zuul is currently fundamentally a single-tenant application.  Some
folks want to use it in a multi-tenant environment.  Even within
OpenStack, we have use for multitenancy.  OpenStack might be one
tenant, and each stackforge project might be another.  Even if the big
tent discussion renders that thinking obsolete, we may still want the
kind of separation multi-tenancy can provide.  The proposed
implementation is flexible enough to run Zuul completely single-tenant
with shared everything, completely multi-tenant with shared nothing,
or anywhere in between.  Being able to adjust how much is shared or
required, and how much is left to individual projects, will be very
useful.

To support this, the main configuration should define tenants, and
tenants should specify config files to include.  These include files
should define pipelines, jobs, and projects, all of which are
namespaced to the tenant (so different tenants may have different jobs
with the same names)::

  ### main.yaml
  - tenant:
      name: openstack
      include:
        - global_config.yaml
        - openstack.yaml

Files may be included by more than one tenant, so common items can be
placed in a common file and referenced globally.  This means that for
OpenStack, for example, we can define pipelines and our base job
definitions (with logging info, etc) once, and include them in all of
our tenants::

  ### main.yaml (continued)
  - tenant:
      name: openstack-infra
      include:
        - global_config.yaml
        - infra.yaml

A tenant may optionally specify repos from which it may derive its
configuration.  In this manner, a project may keep its Zuul
configuration within its own repo.  This would only happen if the
main configuration file specifies that it is permitted::

  ### main.yaml (continued)
  - tenant:
      name: random-stackforge-project
      include:
        - global_config.yaml
      repos:
        - stackforge/random  # Specific project config is in-repo

Jobs defined in-repo may not have access to the full feature set
(including some authorization features).  They also may not override
existing jobs.

Job definitions continue to have the features in the current Zuul
layout, but they also take on some of the responsibilities currently
handled by the Jenkins (or other worker) job definition::

  ### global_config.yaml
  # Every tenant in the system has access to these jobs (because each
  # tenant definition includes this file).
  - job:
      name: base
      timeout: 30m
      node: precise   # Just a variable for later use
      nodes:  # The operative list of nodes
        - name: controller
          image: {node}  # Substitute the variable
      auth:  # Auth may only be defined in central config, not in-repo
        swift:
          - container: logs
      pre-run:  # These specify what to run before and after the job
        - zuul-cloner
      post-run:
        - archive-logs

Jobs have inheritance, and the above definition provides a base level
of functionality for all jobs.  It sets a default timeout, requests a
single node (of type precise), and requests swift credentials to
upload logs.  Further jobs may extend and override these parameters::

  ### global_config.yaml (continued)
  # The python 2.7 unit test job
  - job:
      name: python27
      parent: base
      node: trusty

Our use of job names specific to projects is a holdover from when we
wanted long-lived slaves on jenkins to efficiently re-use workspaces.
This hasn't been necessary for a while, though we have used this to
our advantage when collecting stats and reports.  However, job
configuration can be simplified greatly if we simply have a job that
runs the python 2.7 unit tests which can be used for any project.  To
the degree that we want to know how often this job failed on nova, we
can add that information back in when reporting statistics.  Jobs may
have multiple aspects to accommodate differences among branches, etc.::

  ### global_config.yaml (continued)
  # Version that is run for changes on stable/icehouse
  - job:
      name: python27
      parent: base
      branch: stable/icehouse
      node: precise

  # Version that is run for changes on stable/juno
  - job:
      name: python27
      parent: base
      branch: stable/juno  # Could be combined into previous with regex
      node: precise        # if concept of "best match" is defined

Jobs may specify that they require more than one node::

  ### global_config.yaml (continued)  
  - job:
      name: devstack-multinode
      parent: base
      node: trusty  # could do same branch mapping as above
      nodes:
        - name: controller
          image: {node}
        - name: compute
          image: {node}

Jobs defined centrally (i.e., not in-repo) may specify auth info::

  ### global_config.yaml (continued)  
  - job:
      name: pypi-upload
      parent: base
      auth:
        password:
          pypi-password: pypi-password
          # This looks up 'pypi-password' from an encrypted yaml file
          # and adds it into variables for the job

Pipeline definitions are similar to the current syntax, except that
they support specifying additional information for jobs in the context of
a given project and pipeline.  For instance, rather than specifying
that a job is globally non-voting, you may specify that it is
non-voting for a given project in a given pipeline::

  ### openstack.yaml
  - project:
      name: openstack/nova
      gate:
        queue: integrated  # Shared queues are manually built
        jobs:
          - python27  # Runs version of job appropriate to branch
          - devstack
          - devstack-deprecated-feature:
              branch: stable/juno  # Only run on stable/juno changes
              voting: false  # Non-voting
      post:
        jobs:
          - tarball:
              jobs:
                - pypi-upload

Currently unique job names are used to build shared change queues.
Since job names will no longer be unique, shared queues must be
manually constructed by assigning them a name.  Projects with the same
queue name for the same pipeline will have a shared queue.
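
For example, another project joins nova's shared queue simply by
naming the same queue in the same pipeline (shown here with
openstack/neutron purely as an illustration)::

  ### openstack.yaml (continued)
  - project:
      name: openstack/neutron
      gate:
        queue: integrated  # Same queue name as nova, so a shared queue
        jobs:
          - python27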

A subset of functionality is available to projects that are permitted to
use in-repo configuration::

  ### stackforge/random/.zuul.yaml
  - job:
      name: random-job
      parent: base      # From global config; gets us logs
      node: precise

  - project:
      name: stackforge/random
      gate:
        jobs:
          - python27    # From global config
          - random-job  # From local config

The executable content of jobs should be defined as ansible playbooks.
Playbooks can be fairly simple and might consist of little more than
"run this shell script" for those who are not otherwise interested in
ansible::

  ### stackforge/random/playbooks/random-job.yaml
  ---
  - hosts: controller
    tasks:
      - shell: run_some_tests.sh

Global jobs may define ansible roles for common functions::

  ### openstack-infra/zuul-playbooks/python27.yaml
  ---
  - hosts: controller
    roles:
      - role: tox
        env: py27

Because ansible has well-articulated multi-node orchestration
features, this permits very expressive job definitions for multi-node
tests.  A playbook can specify different roles to apply to the
different nodes that the job requested::

  ### openstack-infra/zuul-playbooks/devstack-multinode.yaml
  ---
  - hosts: controller
    roles:
      - devstack
  - hosts: compute
    roles:
      - devstack-compute

Additionally, if a project is already defining ansible roles for its
deployment, then those roles may be easily applied in testing, making
CI even closer to CD.  Finally, to make Zuul more useful for CD, Zuul
may be configured to run a job (ie, ansible role) on a specific node.
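
As a sketch of that last point (the syntax is an assumption), a
deployment job might simply request one of the fully static node
types described in the nodepool section above::

  ### global_config.yaml (continued)
  # Hypothetical continuous-deployment job run against a fixed host
  - job:
      name: deploy-production
      parent: base
      nodes:
        - name: target
          image: prod-static  # a static node type from nodepool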

The pre- and post-run entries in the job definition might also apply
to ansible playbooks and can be used to simplify job setup and
cleanup::

  ### openstack-infra/zuul-playbooks/zuul-cloner.yaml
  ---
  - hosts: all
    roles:
      - role: zuul-cloner
        zuul: "{{ zuul }}"

Where the zuul variable is a dictionary containing all the information
currently transmitted in the ZUUL_* environment variables.  Similarly,
the log archiving script can copy logs from the host to swift.
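
For instance, the post-run playbook referenced by the base job might
look something like the following (the role name and parameter are
assumptions)::

  ### openstack-infra/zuul-playbooks/archive-logs.yaml
  ---
  - hosts: all
    roles:
      - role: upload-logs       # hypothetical role
        swift_container: logs   # hypothetical parameter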

A new Zuul component would be created to execute jobs.  Rather than
running a worker process on each node (which requires installing
software on the test node, establishing and maintaining network
connectivity back to Zuul, and coordinating actions across nodes for
multi-node tests), this new component will accept jobs from Zuul and,
for each one, write an ansible inventory file with the node and
variable information, then execute the ansible playbook for that job.
This means that the new Zuul component will maintain ssh
connections to all hosts currently running a job.  This could become a
bottleneck, but ansible and ssh have been known to scale to a large
number of simultaneous hosts, and this component may be scaled
horizontally.  It should be simple enough that it could even be
automatically scaled if needed.  In turn, however, this does make node
configuration simpler (test nodes need only have an ssh public key
installed) and makes tests behave more like deployment.
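
To make that flow concrete, the request Zuul hands to this component
for a single job might carry something like the following (all field
names here are assumptions)::

  # Hypothetical payload from Zuul to the job-execution component
  job: devstack-multinode
  playbook: openstack-infra/zuul-playbooks/devstack-multinode.yaml
  nodes:
    - name: controller
      address: 198.51.100.10
    - name: compute
      address: 198.51.100.11
  variables:
    zuul:
      project: openstack/nova
      pipeline: gate
      branch: master

The component would write the nodes into an ansible inventory grouped
by name (so plays can target "controller" or "compute"), merge the
variables in, and run the playbook over ssh.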


