<div dir="ltr"><span style="font-size:12.7272720336914px">Thanks Jim. This makes a lot of sense and will hopefully make things simpler and more robust. </span><div style="font-size:12.7272720336914px"><br></div><div style="font-size:12.7272720336914px">Just a few questions:<div><div>1. It looks like zuul can request a specific set of nodes for a job. Do you envision the typical ansible playbook to install additional things required for the jobs? or would zuul always need to request a suitable node for the job?</div><div>2. Would there be a way to share environment variables across multiple shell tasks? For example would it be possible to reference a variable defined in the job yaml file from inside of a shell script? </div><div><br></div><div>-Khai</div></div><div><br></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Feb 26, 2015 at 8:59 AM, James E. Blair <span dir="ltr"><<a href="mailto:corvus@inaugust.com" target="_blank">corvus@inaugust.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

I've been wanting to make some structural changes to Zuul to round it
out into a coherent system. I don't want to change it too much, but I'd
also like a clean break with some of the baggage we've been carrying
around from earlier decisions, and I want it to be able to continue to
scale up (the config in particular is getting hard to manage with >500
projects).

I've batted a few ideas around with Monty, and I've written up my
thoughts below. This is mostly a narrative exploration of what I think
it should look like. This is not exhaustive, but I think it explores
most of the major ideas. The next step is to turn this into a spec and
start iterating on it and getting more detailed.

I'm posting this here first for discussion to see if there are any
major conceptual things that we should address before we get into more
detailed spec review. Please let me know what you think.

-Jim

=======
Goals
=======

Make Zuul scale to thousands of projects.
Make Zuul more multi-tenant friendly.
Make it easier to express complex scenarios in the layout.
Make nodepool more useful for non-virtual nodes.
Make nodepool more efficient for multi-node tests.
Remove the need for long-running slaves.
Make it easier to use Zuul for continuous deployment.

To accomplish this, changes to Zuul's configuration syntax are
proposed, making it simpler to manage large numbers of jobs and
projects, along with a new method of describing and running jobs, and
a new system for node distribution with Nodepool.


=====================
Changes To Nodepool
=====================

Nodepool should be made to support explicit node requests and
releases. That is to say, it should act more like its name -- a node
pool.

Rather than having servers add themselves to the pool by registering
with gearman (either directly or via Jenkins on their behalf), nodepool
should instead define functions to supply nodes on demand. For
example it might define the gearman functions "get-nodes" and
"put-nodes". Zuul might request a node for a job by submitting a
"get-nodes" job with the node type (e.g., "precise") as an argument. It
could request two nodes together (in the same AZ) by supplying more
than one node type in the same call. When complete, it could call
"put-nodes" with the node identifiers to instruct nodepool to return
them (nodepool might then delete, rebuild, etc.).
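
For illustration only, the request and response payloads for those
functions might look something like this (the field names here are
hypothetical, not a settled protocol)::

  # get-nodes request: two nodes of the same type, together in one AZ
  node-types:
    - precise
    - precise

  # get-nodes response
  nodes:
    - id: 1234
      address: 192.0.2.10
    - id: 1235
      address: 192.0.2.11

  # put-nodes request, once the job is finished
  nodes:
    - 1234
    - 1235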

This model is much more efficient for multi-node tests, where we will
no longer need to have special multinode labels. Instead the
multinode configuration can be much more ad-hoc and vary per job.

The testenv broker used by tripleo behaves somewhat in this manner
(though it only supports static sets of resources). It also has logic
to deal with the situation where Zuul might exit unexpectedly and not
return nodes (though it should strive to do so). This feature in the
broker should be added to nodepool. Additionally, nodepool should
support fully static resources (they should become just another node
type) so that it can handle the use case of the test broker.
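
Nodepool has no syntax for static resources today, so the following is
purely a sketch of how a static machine might be registered as just
another node type (the "static" key is invented for illustration)::

  ### nodepool.yaml (hypothetical)
  labels:
    - name: tripleo-testenv
      min-ready: 0
      static:
        - address: 192.0.2.20
          username: jenkins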

=================
Changes To Zuul
=================

Zuul is currently fundamentally a single-tenant application. Some
folks want to use it in a multi-tenant environment. Even within
OpenStack, we have use for multi-tenancy. OpenStack might be one
tenant, and each stackforge project might be another. Even if the big
tent discussion renders that thinking obsolete, we may still want the
kind of separation multi-tenancy can provide. The proposed
implementation is flexible enough to run Zuul completely single-tenant
with shared everything, completely multi-tenant with shared nothing, and
everything in between. Being able to adjust just how much is shared or
required, and how much can be left to individual projects, will be very
useful.

To support this, the main configuration should define tenants, and
tenants should specify config files to include. These include files
should define pipelines, jobs, and projects, all of which are
namespaced to the tenant (so different tenants may have different jobs
with the same names)::

  ### main.yaml
  - tenant:
      name: openstack
      include:
        - global_config.yaml
        - openstack.yaml

Files may be included by more than one tenant, so common items can be
placed in a common file and referenced globally. This means that for,
e.g., OpenStack, we can define pipelines and our base job definitions
(with logging info, etc.) once, and include them in all of our tenants::

  ### main.yaml (continued)
  - tenant:
      name: openstack-infra
      include:
        - global_config.yaml
        - infra.yaml
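
For contrast, a hypothetical tenant that shares nothing with the others
would simply include only files of its own (the file names below are
made up for illustration)::

  ### main.yaml (continued)
  - tenant:
      name: isolated-tenant
      include:
        - isolated_pipelines.yaml  # this tenant's own pipelines and base jobs
        - isolated_projects.yaml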

A tenant may optionally specify repos from which it may derive its
configuration. In this manner, a repo may keep its Zuul configuration
within its own repo. This would only happen if the main configuration
file specified that it is permitted::

  ### main.yaml (continued)
  - tenant:
      name: random-stackforge-project
      include:
        - global_config.yaml
      repos:
        - stackforge/random  # Specific project config is in-repo

Jobs defined in-repo may not have access to the full feature set
(including some authorization features). They also may not override
existing jobs.

Job definitions continue to have the features in the current Zuul
layout, but they also take on some of the responsibilities currently
handled by the Jenkins (or other worker) definition::

  ### global_config.yaml
  # Every tenant in the system has access to these jobs (because their
  # tenant definition includes it).
  - job:
      name: base
      timeout: 30m
      node: precise  # Just a variable for later use
      nodes:  # The operative list of nodes
        - name: controller
          image: {node}  # Substitute the variable
      auth:  # Auth may only be defined in central config, not in-repo
        swift:
          - container: logs
      pre-run:  # These specify what to run before and after the job
        - zuul-cloner
      post-run:
        - archive-logs

Jobs have inheritance, and the above definition provides a base level
of functionality for all jobs. It sets a default timeout, requests a
single node (of type precise), and requests swift credentials to
upload logs. Further jobs may extend and override these parameters::

  ### global_config.yaml (continued)
  # The python 2.7 unit test job
  - job:
      name: python27
      parent: base
      node: trusty

Our use of job names specific to projects is a holdover from when we
wanted long-lived slaves on jenkins to efficiently re-use workspaces.
This hasn't been necessary for a while, though we have used this to
our advantage when collecting stats and reports. However, job
configuration can be simplified greatly if we simply have a job that
runs the python 2.7 unit tests and can be used for any project. To
the degree that we want to know how often this job failed on nova, we
can add that information back in when reporting statistics. Jobs may
have multiple aspects to accommodate differences among branches, etc.::

  ### global_config.yaml (continued)
  # Version that is run for changes on stable/icehouse
  - job:
      name: python27
      parent: base
      branch: stable/icehouse
      node: precise

  # Version that is run for changes on stable/juno
  - job:
      name: python27
      parent: base
      branch: stable/juno  # Could be combined into previous with regex
      node: precise        # if concept of "best match" is defined

Jobs may specify that they require more than one node::

  ### global_config.yaml (continued)
  - job:
      name: devstack-multinode
      parent: base
      node: trusty  # could do same branch mapping as above
      nodes:
        - name: controller
          image: {node}
        - name: compute
          image: {node}

Jobs defined centrally (i.e., not in-repo) may specify auth info::

  ### global_config.yaml (continued)
  - job:
      name: pypi-upload
      parent: base
      auth:
        password:
          pypi-password: pypi-password
          # This looks up 'pypi-password' from an encrypted yaml file
          # and adds it into variables for the job

Pipeline definitions are similar to the current syntax, except that they
support specifying additional information for jobs in the context of
a given project and pipeline. For instance, rather than specifying
that a job is globally non-voting, you may specify that it is
non-voting for a given project in a given pipeline::

  ### openstack.yaml
  - project:
      name: openstack/nova
      gate:
        queue: integrated  # Shared queues are manually built
        jobs:
          - python27  # Runs version of job appropriate to branch
          - devstack
          - devstack-deprecated-feature:
              branch: stable/juno  # Only run on stable/juno changes
              voting: false  # Non-voting
      post:
        jobs:
          - tarball:
              jobs:
                - pypi-upload

Currently, unique job names are used to build shared change queues.
Since job names will no longer be unique, shared queues must be
manually constructed by assigning them a name. Projects with the same
queue name for the same pipeline will have a shared queue.
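
For example, another project that names the same queue in its gate
pipeline would share a change queue with nova (a sketch using the
syntax above)::

  ### openstack.yaml (continued)
  - project:
      name: openstack/neutron
      gate:
        queue: integrated  # same queue name as nova, so a shared queue
        jobs:
          - python27
          - devstack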

A subset of functionality is available to projects that are permitted to
use in-repo configuration::

  ### stackforge/random/.zuul.yaml
  - job:
      name: random-job
      parent: base  # From global config; gets us logs
      node: precise

  - project:
      name: stackforge/random
      gate:
        jobs:
          - python27  # From global config
          - random-job  # From local config

The executable content of jobs should be defined as ansible playbooks.
Playbooks can be fairly simple and might consist of little more than
"run this shell script" for those who are not otherwise interested in
ansible::

  ### stackforge/random/playbooks/random-job.yaml
  ---
  - hosts: controller
    tasks:
      - shell: run_some_tests.sh

Global jobs may define ansible roles for common functions::

  ### openstack-infra/zuul-playbooks/python27.yaml
  ---
  - hosts: controller
    roles:
      - role: tox
        env: py27

Because ansible has well-articulated multi-node orchestration
features, this permits very expressive job definitions for multi-node
tests. A playbook can specify different roles to apply to the
different nodes that the job requested::

  ### openstack-infra/zuul-playbooks/devstack-multinode.yaml
  ---
  - hosts: controller
    roles:
      - devstack

  - hosts: compute
    roles:
      - devstack-compute

Additionally, if a project is already defining ansible roles for its
deployment, then those roles may be easily applied in testing, making
CI even closer to CD. Finally, to make Zuul more useful for CD, Zuul
may be configured to run a job (i.e., an ansible role) on a specific node.
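
For instance, a deployment job might be pinned to a static node type
with something like the following (an entirely hypothetical example
built from the syntax above)::

  ### global_config.yaml (hypothetical continuation)
  - job:
      name: deploy-production
      parent: base
      nodes:
        - name: target
          image: prod-server  # a static resource registered with nodepool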

The pre- and post-run entries in the job definition might also apply
to ansible playbooks and can be used to simplify job setup and
cleanup::

  ### openstack-infra/zuul-playbooks/zuul-cloner.yaml
  ---
  - hosts: all
    roles:
      - role: zuul-cloner
        zuul: "{{ zuul }}"

Here the zuul variable is a dictionary containing all the information
currently transmitted in the ZUUL_* environment variables. Similarly,
the log archiving script can copy logs from the host to swift.
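
As a rough illustration, the zuul dictionary might carry entries
corresponding to today's ZUUL_* parameters (the exact keys and values
here are invented)::

  zuul:
    project: stackforge/random
    branch: master
    pipeline: gate
    change: '12345'
    patchset: '2'
    ref: refs/zuul/master/Zabcdef
    url: http://zuul.example.org/p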

A new Zuul component would be created to execute jobs. Rather than
running a worker process on each node (which requires installing
software on the test node, establishing and maintaining network
connectivity back to Zuul, and coordinating actions across nodes for
multi-node tests), this new component will accept jobs from
Zuul, and for each one, write an ansible inventory file with the node
and variable information, and then execute the ansible playbook for that
job. This means that the new Zuul component will maintain ssh
connections to all hosts currently running a job. This could become a
bottleneck, but ansible and ssh have been known to scale to a large
number of simultaneous hosts, and this component may be scaled
horizontally. It should be simple enough that it could even be
automatically scaled if needed. In turn, however, this does make node
configuration simpler (test nodes need only have an ssh public key
installed) and makes tests behave more like deployment.
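
To make the mechanics concrete, the inventory written for the
devstack-multinode job above might be as small as this (a sketch only;
the addresses are whatever nodepool handed back)::

  controller ansible_ssh_host=192.0.2.10
  compute ansible_ssh_host=192.0.2.11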