<div dir="ltr"><span style="font-size:12.7272720336914px">Thanks Jim.  This makes a lot of sense and will hopefully make things simpler and more robust.  </span><div style="font-size:12.7272720336914px"><br></div><div style="font-size:12.7272720336914px">Just a few questions:<div><div>1.  It looks like Zuul can request a specific set of nodes for a job.  Do you envision the typical Ansible playbook installing the additional things a job requires, or would Zuul always need to request a node that is already suitable for the job?</div><div>2. Would there be a way to share environment variables across multiple shell tasks?  For example, would it be possible to reference a variable defined in the job YAML file from inside a shell script? </div><div><br></div><div>-Khai</div></div><div><br></div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Feb 26, 2015 at 8:59 AM, James E. Blair <span dir="ltr"><<a href="mailto:corvus@inaugust.com" target="_blank">corvus@inaugust.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

I've been wanting to make some structural changes to Zuul to round it<br>

out into a coherent system.  I don't want to change it too much, but I'd<br>

also like a clean break with some of the baggage we've been carrying<br>

around from earlier decisions, and I want it to be able to continue to<br>

scale up (the config in particular is getting hard to manage with >500<br>

projects).<br>

<br>

I've batted a few ideas around with Monty, and I've written up my<br>

thoughts below.  This is mostly a narrative exploration of what I think<br>

it should look like.  This is not exhaustive, but I think it explores<br>

most of the major ideas.  The next step is to turn this into a spec and<br>

start iterating on it and getting more detailed.<br>

<br>

I'm posting this here first for discussion to see if there are any<br>

major conceptual things that we should address before we get into more<br>

detailed spec review.  Please let me know what you think.<br>

<br>

-Jim<br>

<br>

=======<br>

 Goals<br>

=======<br>

<br>

Make Zuul scale to thousands of projects.<br>

Make Zuul more multi-tenant friendly.<br>

Make it easier to express complex scenarios in layout.<br>

Make nodepool more useful for non virtual nodes.<br>

Make nodepool more efficient for multi-node tests.<br>

Remove need for long-running slaves.<br>

Make it easier to use Zuul for continuous deployment.<br>

<br>

To accomplish this, changes to Zuul's configuration syntax are<br>

proposed, making it simpler to manage large numbers of jobs and<br>

projects, along with a new method of describing and running jobs, and<br>

a new system for node distribution with Nodepool.<br>

<br>

=====================<br>

 Changes To Nodepool<br>

=====================<br>

<br>

Nodepool should be made to support explicit node requests and<br>

releases.  That is to say, it should act more like its name -- a node<br>

pool.<br>

<br>

Rather than having servers add themselves to the pool by waiting for<br>

them (or Jenkins on their behalf) to register with gearman, nodepool<br>

should instead define functions to supply nodes on demand.  For<br>

example it might define the gearman functions "get-nodes" and<br>

"put-nodes".  Zuul might request a node for a job by submitting a<br>

"get-nodes" job with the node type (eg "precise") as an argument.  It<br>

could request two nodes together (in the same AZ) by supplying more<br>

than one node type in the same call.  When complete, it could call<br>

"put-nodes" with the node identifiers to instruct nodepool to return<br>

them (nodepool might then delete, rebuild, etc).<br>
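
<br>
As a rough illustration of the request and release cycle described above,<br>
here is a minimal Python sketch.  Only the function names "get-nodes" and<br>
"put-nodes" come from the proposal; the NodePool class, its in-memory<br>
counters and the generated node identifiers are assumptions made for the<br>
example, and the gearman transport is left out entirely.<br>
<pre>
```python
import itertools


class NodePool:
    """Toy in-memory pool; a real nodepool would boot and delete servers."""

    def __init__(self, available):
        # available maps a node type (e.g. "precise") to a count of ready nodes
        self.available = dict(available)
        self.in_use = {}
        self._ids = itertools.count(1)

    def get_nodes(self, node_types):
        """Hand out one node per requested type, or none at all."""
        wanted = {}
        for node_type in node_types:
            wanted[node_type] = wanted.get(node_type, 0) + 1
        # Check the whole request first so a multi-node request is atomic.
        for node_type, count in wanted.items():
            if count > self.available.get(node_type, 0):
                raise LookupError("not enough %s nodes" % node_type)
        nodes = []
        for node_type in node_types:
            self.available[node_type] -= 1
            node_id = "%s-%d" % (node_type, next(self._ids))
            self.in_use[node_id] = node_type
            nodes.append(node_id)
        return nodes

    def put_nodes(self, node_ids):
        """Return nodes to the pool (a real pool might delete or rebuild)."""
        for node_id in node_ids:
            node_type = self.in_use.pop(node_id)
            self.available[node_type] += 1


pool = NodePool({"precise": 1, "trusty": 2})
pair = pool.get_nodes(["trusty", "trusty"])  # two nodes in one call
pool.put_nodes(pair)
```
</pre>
Checking the whole request before handing anything out is one way to make<br>
sure a multi-node job never holds half of the nodes it asked for.<br>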

<br>

This model is much more efficient for multi-node tests, where we will<br>

no longer need to have special multinode labels.  Instead the<br>

multinode configuration can be much more ad-hoc and vary per job.<br>

<br>

The testenv broker used by tripleo behaves somewhat in this manner<br>

(though it only supports static sets of resources).  It also has logic<br>

to deal with the situation where Zuul might exit unexpectedly and not<br>

return nodes (though it should strive to do so).  This feature in the<br>

broker should be added to nodepool.  Additionally, nodepool should<br>

support fully static resources (they should become just another node<br>

type) so that it can handle the use case of the test broker.<br>
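
<br>
The proposal does not say how the broker recovers nodes after Zuul exits<br>
unexpectedly.  One common approach, shown here purely as an assumption, is<br>
a lease that the holder must keep renewing.  LeaseTable, its method names<br>
and the holder names are hypothetical, and time is passed in explicitly to<br>
keep the sketch deterministic.<br>
<pre>
```python
class LeaseTable:
    """Tracks which requester holds which nodes, with an expiry time."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.leases = {}  # node id -> (holder, expires_at)

    def grant(self, node_id, holder, now):
        self.leases[node_id] = (holder, now + self.ttl)

    def renew(self, holder, now):
        """A live holder periodically extends all of its leases."""
        for node_id, (owner, _) in list(self.leases.items()):
            if owner == holder:
                self.leases[node_id] = (owner, now + self.ttl)

    def reclaim_expired(self, now):
        """Return node ids whose holder stopped renewing."""
        expired = sorted(n for n, (_, exp) in self.leases.items() if now >= exp)
        for node_id in expired:
            del self.leases[node_id]
        return expired


table = LeaseTable(ttl_seconds=60)
table.grant("trusty-1", "zuul-a", now=0)
table.grant("trusty-2", "zuul-b", now=0)
table.renew("zuul-a", now=50)          # zuul-a is still alive
reclaimed = table.reclaim_expired(now=70)  # zuul-b never renewed
```
</pre>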

<br>

=================<br>

 Changes To Zuul<br>

=================<br>

<br>

Zuul is currently fundamentally a single-tenant application.  Some<br>

folks want to use it in a multi-tenant environment.  Even within<br>

OpenStack, we have use for multitenancy.  OpenStack might be one<br>

tenant, and each stackforge project might be another.  Even if the big<br>

tent discussion renders that thinking obsolete, we may still want the<br>

kind of separation multi-tenancy can provide.  The proposed<br>

implementation is flexible enough to run Zuul completely single tenant<br>

with shared everything, completely multi-tenant with shared nothing, and<br>

everything in-between.  Being able to adjust just how much is shared or<br>

required, and how much can be left to individual projects will be very<br>

useful.<br>

<br>

To support this, the main configuration should define tenants, and<br>

tenants should specify config files to include.  These include files<br>

should define pipelines, jobs, and projects, all of which are<br>

namespaced to the tenant (so different tenants may have different jobs<br>

with the same names)::<br>

<br>

  ### main.yaml<br>

  - tenant:<br>

      name: openstack<br>

      include:<br>

        - global_config.yaml<br>

        - openstack.yaml<br>

<br>

Files may be included by more than one tenant, so common items can be<br>

placed in a common file and referenced globally.  This means that for,<br>

eg, OpenStack, we can define pipelines and our base job definitions<br>

(with logging info, etc) once, and include them in all of our tenants::<br>

<br>

  ### main.yaml (continued)<br>

  - tenant:<br>

      name: openstack-infra<br>

      include:<br>

        - global_config.yaml<br>

        - infra.yaml<br>
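
<br>
To make the namespacing concrete, here is a small Python sketch of how<br>
include lists might be turned into per-tenant job tables.  The function<br>
load_tenants, the "docs" job and the use of already-parsed dictionaries in<br>
place of YAML files are all assumptions for illustration, not part of the<br>
proposed implementation.<br>
<pre>
```python
def load_tenants(main, files):
    """Build {tenant name: {job name: job}} from tenant include lists."""
    tenants = {}
    for entry in main:
        tenant = entry["tenant"]
        jobs = {}
        for filename in tenant["include"]:
            for item in files[filename]:
                if "job" in item:
                    # Copy so that tenants never share mutable job state.
                    jobs[item["job"]["name"]] = dict(item["job"])
        tenants[tenant["name"]] = jobs
    return tenants


# Stand-ins for the parsed contents of each include file.
files = {
    "global_config.yaml": [{"job": {"name": "base", "timeout": "30m"}}],
    "openstack.yaml": [{"job": {"name": "docs", "node": "trusty"}}],
    "infra.yaml": [{"job": {"name": "docs", "node": "precise"}}],
}
main = [
    {"tenant": {"name": "openstack",
                "include": ["global_config.yaml", "openstack.yaml"]}},
    {"tenant": {"name": "openstack-infra",
                "include": ["global_config.yaml", "infra.yaml"]}},
]
tenants = load_tenants(main, files)
```
</pre>
Both tenants see the shared "base" job, while each gets its own "docs" job<br>
under the same name.<br>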

<br>

A tenant may optionally specify repos from which it may derive its<br>

configuration.  In this manner, a repo may keep its Zuul configuration<br>

within its own repo.  This would only happen if the main configuration<br>

file specified that it is permitted::<br>

<br>

  ### main.yaml (continued)<br>

  - tenant:<br>

      name: random-stackforge-project<br>

      include:<br>

        - global_config.yaml<br>

      repos:<br>

        - stackforge/random  # Specific project config is in-repo<br>

<br>

Jobs defined in-repo may not have access to the full feature set<br>

(including some authorization features).  They also may not override<br>

existing jobs.<br>

<br>

Job definitions continue to have the features in the current Zuul<br>

layout, but they also take on some of the responsibilities currently<br>

handled by the Jenkins (or other worker) definition::<br>

<br>

  ### global_config.yaml<br>

  # Every tenant in the system has access to these jobs (because their<br>

  # tenant definition includes it).<br>

  - job:<br>

      name: base<br>

      timeout: 30m<br>

      node: precise   # Just a variable for later use<br>

      nodes:  # The operative list of nodes<br>

        - name: controller<br>

          image: {node}  # Substitute the variable<br>

      auth:  # Auth may only be defined in central config, not in-repo<br>

        swift:<br>

          - container: logs<br>

      pre-run:  # These specify what to run before and after the job<br>

        - zuul-cloner<br>

      post-run:<br>

        - archive-logs<br>

<br>

Jobs have inheritance, and the above definition provides a base level<br>

of functionality for all jobs.  It sets a default timeout, requests a<br>

single node (of type precise), and requests swift credentials to<br>

upload logs.  Further jobs may extend and override these parameters::<br>

<br>

  ### global_config.yaml (continued)<br>

  # The python 2.7 unit test job<br>

  - job:<br>

      name: python27<br>

      parent: base<br>

      node: trusty<br>
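
<br>
The inheritance and variable substitution above could be resolved roughly<br>
as in the following Python sketch.  The resolve function and the choice to<br>
substitute variables only after merging are assumptions; the proposal does<br>
not specify the order.  The sketch also assumes the parent chain has no<br>
cycles.<br>
<pre>
```python
def resolve(job_name, jobs):
    """Merge a job with its ancestors, child values winning."""
    chain = []
    name = job_name
    while name is not None:
        job = jobs[name]
        chain.append(job)
        name = job.get("parent")
    merged = {}
    for job in reversed(chain):  # apply the oldest ancestor first
        merged.update(job)
    merged.pop("parent", None)
    # Substitute "{node}" style variables into the node list last,
    # so that a child overriding "node" changes the inherited image.
    variables = {"node": merged.get("node")}
    merged["nodes"] = [
        dict(n, image=n["image"].format(**variables)) for n in merged["nodes"]
    ]
    return merged


jobs = {
    "base": {
        "name": "base",
        "timeout": "30m",
        "node": "precise",
        "nodes": [{"name": "controller", "image": "{node}"}],
    },
    "python27": {"name": "python27", "parent": "base", "node": "trusty"},
}
resolved = resolve("python27", jobs)
```
</pre>
Here python27 inherits the timeout and node list from base, but its single<br>
controller node ends up with the trusty image.<br>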

<br>

Our use of job names specific to projects is a holdover from when we<br>

wanted long-lived slaves on Jenkins to efficiently re-use workspaces.<br>

This hasn't been necessary for a while, though we have used this to<br>

our advantage when collecting stats and reports.  However, job<br>

configuration can be simplified greatly if we simply have a job that<br>

runs the python 2.7 unit tests which can be used for any project.  To<br>

the degree that we want to know how often this job failed on nova, we<br>

can add that information back in when reporting statistics.  Jobs may<br>

have multiple aspects to accommodate differences among branches, etc.::<br>

<br>

  ### global_config.yaml (continued)<br>

  # Version that is run for changes on stable/icehouse<br>

  - job:<br>

      name: python27<br>

      parent: base<br>

      branch: stable/icehouse<br>

      node: precise<br>

<br>

  # Version that is run for changes on stable/juno<br>

  - job:<br>

      name: python27<br>

      parent: base<br>

      branch: stable/juno  # Could be combined into previous with regex<br>

      node: precise        # if concept of "best match" is defined<br>
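
<br>
The "best match" idea mentioned in the comments above is not defined in<br>
the proposal.  The Python sketch below shows one possible interpretation,<br>
with the two stable-branch variants combined into a single regex.  The<br>
select_variant function and the scoring rule (longest matching pattern<br>
wins, a variant with no branch matches anything) are assumptions.<br>
<pre>
```python
import re


def select_variant(variants, branch):
    """Pick the most specific variant whose branch pattern matches."""
    best = None
    for variant in variants:
        pattern = variant.get("branch")
        if pattern is None:
            score = 0  # a variant without a branch matches everything
        elif re.fullmatch(pattern, branch):
            score = len(pattern)  # crude stand-in for "best match"
        else:
            continue
        if best is None or score > best[0]:
            best = (score, variant)
    return None if best is None else best[1]


variants = [
    {"name": "python27", "node": "trusty"},
    {"name": "python27", "branch": r"stable/(icehouse|juno)", "node": "precise"},
]
juno = select_variant(variants, "stable/juno")
master = select_variant(variants, "master")
```
</pre>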

<br>

Jobs may specify that they require more than one node::<br>

<br>

  ### global_config.yaml (continued)<br>

  - job:<br>

      name: devstack-multinode<br>

      parent: base<br>

      node: trusty  # could do same branch mapping as above<br>

      nodes:<br>

        - name: controller<br>

          image: {node}<br>

        - name: compute<br>

          image: {node}<br>

<br>

Jobs defined centrally (i.e., not in-repo) may specify auth info::<br>

<br>

  ### global_config.yaml (continued)<br>

  - job:<br>

      name: pypi-upload<br>

      parent: base<br>

      auth:<br>

        password:<br>

          pypi-password: pypi-password<br>

          # This looks up 'pypi-password' from an encrypted yaml file<br>

          # and adds it into variables for the job<br>

<br>

Pipeline definitions are similar to the current syntax, except that they<br>

support specifying additional information for jobs in the context of<br>

a given project and pipeline.  For instance, rather than specifying<br>

that a job is globally non-voting, you may specify that it is<br>

non-voting for a given project in a given pipeline::<br>

<br>

  ### openstack.yaml<br>

  - project:<br>

      name: openstack/nova<br>

      gate:<br>

        queue: integrated  # Shared queues are manually built<br>

        jobs:<br>

          - python27  # Runs version of job appropriate to branch<br>

          - devstack<br>

          - devstack-deprecated-feature:<br>

              branch: stable/juno  # Only run on stable/juno changes<br>

              voting: false  # Non-voting<br>

      post:<br>

        jobs:<br>

          - tarball:<br>

              jobs:<br>

                - pypi-upload<br>

<br>

Currently, unique job names are used to build shared change queues.<br>

Since job names will no longer be unique, shared queues must be<br>

manually constructed by assigning them a name.  Projects with the same<br>

queue name for the same pipeline will have a shared queue.<br>
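
<br>
A minimal Python sketch of that grouping rule follows.  The function<br>
build_shared_queues, the flattened project dictionaries and the<br>
openstack/cinder entry are hypothetical, and giving a project without a<br>
queue name a queue of its own is an assumption the proposal does not<br>
spell out.<br>
<pre>
```python
from collections import defaultdict


def build_shared_queues(projects):
    """Group projects by (pipeline, queue name)."""
    queues = defaultdict(list)
    for project in projects:
        for pipeline, config in project["pipelines"].items():
            # Assumption: a project with no queue name gets its own queue.
            queue = config.get("queue", project["name"])
            queues[(pipeline, queue)].append(project["name"])
    return dict(queues)


projects = [
    {"name": "openstack/nova", "pipelines": {"gate": {"queue": "integrated"}}},
    {"name": "openstack/cinder", "pipelines": {"gate": {"queue": "integrated"}}},
    {"name": "stackforge/random", "pipelines": {"gate": {}}},
]
queues = build_shared_queues(projects)
```
</pre>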

<br>

A subset of functionality is available to projects that are permitted to<br>

use in-repo configuration::<br>

<br>

  ### stackforge/random/.zuul.yaml<br>

  - job:<br>

      name: random-job<br>

      parent: base      # From global config; gets us logs<br>

      node: precise<br>

<br>

  - project:<br>

      name: stackforge/random<br>

      gate:<br>

        jobs:<br>

          - python27    # From global config<br>

          - random-job  # From local config<br>

<br>

The executable content of jobs should be defined as ansible playbooks.<br>

Playbooks can be fairly simple and might consist of little more than<br>

"run this shell script" for those who are not otherwise interested in<br>

ansible::<br>

<br>

  ### stackforge/random/playbooks/random-job.yaml<br>

  ---<br>

  - hosts: controller<br>

    tasks:<br>

      - shell: run_some_tests.sh<br>

<br>

Global jobs may define ansible roles for common functions::<br>

<br>

  ### openstack-infra/zuul-playbooks/python27.yaml<br>

  ---<br>

  - hosts: controller<br>

    roles:<br>

      - role: tox<br>

        env: py27<br>

<br>

Because ansible has well-articulated multi-node orchestration<br>

features, this permits very expressive job definitions for multi-node<br>

tests.  A playbook can specify different roles to apply to the<br>

different nodes that the job requested::<br>

<br>

  ### openstack-infra/zuul-playbooks/devstack-multinode.yaml<br>

  ---<br>

  - hosts: controller<br>

    roles:<br>

      - devstack<br>

  - hosts: compute<br>

    roles:<br>

      - devstack-compute<br>

<br>

Additionally, if a project is already defining ansible roles for its<br>

deployment, then those roles may be easily applied in testing, making<br>

CI even closer to CD.  Finally, to make Zuul more useful for CD, Zuul<br>

may be configured to run a job (ie, ansible role) on a specific node.<br>

<br>

The pre- and post-run entries in the job definition might also apply<br>

to ansible playbooks and can be used to simplify job setup and<br>

cleanup::<br>

<br>

  ### openstack-infra/zuul-playbooks/zuul-cloner.yaml<br>

  ---<br>

  - hosts: all<br>

    roles:<br>

      - zuul-cloner: "{{ zuul }}"<br>

<br>

Where the zuul variable is a dictionary containing all the information<br>

currently transmitted in the ZUUL_* environment variables.  Similarly,<br>

the log archiving script can copy logs from the host to swift.<br>

<br>

A new Zuul component would be created to execute jobs.  Rather than<br>

running a worker process on each node (which requires installing<br>

software on the test node, and establishing and maintaining network<br>

connectivity back to Zuul, and the ability to coordinate actions across<br>

nodes for multi-node tests), this new component will accept jobs from<br>

Zuul, and for each one, write an ansible inventory file with the node<br>

and variable information, and then execute the ansible playbook for that<br>

job.  This means that the new Zuul component will maintain ssh<br>

connections to all hosts currently running a job.  This could become a<br>

bottleneck, but ansible and ssh have been known to scale to a large<br>

number of simultaneous hosts, and this component may be scaled<br>

horizontally.  It should be simple enough that it could even be<br>

automatically scaled if needed.  In turn, however, this does make node<br>

configuration simpler (test nodes need only have an ssh public key<br>

installed) and makes tests behave more like deployment.<br>
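
<br>
As a sketch of the inventory-writing step, the Python function below<br>
renders an INI-style ansible inventory with one group per requested node<br>
name and the job variables under [all:vars].  The function<br>
render_inventory, the node dictionary layout, the example address and the<br>
zuul_* variable names are assumptions; values are assumed to be simple<br>
strings without spaces.<br>
<pre>
```python
def render_inventory(nodes, variables):
    """Render an INI-style ansible inventory as a string."""
    lines = []
    for node in nodes:
        # One group per node name, so "hosts: controller" selects it.
        lines.append("[%s]" % node["name"])
        lines.append("%s ansible_user=%s" % (node["address"], node["user"]))
        lines.append("")
    # Variables placed here are visible to every host in the job.
    lines.append("[all:vars]")
    for key in sorted(variables):
        lines.append("%s=%s" % (key, variables[key]))
    return "\n".join(lines) + "\n"


inventory = render_inventory(
    [{"name": "controller", "address": "203.0.113.10", "user": "zuul"}],
    {"zuul_project": "openstack/nova", "zuul_branch": "master"},
)
```
</pre>
The component would write this text to a file and pass it to ansible<br>
along with the playbook for the job.<br>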

<br>

_______________________________________________<br>

OpenStack-Infra mailing list<br>

<a href="mailto:OpenStack-Infra@lists.openstack.org">OpenStack-Infra@lists.openstack.org</a><br>

<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra</a><br>

</blockquote></div><br></div>