[OpenStack-Infra] Thoughts on evolving Zuul

Monty Taylor mordred at inaugust.com
Thu Feb 26 23:04:11 UTC 2015


On 02/26/2015 05:41 PM, Zaro wrote:
> Thanks Jim.  This makes a lot of sense and will hopefully make things
> simpler and more robust.
> 
> Just a few questions:

I am not Jim - but I'm going to answer anyway ...

> 1.  It looks like zuul can request a specific set of nodes for a job.  Do
> you envision the typical ansible playbook to install additional things
> required for the jobs? or would zuul always need to request a suitable node
> for the job?

I think we're leaning towards fewer types of images - so I'd expect
playbooks to specialize an image as step one. This is, of course,
similar to what many jobs already do today, with installation steps
that run before revoke-sudo.
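
For illustration only - a playbook that specializes a generic image
might start with an install step before running the tests (the package
names and script below are made up):

  ---
  - hosts: controller
    tasks:
      # Specialize the generic image for this particular job
      - apt: name={{ item }} state=present
        sudo: yes
        with_items:
          - libxml2-dev
          - libxslt1-dev
      # Then run the actual tests
      - shell: run_some_tests.sh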

> 2. Would there be a way to share environment variables across multiple
> shell tasks?  For example would it be possible to reference a variable
> defined in the job yaml file from inside of a shell script?

Yes - although it might not be specifically environment variables.
Sharing variables from one task to another, or taking the output of
one task and referencing it in a subsequent task, is well supported.
(I can even show you examples of doing this in the recent launch_node
work if you wanna see what it looks like.)
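
Something along these lines works, for example (an untested sketch;
tox_env here stands in for a variable defined in the job yaml):

  ---
  - hosts: controller
    tasks:
      # Capture the output of one task...
      - shell: git rev-parse HEAD
        register: git_head
      # ...and reference it, along with a job-level variable, as
      # environment variables in a later shell task.
      - shell: run_some_tests.sh
        environment:
          TEST_COMMIT: "{{ git_head.stdout }}"
          TEST_ENV: "{{ tox_env }}"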

> -Khai
> 
> 
> On Thu, Feb 26, 2015 at 8:59 AM, James E. Blair <corvus at inaugust.com> wrote:
> 
>> Hi,
>>
>> I've been wanting to make some structural changes to Zuul to round it
>> out into a coherent system.  I don't want to change it too much, but I'd
>> also like a clean break with some of the baggage we've been carrying
>> around from earlier decisions, and I want it to be able to continue to
>> scale up (the config in particular is getting hard to manage with >500
>> projects).
>>
>> I've batted a few ideas around with Monty, and I've written up my
>> thoughts below.  This is mostly a narrative exploration of what I think
>> it should look like.  This is not exhaustive, but I think it explores
>> most of the major ideas.  The next step is to turn this into a spec and
>> start iterating on it and getting more detailed.
>>
>> I'm posting this here first for discussion to see if there are any
>> major conceptual things that we should address before we get into more
>> detailed spec review.  Please let me know what you think.
>>
>> -Jim
>>
>> =======
>>  Goals
>> =======
>>
>> Make zuul scale to thousands of projects.
>> Make Zuul more multi-tenant friendly.
>> Make it easier to express complex scenarios in layout.
>> Make nodepool more useful for non-virtual nodes.
>> Make nodepool more efficient for multi-node tests.
>> Remove need for long-running slaves.
>> Make it easier to use Zuul for continuous deployment.
>>
>> To accomplish this, changes to Zuul's configuration syntax are
>> proposed, making it simpler to manage large numbers of jobs and
>> projects, along with a new method of describing and running jobs, and
>> a new system for node distribution with Nodepool.
>>
>> =====================
>>  Changes To Nodepool
>> =====================
>>
>> Nodepool should be made to support explicit node requests and
>> releases.  That is to say, it should act more like its name -- a node
>> pool.
>>
>> Rather than waiting for servers (or Jenkins on their behalf) to
>> register with gearman and thereby add themselves to the pool,
>> nodepool should instead define functions to supply nodes on demand.
>> For example, it might define the gearman functions "get-nodes" and
>> "put-nodes".  Zuul might request a node for a job by submitting a
>> "get-nodes" job with the node type (eg "precise") as an argument.  It
>> could request two nodes together (in the same AZ) by supplying more
>> than one node type in the same call.  When complete, it could call
>> "put-nodes" with the node identifiers to instruct nodepool to return
>> them (nodepool might then delete, rebuild, etc).
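>>
>> For illustration, the argument to such a "get-nodes" call for the
>> two-node case might look something like this (the field names here
>> are assumptions, not part of this proposal)::
>>
>>   node_types:
>>     - precise
>>     - precise
>>   same_az: true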
>>
>> This model is much more efficient for multi-node tests, where we will
>> no longer need to have special multinode labels.  Instead the
>> multinode configuration can be much more ad-hoc and vary per job.
>>
>> The testenv broker used by tripleo behaves somewhat in this manner
>> (though it only supports static sets of resources).  It also has logic
>> to deal with the situation where Zuul might exit unexpectedly and not
>> return nodes (though it should strive to do so).  This feature in the
>> broker should be added to nodepool.  Additionally, nodepool should
>> support fully static resources (they should become just another node
>> type) so that it can handle the use case of the test broker.
>>
>> =================
>>  Changes To Zuul
>> =================
>>
>> Zuul is currently fundamentally a single-tenant application.  Some
>> folks want to use it in a multi-tenant environment.  Even within
>> OpenStack, we have use for multitenancy.  OpenStack might be one
>> tenant, and each stackforge project might be another.  Even if the big
>> tent discussion renders that thinking obsolete, we may still want the
>> kind of separation multi-tenancy can provide.  The proposed
>> implementation is flexible enough to run Zuul completely single tenant
>> with shared everything, completely multi-tenant with shared nothing, and
>> everything in-between.  Being able to adjust just how much is shared or
>> required, and how much can be left to individual projects will be very
>> useful.
>>
>> To support this, the main configuration should define tenants, and
>> tenants should specify config files to include.  These include files
>> should define pipelines, jobs, and projects, all of which are
>> namespaced to the tenant (so different tenants may have different jobs
>> with the same names)::
>>
>>   ### main.yaml
>>   - tenant:
>>       name: openstack
>>       include:
>>         - global_config.yaml
>>         - openstack.yaml
>>
>> Files may be included by more than one tenant, so common items can be
>> placed in a common file and referenced globally.  This means that for,
>> eg, OpenStack, we can define pipelines and our base job definitions
>> (with logging info, etc) once, and include them in all of our tenants::
>>
>>   ### main.yaml (continued)
>>   - tenant:
>>       name: openstack-infra
>>       include:
>>         - global_config.yaml
>>         - infra.yaml
>>
>> A tenant may optionally specify repos from which it may derive its
>> configuration.  In this manner, a project may keep its Zuul
>> configuration within its own repo.  This would only happen if the
>> main configuration file specifies that it is permitted::
>>
>>   ### main.yaml (continued)
>>   - tenant:
>>       name: random-stackforge-project
>>       include:
>>         - global_config.yaml
>>       repos:
>>         - stackforge/random  # Specific project config is in-repo
>>
>> Jobs defined in-repo may not have access to the full feature set
>> (including some authorization features).  They also may not override
>> existing jobs.
>>
>> Job definitions continue to have the features in the current Zuul
>> layout, but they also take on some of the responsibilities currently
>> handled by the Jenkins (or other worker) definition::
>>
>>   ### global_config.yaml
>>   # Every tenant in the system has access to these jobs (because its
>>   # tenant definition includes this file).
>>   - job:
>>       name: base
>>       timeout: 30m
>>       node: precise   # Just a variable for later use
>>       nodes:  # The operative list of nodes
>>         - name: controller
>>           image: {node}  # Substitute the variable
>>       auth:  # Auth may only be defined in central config, not in-repo
>>         swift:
>>           - container: logs
>>       pre-run:  # These specify what to run before and after the job
>>         - zuul-cloner
>>       post-run:
>>         - archive-logs
>>
>> Jobs have inheritance, and the above definition provides a base level
>> of functionality for all jobs.  It sets a default timeout, requests a
>> single node (of type precise), and requests swift credentials to
>> upload logs.  Further jobs may extend and override these parameters::
>>
>>   ### global_config.yaml (continued)
>>   # The python 2.7 unit test job
>>   - job:
>>       name: python27
>>       parent: base
>>       node: trusty
>>
>> Our use of job names specific to projects is a holdover from when we
>> wanted long-lived slaves on jenkins to efficiently re-use workspaces.
>> This hasn't been necessary for a while, though we have used this to
>> our advantage when collecting stats and reports.  However, job
>> configuration can be simplified greatly if we simply have a job that
>> runs the python 2.7 unit tests which can be used for any project.  To
>> the degree that we want to know how often this job failed on nova, we
>> can add that information back in when reporting statistics.  Jobs may
>> have multiple aspects to accommodate differences among branches, etc.::
>>
>>   ### global_config.yaml (continued)
>>   # Version that is run for changes on stable/icehouse
>>   - job:
>>       name: python27
>>       parent: base
>>       branch: stable/icehouse
>>       node: precise
>>
>>   # Version that is run for changes on stable/juno
>>   - job:
>>       name: python27
>>       parent: base
>>       branch: stable/juno  # Could be combined into previous with regex
>>       node: precise        # if concept of "best match" is defined
>>
>> Jobs may specify that they require more than one node::
>>
>>   ### global_config.yaml (continued)
>>   - job:
>>       name: devstack-multinode
>>       parent: base
>>       node: trusty  # could do same branch mapping as above
>>       nodes:
>>         - name: controller
>>           image: {node}
>>         - name: compute
>>           image: {node}
>>
>> Jobs defined centrally (i.e., not in-repo) may specify auth info::
>>
>>   ### global_config.yaml (continued)
>>   - job:
>>       name: pypi-upload
>>       parent: base
>>       auth:
>>         password:
>>           pypi-password: pypi-password
>>           # This looks up 'pypi-password' from an encrypted yaml file
>>           # and adds it into variables for the job
>>
>> Pipeline definitions are similar to the current syntax, except that
>> they support specifying additional information for jobs in the context
>> of a given project and pipeline.  For instance, rather than specifying
>> that a job is globally non-voting, you may specify that it is
>> non-voting for a given project in a given pipeline::
>>
>>   ### openstack.yaml
>>   - project:
>>       name: openstack/nova
>>       gate:
>>         queue: integrated  # Shared queues are manually built
>>         jobs:
>>           - python27  # Runs version of job appropriate to branch
>>           - devstack
>>           - devstack-deprecated-feature:
>>               branch: stable/juno  # Only run on stable/juno changes
>>               voting: false  # Non-voting
>>       post:
>>         jobs:
>>           - tarball:
>>               jobs:
>>                 - pypi-upload
>>
>> Currently unique job names are used to build shared change queues.
>> Since job names will no longer be unique, shared queues must be
>> manually constructed by assigning them a name.  Projects with the same
>> queue name for the same pipeline will have a shared queue.
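>>
>> For example (sketch only), a second project could join nova's queue
>> simply by using the same queue name::
>>
>>   ### openstack.yaml (continued)
>>   - project:
>>       name: openstack/cinder
>>       gate:
>>         queue: integrated  # Same name as nova's; shares its change queue
>>         jobs:
>>           - python27
>>           - devstack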
>>
>> A subset of functionality is available to projects that are permitted to
>> use in-repo configuration::
>>
>>   ### stackforge/random/.zuul.yaml
>>   - job:
>>       name: random-job
>>       parent: base      # From global config; gets us logs
>>       node: precise
>>
>>   - project:
>>       name: stackforge/random
>>       gate:
>>         jobs:
>>           - python27    # From global config
>>           - random-job  # From local config
>>
>> The executable content of jobs should be defined as ansible playbooks.
>> Playbooks can be fairly simple and might consist of little more than
>> "run this shell script" for those who are not otherwise interested in
>> ansible::
>>
>>   ### stackforge/random/playbooks/random-job.yaml
>>   ---
>>   - hosts: controller
>>     tasks:
>>       - shell: run_some_tests.sh
>>
>> Global jobs may define ansible roles for common functions::
>>
>>   ### openstack-infra/zuul-playbooks/python27.yaml
>>   ---
>>   - hosts: controller
>>     roles:
>>       - role: tox
>>         env: py27
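>>
>> As a sketch only, the tox role itself might be little more than::
>>
>>   ### roles/tox/tasks/main.yaml
>>   ---
>>   - name: Run the requested tox environment
>>     shell: tox -e {{ env }}
>>
>> with the actual role layout left to the implementation.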
>>
>> Because ansible has well-articulated multi-node orchestration
>> features, this permits very expressive job definitions for multi-node
>> tests.  A playbook can specify different roles to apply to the
>> different nodes that the job requested::
>>
>>   ### openstack-infra/zuul-playbooks/devstack-multinode.yaml
>>   ---
>>   - hosts: controller
>>     roles:
>>       - devstack
>>   - hosts: compute
>>     roles:
>>       - devstack-compute
>>
>> Additionally, if a project is already defining ansible roles for its
>> deployment, then those roles may be easily applied in testing, making
>> CI even closer to CD.  Finally, to make Zuul more useful for CD, Zuul
>> may be configured to run a job (ie, ansible role) on a specific node.
>>
>> The pre- and post-run entries in the job definition might also apply
>> to ansible playbooks and can be used to simplify job setup and
>> cleanup::
>>
>>   ### openstack-infra/zuul-playbooks/zuul-cloner.yaml
>>   ---
>>   - hosts: all
>>     roles:
>>       - role: zuul-cloner
>>         zuul: "{{ zuul }}"
>>
>> Where the zuul variable is a dictionary containing all the information
>> currently transmitted in the ZUUL_* environment variables.  Similarly,
>> the log archiving script can copy logs from the host to swift.
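>>
>> As an illustration only (the exact keys are to be worked out, but
>> they would mirror today's ZUUL_* variables), the zuul dictionary
>> might look like::
>>
>>   zuul:
>>     project: openstack/nova
>>     pipeline: gate
>>     branch: master
>>     change: '12345'
>>     patchset: '2'
>>     ref: refs/zuul/master/Zxxxxxxxx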
>>
>> A new Zuul component would be created to execute jobs.  Rather than
>> running a worker process on each node (which requires installing
>> software on the test node, and establishing and maintaining network
>> connectivity back to Zuul, and the ability to coordinate actions across
>> nodes for multi-node tests), this new component will accept jobs from
>> Zuul, and for each one, write an ansible inventory file with the node
>> and variable information, and then execute the ansible playbook for that
>> job.  This means that the new Zuul component will maintain ssh
>> connections to all hosts currently running a job.  This could become a
>> bottleneck, but ansible and ssh have been known to scale to a large
>> number of simultaneous hosts, and this component may be scaled
>> horizontally.  It should be simple enough that it could even be
>> automatically scaled if needed.  In return, however, this approach
>> makes node configuration simpler (test nodes need only have an ssh
>> public key installed) and makes tests behave more like deployment.
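>>
>> As a rough sketch (the hostnames and file names here are illustrative
>> only), for the devstack-multinode job above this component might
>> write an inventory such as::
>>
>>   [controller]
>>   203.0.113.10
>>
>>   [compute]
>>   203.0.113.11
>>
>> and then invoke something along the lines of::
>>
>>   ansible-playbook -i inventory devstack-multinode.yaml \
>>       -e @job-variables.yaml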