[kolla] Virtual PTG summary
mark at stackhpc.com
Fri May 31 18:21:27 UTC 2019
This week we held a virtual PTG. The etherpad includes the agenda
and notes. Here is my summary of what was discussed, the decisions
made and the actions agreed.
We started the sessions with a visitor from TripleO - Tengu - who is
working on validations.
Validations are a way of checking a control plane at various points in
its life cycle, usually to check that some operation was completed
successfully. Tengu described the project and explained how there was
interest in making it a more generic framework that could be used by
kolla-ansible and other deployment projects. On our side there was no
objection but it would require an owner. I offered to help Tengu with
a PoC to see how it would work.
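As a rough sketch of what such a life-cycle validation might look like (the function name and any URL passed to it are illustrative only, not part of the TripleO framework):

```python
import urllib.request
import urllib.error

def validate_endpoint(url, timeout=5):
    """Toy validation: pass if the service endpoint responds (HTTP 2xx/3xx)."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except (urllib.error.URLError, OSError):
        return False
```

A real framework would collect many such checks and run them at defined points in the control plane's life cycle, reporting which passed and which failed.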
Next we moved onto general project matters.
We've seen a slow decline in contribution over the past few cycles, as
have many other projects. Part of this is natural for a more mature
project that now mostly does what is required of it. Still, it's good
to reflect on sustainability, and plan ahead for leaner times. As a
project we have a huge support matrix, with hundreds of container
images, multiple OS distributions, different CPU architectures and a
plethora of configuration options. We decided that a good starting
point would be to define some categories of support:
* images we maintain
* images we more or less care about
* the rest (once broken we may or may not work on fixing)
The wording may need some work :) We can extend this also to OS
distros, CPU arches, service deployment, and features. A key factor
here will be the level of test coverage. We would document this on the
wiki or docs.o.o. Ultimately this is all best effort, given that we're
an open source project with a community of volunteers.
One way we discussed to improve sustainability is trimming the
support matrix. Candidates for removal include:
* OracleLinux - this is small but non-zero maintenance overhead. It's
a candidate because we don't see people using it and Oracle
maintainers left the project.
* Debian binary images - not seeing many users of these.
* Ceph - this is well used but does require maintenance. We support
integration with external ceph clusters, and have discussed switching
to recommending ceph-ansible, with a documented, automated and tested
migration path.
* kolla-cli - this project was added a few cycles ago, then the
maintainers left the community. We haven't released it, CI is broken,
and we have heard no complaints.
For features which do not require much code we could disable testing,
and move them to an explicit 'unmaintained' status. Alternatively, we
could deprecate then remove them. I will follow up with emails asking
for community feedback on the impact of removal.
## Core team
We're looking out for potential core reviewers. If you'd like to join
the team, speak to one of us and we can help you on the path. The main
factor here is quality, thoughtful reviews.
Recently attendance at IRC meetings has been low. We decided to move
the meetings to #openstack-kolla to try to get more people involved.
Kayobe seeks to become an official project. This could be under kolla
project governance, or as a separate project - the main deciding
factor will be the preference of the kolla team. I will send an email
about this.
I would like us to agree as a team on some priorities for the Train
cycle. I will send out a separate mail about this.
# Kolla (images)
## Python 3 images
In the Train cycle all projects should be moving to support python 3
only. We therefore need to build images based on python 3. hrw has
been working on this for Ubuntu and Debian source images.
We will also need to switch to python 3 for CentOS/RHEL source images.
This work needs an owner. A related issue is that of CentOS 8, which
will support only python 3. I will follow up with a mail about this.
Binary images (RPM/deb) depend on the distros and their plans for python 3.
There is some python 3 work in kolla-ansible, to ensure we can execute
ansible via python 3 both locally and on remote hosts.
## Health checks
TripleO builds some health checks into their Docker images. This
allows Docker to restart a service if it is deemed to be unhealthy. We
agreed it would be nice to see support for this in kolla but did not
discuss in depth.
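For illustration, Docker's HEALTHCHECK runs a command inside the container and treats exit code 0 as healthy and non-zero as unhealthy. A minimal check script might look like the sketch below; the port and path are invented, not what TripleO actually ships:

```python
"""Toy container health check: exit status 0 means healthy, 1 unhealthy."""
import urllib.request
import urllib.error

def check(url="http://localhost:8080/healthz", timeout=5):
    """Return 0 (healthy) if the service answers, 1 (unhealthy) otherwise."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return 0
    except (urllib.error.URLError, OSError):
        return 1

# A Dockerfile could wire this in roughly as:
# HEALTHCHECK --interval=30s --timeout=5s \
#     CMD ["python3", "-c", "import healthcheck, sys; sys.exit(healthcheck.check())"]
```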
## Ansible version upgrade
Ansible moves on, and so must we. The version of ansible in
kolla_toolbox is now rather old (2.2), and our minimum ansible version
for kolla-ansible (2.5) is also getting old. It is likely we will need
to update some of the custom ansible modules in order to do this. We
may be able to replace others with modules from Ansible (e.g.
kolla_keystone_user, kolla_keystone_service). mnasiadka offered to
pick this up after the Ceph Nautilus upgrade.
## Fluentd upgrade
The fluentd service needs an update - we are using 0.12.something.
Needs an owner.
## Machine readable image config
The issues this would solve have now mostly been fixed in
tools/version-check.py, so we'll probably leave it.
## Buildah support
At the Denver summit there was some interest in support for buildah as
an alternative to docker for building images. I am led to believe
TripleO already does this, so I will ask how they do it.
# Kolla Ansible
## Test coverage
We made some good progress on improving CI test coverage during the
Stein cycle, adding these jobs:
* Cinder LVM
* Scenario NFV
* Major version upgrades
During the Train cycle we aim to add these:
* MariaDB: https://review.opendev.org/655663
* Monasca: https://review.opendev.org/649893 (WIP)
* Ironic: https://review.opendev.org/568829
* Ceph upgrade: https://review.opendev.org/658132
* Tempest: https://review.opendev.org/402122 (WIP)
Other candidates include magnum, octavia and prometheus.
## Nova cells v2
We had a long discussion about nova cells v2. Thanks to jroll and
johnthetubaguy for getting involved. We generally agreed that this
should be part of a wider assessment of scalability in kolla-ansible,
although I was keen to treat different aspects of this separately.
There are a number of ways to approach a multi-cell deployment,
particularly in relation to how the per-cell infrastructure (DB, MQ,
nova-conductor) are deployed. We discussed building a flexible
mechanism for stamping out this infrastructure to arbitrary locations,
then being able to point services in each cell at a given instance of
that infrastructure.
In an effort to make this as simple to use for deployers as possible,
we discussed defining a reference large scale cloud architecture. Our
first pass at this is as follows:
* API controllers x3+: APIs, super conductors, Galera, RabbitMQ, Keystone, etc
* Cell controller clusters x3+: cell conductors, Galera, RabbitMQ,
Glance, maybe neutron? (or one per "failure domain / AZ")
* Cell computes: nova compute, neutron agents
We would aim to add a CI job based on a two cell cloud using this architecture.
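To make the idea of pointing each cell's services at its own infrastructure concrete, here is a toy sketch. All hostnames and settings are invented, and in kolla-ansible this mapping would live in the inventory and group variables rather than in Python:

```python
# Toy model of per-cell infrastructure (DB, MQ) stamped out per cell.
# All names are hypothetical, for illustration only.
CELLS = {
    "cell1": {"db": "mariadb.cell1.example", "mq": "rabbit.cell1.example"},
    "cell2": {"db": "mariadb.cell2.example", "mq": "rabbit.cell2.example"},
}

def conductor_config(cell):
    """Return the (hypothetical) connection settings a cell conductor needs."""
    infra = CELLS[cell]
    return {
        "database_connection": f"mysql+pymysql://nova@{infra['db']}/nova_{cell}",
        "transport_url": f"rabbit://nova@{infra['mq']}:5672/",
    }
```

The point is that each cell's conductor and computes are configured against that cell's own database and message queue, while the API-level services talk to the top-level (super conductor) infrastructure.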
We moved on to operations in a multi-cell world. It should be easy to
add API controllers, cell controllers and compute nodes. We would want
the ability to operate on the cloud as a whole, or on individual
cells.
Upgrades seem likely to introduce challenges, particularly if we want
to upgrade cells individually. It's likely CERN and others have some
good experience we could benefit from here.
The spec needs an update based on this discussion, but comments
there are welcome.
We ran out of time before discussing the other kolla-ansible issues.
Please update the etherpad if you have thoughts on any of these.
Thanks to everyone who attended the kolla PTG. A conference call is
certainly more challenging than a face to face discussion, but it was
also nice to not have to fly people around the world for a few design
discussions. I feel we did make progress on a number of important
issues. If anyone has feedback on how we could improve next time,
please get in touch.