[kolla] Virtual PTG summary
Hi,

This week we held a virtual PTG. The etherpad [1] includes the agenda and notes. Here is my summary of what was discussed, along with decisions and actions.

# Cross-project

We started the sessions with a visitor from TripleO - Tengu - who is working on validations [2]. Validations are a way of checking a control plane at various points in its life cycle, usually to check that some operation was completed successfully. Tengu described the project and explained that there is interest in making it a more generic framework that could be used by kolla ansible and other deployment projects. On our side there was no objection, but it would require an owner. I offered to help Tengu with a PoC to see how it would work.

# General

Next we moved onto general project matters.

## Sustainability

We've seen a slow decline in contribution over the past few cycles, as have many other projects. Part of this is natural for a more mature project that now mostly does what is required of it. Still, it's good to reflect on sustainability, and plan ahead for leaner times.

As a project we have a huge support matrix, with hundreds of container images, multiple OS distributions, different CPU architectures and a plethora of configuration options. We decided that a good starting point would be to define some categories of support:

* images we maintain
* images we more or less care about
* the rest (once broken we may or may not work on fixing)

The wording may need some work :) We can extend this also to OS distros, CPU arches, service deployment, and features. A key factor here will be the level of test coverage. We would document this on the wiki or docs.o.o. Ultimately this is all best effort, given that we're an open source project with a community of volunteers.

One way we discussed to improve sustainability is to trim the feature matrix:

* OracleLinux - this is small but non-zero maintenance overhead.
  It's a candidate because we don't see people using it and the Oracle maintainers left the project.
* Debian binary images - not seeing many users of these.
* Ceph - this is well used but does require maintenance. We support integration with external ceph clusters, and have discussed switching to recommending ceph-ansible, with a documented, automated and tested migration path.
* kolla-cli - this project was added a few cycles ago, then the maintainers left the community. We haven't released it, CI is broken, and we have heard no complaints.

For features which do not require much code we could disable testing, and move them to an explicit 'unmaintained' status. Alternatively, we could deprecate then remove them. I will follow up with emails asking for community feedback on the impact of removal.

## Core team

We're looking out for potential core reviewers. If you'd like to join the team, speak to one of us and we can help you on the path. The main factor here is quality, thoughtful reviews.

## Meetings

Recently attendance at IRC meetings has been low. We decided to move the meetings to #openstack-kolla to try to get more people involved.

## Kayobe

Kayobe seeks to become an official project. This could be under kolla project governance, or as a separate project - the main deciding factor will be the kolla team. I will send an email about this separately.

## Priorities

I would like us to agree as a team on some priorities for the Train cycle. I will send out a separate mail about this.

# Kolla (images)

## Python 3 images

In the Train cycle all projects should be moving to support python 3 only. We therefore need to build images based on python 3. hrw has been working on this [3] for Ubuntu and Debian source images. We will also need to switch to python 3 for CentOS/RHEL source images. This work needs an owner.

A related issue is that of CentOS 8, which will support only python 3. I will follow up with a mail about this.
Binary images (RPM/deb) depend on the distros and their plans for python 3.

There is some python 3 work in kolla-ansible, to ensure we can execute ansible via python 3 both locally and on remote hosts.

## Health checks

TripleO builds some healthchecks [4] into their docker images. This allows Docker to detect when a service is unhealthy so that it can be restarted. We agreed it would be nice to see support for this in kolla but did not discuss it in depth.

## Ansible version upgrade

Ansible moves on, and so must we. The version of ansible in kolla_toolbox is now rather old (2.2), and our minimum ansible version for kolla-ansible (2.5) is also getting old. It is likely we will need to update some of the custom ansible modules in order to upgrade. We may be able to replace others with modules from Ansible (e.g. kolla_keystone_user, kolla_keystone_service). mnasiadka offered to pick this up after the Ceph Nautilus upgrade.

## Fluentd upgrade

The fluentd service needs an update - we are using 0.12.something. This needs an owner.

## Machine readable image config

The issues this would solve have now mostly been fixed in tools/version-check.py, so we'll probably leave it.

## Buildah support

At the Denver summit there was some interest in support for buildah as an alternative to docker for building images. I am led to believe TripleO already does this, so I will ask how they do it.

# Kolla Ansible

## Test coverage

We made some good progress on improving CI test coverage during the Stein cycle, adding these jobs:

* Cinder LVM
* Scenario NFV
* Major version upgrades
* Zun

During the Train cycle we aim to add these:

* MariaDB: https://review.opendev.org/655663
* Monasca: https://review.opendev.org/649893 (WIP)
* Ironic: https://review.opendev.org/568829
* Ceph upgrade: https://review.opendev.org/658132
* Tempest: https://review.opendev.org/402122 (WIP)

Other candidates include magnum, octavia and prometheus.

## Nova cells v2

We had a long discussion about nova cells v2 [5].
Thanks to jroll and johnthetubaguy for getting involved. We generally agreed that this should be part of a wider assessment of scalability in kolla-ansible, although I was keen to treat different aspects of this separately during development.

There are a number of ways to approach a multi-cell deployment, particularly in relation to how the per-cell infrastructure (DB, MQ, nova-conductor) is deployed. We discussed building a flexible mechanism for stamping out this infrastructure to arbitrary locations, then being able to point services in each cell at a given infrastructure location.

In an effort to make this as simple as possible for deployers to use, we discussed defining a reference large scale cloud architecture. Our first pass at this is as follows:

* API controllers x3+: APIs, super conductors, Galera, RabbitMQ, Keystone, etc
* Cell controller clusters x3+: cell conductors, Galera, RabbitMQ, Glance, maybe neutron? (or one per "failure domain / AZ")
* Cell computes: nova compute, neutron agents

We would aim to add a CI job based on a two cell cloud using this architecture.

We moved on to operations in a multi-cell world. It should be easy to add API controllers, cell controllers and compute nodes. We would want the ability to operate on the cloud as a whole, or on individual cells using --limit. Upgrades seem likely to introduce challenges, particularly if we want to upgrade cells individually. It's likely CERN and others have some good experience we could benefit from here.

The spec [6] needs an update based on this discussion, but comments there are welcome.

## Others

We ran out of time before discussing the other kolla-ansible issues. Please update the etherpad [1] if you have thoughts on any of these.

# Thanks

Thanks to everyone who attended the kolla PTG. A conference call is certainly more challenging than a face to face discussion, but it was also nice to not have to fly people around the world for a few design discussions.
I feel we did make progress on a number of important issues. If anyone has feedback on how we could improve next time, please get in touch.

Cheers,
Mark

[1] https://etherpad.openstack.org/p/kolla-train-ptg
[2] https://docs.openstack.org/tripleo-validations/latest/readme.html
[3] https://review.opendev.org/642375
[4] https://github.com/openstack/tripleo-common/tree/master/healthcheck
[5] https://blueprints.launchpad.net/kolla-ansible/+spec/support-nova-cells
[6] https://review.openstack.org/616645
Mark Goddard