Hello openstack-discuss,

apologies for the length of this message; I tried my best to collect my thoughts on the matter.

1) The role of deployment tooling in fulfilling the requirement for a coordination backend

To be honest, what triggered this message is the openstack-ansible plan to add coordination via the Zookeeper backend (https://lists.openstack.org/pipermail/openstack-discuss/2022-October/031013....).

On 27/10/2022 13:10, Dmitriy Rabotyagov wrote:
* Add Zookeeper cluster deployment as coordination service. Coordination is required if you want to have active/active cinder-volume setup and also actively used by other projects, like Octavia or Designate. Zookeeper will be deployed in a separate set of containers for LXC path.

First of all, I believe it's essential for any OpenStack deployment tooling to handle the deployment of a coordination backend, as many OS projects simply rely, in their design and code, on one being in place. But I am also convinced that there are too many options and that some stronger guidance should be given to people designing and then deploying OS for their platform. This guidance can certainly take the form of a comparison table, but when it comes to deployment tooling like openstack-ansible, the provided "default" component or option might just be worth more than written text explaining all of the possible approaches. This holds especially true to me because you can get quite far with no coordination configured, which then results in frustration and invalid bugs being raised.

And it's not just openstack-ansible thinking about coordination deployment / configuration. To point to just a few:

* Kolla-Ansible: https://lists.openstack.org/pipermail/openstack-discuss/2020-November/018838...
* Charms: https://bugs.launchpad.net/charm-designate/+bug/1759597
* Puppet: https://review.opendev.org/c/openstack/puppet-oslo/+/791628/
* ...

2) Choosing the "right" backend driver

I've recently been looking into the question of what would be the "best" tooz driver to cover all coordination use cases the various OS projects require. Yes, the dependencies on and uses of coordination within the OS projects (cinder, designate, gnocchi, ...) are very different. I don't want to sound unfair, but most don't communicate which of the Tooz services they actually require. In any case, I would just like something that covers all possible requirements, to "set it (up) and forget it", no matter what OS projects run on the platform.

Apart from basic compatibility, there are qualities I would expect (in no particular order) from a coordination backend:

* no "best-effort" coordination, but allowing for actual reliance on it (CP if talking CAP)
* HA - this needs to work just as reliably as my database, as otherwise the cloud cannot function
* efficient in getting the job done (e.g. support for events / watches to reduce latency)
* lightweight (deployment), no additional components, readily packaged
* very few maintenance operations
* easy monitoring

I started by reading up on the tooz drivers (https://docs.openstack.org/tooz/latest/user/drivers.html), of which there are more than enough to require some research. Here are my rough thoughts:

a) I ruled out the IPC, file and RDBMS (mysql, postgresql) backend options, as they all come with strong caveats (issues when doing replication, or no HA at all). Additionally, they are usually not partition tolerant and do not support watches.

b) Redis seems quite capable, but there are many caveats about HA, and it also requires setting up and maintaining sentinel.

c) Memcached supports all three services tooz provides (locks, groups, leader-election) and is usually already part of an OpenStack infrastructure, so it looked appealing. But its non-replicating architecture and lack of any strong consistency guarantees make it less suitable as a good "standard". I was even wondering how tooz would try its best to work with multiple memcached nodes (https://bugs.launchpad.net/python-tooz/+bug/1970659).

d) Then there is only Zookeeper left, which also ticks all the (feature-)boxes (https://docs.openstack.org/tooz/latest/user/compatibility.html) and is quite a proven tool for coordination, also outside of the OpenStack ecosystem. On the downside, it's not really that well known and common (anymore) outside the "data processing" context (see https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy).
Being a Java application, it requires a JVM and its dependencies and is quite memory heavy for storing just a few megabytes of config data. With more and more people putting their OS control plane into something like Kubernetes, it also seems even less suitable to be "moved around" a lot. Another issue might be the lack of a recent, non-EoL version packaged in Ubuntu - see https://bugs.launchpad.net/ubuntu/+source/zookeeper/+bug/1854331. Maybe (!) this could be an indication of how commonly it is used outside of e.g. ... Support for TLS was only added in 3.5.5 (https://zookeeper.apache.org/doc/r3.5.5/zookeeperAdmin.html#Quorum+TLS).

e) Consul - While also well known and loved, it has, like Zookeeper, quite a big footprint and is way more than just a CP-focused database. It's more of an application with many use cases.

f) A last "strong" candidate is etcd. It did not surprise me to see it on the list of possible drivers, and it certainly is a tool known to many from running e.g. Kubernetes. It's actually already part of the openstack-ansible deployment code as a role (https://github.com/openstack/openstack-ansible/commit/2f240dd485b123763442aa...), as it is required when using Calico as SDN. While etcd is also something one must know how to monitor and operate, I would venture to say that this operational knowledge might just be more common to find. Also, etcd has a smaller footprint than Zookeeper and, being "just a Golang binary", comes with fewer (no) dependencies. But I noticed that it does not even support "grouping", according to the feature matrix. Apparently this is just a documentation delay, though, see https://bugs.launchpad.net/python-tooz/+bug/1995125. What's left to implement would be leader-election, but there seems to be no technical reason why this cannot be done.

This is by no means a comparison with a clear winner. I just want to stress how confusing it is to have lots of options with no real guidance.
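
To make the selection problem above a bit more concrete, here is a minimal Python sketch that filters backends by the tooz services a deployment needs. The feature sets are only transcribed from the notes in this message (not read from tooz itself), and the names FEATURES and drivers_covering are made up for illustration; the authoritative source remains the tooz compatibility matrix.

```python
# Minimal sketch: which backends cover a given set of required tooz services?
# Assumption: the feature sets below are transcribed from the notes in this
# message, not queried from tooz. Check the compatibility matrix before use.

LOCKS = "locks"
GROUPS = "groups"
LEADER = "leader-election"

# Hypothetical table; only backends whose services are stated above are listed.
FEATURES = {
    "memcached": frozenset({LOCKS, GROUPS, LEADER}),
    "zookeeper": frozenset({LOCKS, GROUPS, LEADER}),
    # Groups per bug 1995125; leader-election is described as not implemented.
    "etcd": frozenset({LOCKS, GROUPS}),
}


def drivers_covering(required, features=FEATURES):
    """Return the sorted names of backends offering every required service."""
    needed = frozenset(required)
    return sorted(name for name, offered in features.items() if needed <= offered)


if __name__ == "__main__":
    # A platform that wants to "set it and forget it" needs all three services.
    print(drivers_covering([LOCKS, GROUPS, LEADER]))
    # A platform that only needs locks and groups has more choice.
    print(drivers_covering([LOCKS, GROUPS]))
```

Note that such a table only answers the compatibility question; the other qualities listed earlier (HA, consistency, footprint, packaging) still have to be weighed separately.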
The requirement to choose and deploy coordination might not be a focus when looking into designing an OS cloud.

3) Stronger guidance / "common default", set up via OS deployment tooling and also used for DevStack and tested via CI

To summarize, there are just too many options and implications in the compatibility list to quickly choose the "right" one for one's own deployment. Large-scale deployments will likely not mind coordination having a bigger footprint and requiring more attention in general. But for smaller and even mid-size deployments, it's just convenient to offload the configuration of coordination and the selection of the backend driver to the deployment tooling. Making it way too easy for such installations to not use coordination and run into issues, or having every other installation use a different backend, creates a very fragmented landscape. Add different operating system distributions and versions, different deployment tooling, and different sets and versions of OS projects used, and there will be a great many combinations. This will likely just cause OS projects to receive more, and non-reproducible, bugs. Also, not having a (somewhat common) coordination backend used within CI and DevStack means the relevant code paths are not exposed to enough testing.

I'd like to make the analogy to having "just" MySQL as the default database engine, while still allowing other engines to be used (https://governance.openstack.org/tc/resolutions/20170613-postgresql-status.h...). Or to labeling certain options as "experimental", as Neutron just did with "linuxbridge" (https://docs.openstack.org/neutron/latest//admin/config-experimental-framewo...), or to cinder naming drivers unsupported (https://docs.openstack.org/cinder/ussuri/drivers-all-about.html#unsupported-...). My point is that just having all those backends and no active guidance might make Tooz a very open and flexible component.
Still, I myself would wish for somewhat less confusion around this topic and to have a little less to think about myself. Maybe the "selection" of Zookeeper by openstack-ansible is just that?

I would love to hear your thoughts on coordination, and why and how you ended up using what you use. And certainly what your opinion is on the matter of a more strongly communicated "default".

Thanks for your time and thoughts!

Christian