[oslo][tooz][openstack-ansible] Discussion about coordination (tooz), too many backend options, their state and deployment implications
Hello openstack-discuss,

apologies for this being quite a long message - I tried my best to collect my thoughts on the matter.

1) The role of deployment tooling in fulfilling the requirement for a coordination backend

Honestly, what triggered me to write this are the openstack-ansible plans to add coordination via the Zookeeper backend (https://lists.openstack.org/pipermail/openstack-discuss/2022-October/031013....).

On 27/10/2022 13:10, Dmitriy Rabotyagov wrote:
* Add Zookeeper cluster deployment as coordination service. Coordination is required if you want to have an active/active cinder-volume setup and it is also actively used by other projects, like Octavia or Designate. Zookeeper will be deployed in a separate set of containers for the LXC path
First of all, I believe it is essential for any OpenStack deployment tooling to handle the deployment of a coordination backend, as many OS projects simply rely, in their design and code, on one being in place. But I am convinced that, with this many options, some stronger guidance should be given to people designing and then deploying OpenStack for their platform. This guidance can certainly come in the form of a comparison table - but when it comes to deployment tooling like openstack-ansible, the provided "default" component or option might just be worth more than written text explaining all of the possible approaches.

This holds especially true to me because you can get quite far with no coordination configured at all, which then results in frustration and invalid bugs being raised. And it's not just openstack-ansible thinking about coordination deployment / configuration. To point to just a few:

* Kolla-Ansible: https://lists.openstack.org/pipermail/openstack-discuss/2020-November/018838...
* Charms: https://bugs.launchpad.net/charm-designate/+bug/1759597
* Puppet: https://review.opendev.org/c/openstack/puppet-oslo/+/791628/
* ...

2) Choosing the "right" backend driver

I've recently been looking into the question of which tooz driver would best cover all the coordination use cases the various OS projects require. Yes, the dependencies on and use of coordination within the OS projects (cinder, designate, gnocchi, ...) are very different. I don't want to sound unfair, but most don't communicate which of the tooz services they actually require. In any case, I would just like something that covers all possible requirements - "set it (up) and forget it" - no matter which OS projects run on the platform.
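To make this a bit more concrete, here is a minimal sketch of the distributed locking service - the kind of lock an active/active cinder-volume setup relies on. The member id, lock name and the Zookeeper endpoint are just placeholders, not taken from any particular project:

    from tooz import coordination

    # The backend is selected purely via this URL; a Zookeeper endpoint
    # is assumed here as a placeholder.
    coordinator = coordination.get_coordinator(
        'zookeeper://192.0.2.10:2181', b'cinder-volume-host-1')
    coordinator.start(start_heart=True)

    # Serialize work on a shared resource across all active workers.
    lock = coordinator.get_lock(b'volume-0001')
    if lock.acquire(blocking=True):
        try:
            pass  # ... perform the critical operation on the volume ...
        finally:
            lock.release()

    coordinator.stop()

Whether that lock can actually be relied upon under failures and partitions depends entirely on the driver behind the URL - which is where the following qualities come in.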
Apart from basic compatibility, there are qualities I would expect (in no particular order) from a coordination backend:

* no "best-effort" coordination, but something that can actually be relied upon (CP, if talking CAP)
* HA - this needs to work just as reliably as my database, as otherwise the cloud cannot function
* efficient at getting the job done (e.g. support for events / watches to reduce latency)
* lightweight (deployment), no additional components, readily packaged
* very little maintenance effort
* easy monitoring

I started by reading up on the tooz drivers (https://docs.openstack.org/tooz/latest/user/drivers.html), of which there are more than enough to require some research. Here are my rough thoughts:

a) I ruled out the IPC, file and RDBMS (mysql, postgresql) backend options, as they all come with strong side notes (issues when doing replication, or no HA at all). Additionally, they usually are neither partition tolerant nor support watches.

b) Redis seems quite capable, but there are many side notes about HA, and it also requires setting up and maintaining Sentinel.

c) Memcached supports all three services (locks, groups, leader election) tooz provides and is usually already part of an OpenStack infrastructure, so it looked appealing. But its non-replicating architecture and lack of any strong consistency guarantees make it less of a good "standard". I was even wondering how tooz would try its best to work with multiple memcached nodes (https://bugs.launchpad.net/python-tooz/+bug/1970659).

d) That leaves Zookeeper, which ticks all the (feature) boxes (https://docs.openstack.org/tooz/latest/user/compatibility.html) and is quite a proven tool for coordination outside the OpenStack ecosystem as well. On the downside, it's not really that well known and common (anymore) outside the "data processing" context (see https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy). Being a Java application, it requires a JVM and its dependencies, and it is quite memory-heavy for storing just a few megabytes of config data. With more and more people putting their OS control plane into something like Kubernetes, it also seems less suitable to be "moved around" a lot. Another issue might be the lack of a recent and non-EoL version packaged in Ubuntu - see https://bugs.launchpad.net/ubuntu/+source/zookeeper/+bug/1854331. Maybe (!) this could be an indication of how commonly it is (still) used. Support for TLS was only added in 3.5.5 (https://zookeeper.apache.org/doc/r3.5.5/zookeeperAdmin.html#Quorum+TLS).

e) Consul - while also well known and loved, it has, like Zookeeper, quite a big footprint and is way more than just a CP-focused database. It's more of an application with many use cases.

f) A last "strong" candidate is etcd. It did not surprise me to see it on the list of possible drivers, and it certainly is a tool known to many from running e.g. Kubernetes. It's actually already part of the openstack-ansible deployment code as a role (https://github.com/openstack/openstack-ansible/commit/2f240dd485b123763442aa...), as it is required when using Calico as SDN. While etcd is also something one must know how to monitor and operate, allow me to say it might just be more common to find this operational knowledge. Also, etcd has a smaller footprint than Zookeeper, and it being "just a Golang binary" means (almost) no additional dependencies. But I noticed that it does not even support "grouping", according to the feature matrix. Apparently this is just a documentation delay though, see https://bugs.launchpad.net/python-tooz/+bug/1995125. What is left to implement would be leader election, but there seems to be no technical reason why this cannot be done.

This is by no means a comparison with a clear winner. I just want to stress how confusing it is to have lots of options with no real guidance. The requirement to choose and deploy coordination might not be a focus when designing an OS cloud.
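To illustrate the other two services the compatibility matrix compares, group membership and leader election, here is an equally minimal sketch - again with placeholder names and endpoint, and assuming a backend that implements both, which per the matrix Zookeeper does:

    import time

    from tooz import coordination

    # Placeholder member and group names; a backend supporting groups and
    # leader election is assumed.
    coordinator = coordination.get_coordinator(
        'zookeeper://192.0.2.10:2181', b'worker-1')
    coordinator.start(start_heart=True)

    group = b'my-service-cluster'
    try:
        coordinator.create_group(group).get()
    except coordination.GroupAlreadyExist:
        pass  # another member created the group first
    coordinator.join_group(group).get()

    def on_elected_leader(event):
        # Called once this member wins the election for the group.
        print('%s is now leader of %s' % (event.member_id, event.group_id))

    coordinator.watch_elected_as_leader(group, on_elected_leader)

    while True:
        coordinator.run_watchers()  # triggers elections and watch callbacks
        time.sleep(1)

With a driver that lacks one of these capabilities, this is exactly the part that will not work, which is why the gaps in the feature matrix matter in practice.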
3) Stronger guidance / "common default", set up via OS deployment tooling and also used for DevStack and tested via CI

To summarize, there are just too many options and implications in the compatibility list to quickly choose the "right" one for one's own deployment. Large-scale deployments might not mind coordination having a bigger footprint and requiring more attention in general. But for smaller and even mid-size deployments, it is just convenient to offload the configuration of coordination and the selection of the backend driver to the deployment tooling. Making it way too easy for such installations to not use coordination and run into issues, or having every other installation use a different backend, creates a very fragmented landscape. Add different operating system distributions and versions, different deployment tooling, different sets and versions of OS projects used, and there will be a great many combinations. This will likely just cause OS projects to receive more, and non-reproducible, bug reports. Also, not having a (somewhat common) coordination backend used within CI and DevStack does not expose the relevant code paths to enough testing.

I'd like to make the analogy to having "just" MySQL as the default database engine, while still allowing other engines to be used (https://governance.openstack.org/tc/resolutions/20170613-postgresql-status.h...). Or labeling certain options as "experimental", as Neutron just did with "linuxbridge" (https://docs.openstack.org/neutron/latest//admin/config-experimental-framewo...), or as cinder does by naming drivers unsupported (https://docs.openstack.org/cinder/ussuri/drivers-all-about.html#unsupported-...).

My point is that just having all those backends and no active guidance might make tooz a very open and flexible component, but I myself would wish for less confusion around this topic and a little less to think about myself.

Maybe the "selection" of Zookeeper by openstack-ansible is just that?

I would love to hear your thoughts on coordination, and why and how you ended up using what you use. And certainly your opinion on the matter of a more strongly communicated "default".

Thanks for your time and thoughts!


Christian
Hello,

Interesting topic. We use Redis because, frankly, we see that as the most logical choice due to the complexity of the others. You might have seen my thread about investigating replacing RabbitMQ with NATS; our plan is to then also investigate getting tooz and oslo.cache to use the JetStream Key-Value feature.

Best regards
Tobias
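For readers wondering what the Redis choice looks like from the tooz side, a rough sketch follows. The addresses are placeholders, and the Sentinel-related query options reflect my reading of the tooz Redis driver documentation, so treat them as an assumption to verify against the version you deploy:

    from tooz import coordination

    # Placeholder Sentinel addresses (26379 is the conventional Sentinel port).
    # sentinel / sentinel_fallback are assumptions based on the tooz Redis
    # driver docs - verify before relying on them.
    url = ('redis://192.0.2.20:26379'
           '?sentinel=mymaster'
           '&sentinel_fallback=192.0.2.21:26379'
           '&sentinel_fallback=192.0.2.22:26379')

    coordinator = coordination.get_coordinator(url, b'api-node-1')
    coordinator.start(start_heart=True)
    # ... locks, groups, etc. as with any other driver ...
    coordinator.stop()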
On 31/10/2022 15:07, Tobias Urdin wrote:
Interesting topic, we use Redis because frankly we see that as the most logical choice due to the complexity of others.
Interesting how one's mileage varies :-)
You might have seen my thread about investigating replacing RabbitMQ with NATS; our plan is to then also investigate getting Tooz and oslo.cache using the Jetstream Key-Value feature.
That sounds really interesting, I shall follow that discussion then.

If one tool, e.g. NATS in your case, could cover more than one communication use case, e.g. (async) messaging and distributed locking, this would reduce the number of different components required to assemble a cloud and thus reduce the complexity - even if more than one instance of that software were required. As I was also arguing, adding more and more implementations and "ways" to do things helps neither the operators nor the developers.

To me, software developers benefit from clear abstractions for cross-cutting concerns such as messaging or coordination. While e.g. tooz already aims to be such an abstraction, when deploying OpenStack or operating a cloud things can look vastly different: no coordination at all, or different drivers with different features and inherently different guarantees and behavior in case of problems. Discussing broadly, and then agreeing not only on a common library and its interface but also on an implementation, is to me not inflexible, but makes sense to keep the complexity manageable. It happened with MySQL/MariaDB as the db engine and actually also with AMQP as the messaging protocol (including its paradigms). Revisiting such decisions and conventions over time is simply part of making progress.

Regards


Christian
On Mon, Oct 31, 2022, at 4:29 AM, Christian Rohmann wrote:
snip
d) That leaves Zookeeper, which ticks all the (feature) boxes (https://docs.openstack.org/tooz/latest/user/compatibility.html) and is quite a proven tool for coordination outside the OpenStack ecosystem as well. On the downside, it's not really that well known and common (anymore) outside the "data processing" context (see https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy). Being a Java application, it requires a JVM and its dependencies, and it is quite memory-heavy for storing just a few megabytes of config data. With more and more people putting their OS control plane into something like Kubernetes, it also seems less suitable to be "moved around" a lot. Another issue might be the lack of a recent and non-EoL version packaged in Ubuntu - see https://bugs.launchpad.net/ubuntu/+source/zookeeper/+bug/1854331. Maybe (!) this could be an indication of how commonly it is (still) used. Support for TLS was only added in 3.5.5 (https://zookeeper.apache.org/doc/r3.5.5/zookeeperAdmin.html#Quorum+TLS).
Zuul relies on Zookeeper for its coordination and shared state (without tooz). This is nice because it means we can look at the OpenDev Zuul ZK cluster stats for more info.

We currently run a three node cluster. Each node is a 4vcpu 4GB memory VM. The JVM itself seems to consume just under a gig of memory per node. Total system memory stats can be seen here [0]. According to `docker image list` the zookeeper container images we are running are 265MB large. If you scroll to the bottom of this grafana dashboard [1] you'll see operating stats for the cluster.

All that to show that zookeeper isn't free, but it also isn't terribly expensive to run either. Particularly when it tends to fill an important role of preventing software from trampling over itself.

As far as installing it goes, we've been happily using the official docker images [2]. They have worked well for us and have been kept up to date (including TLS support). If you don't want to use those images the tarballs upstream publishes [3] include init scripts that can be used to manage zookeeper as a proper service. You just download, verify, extract, and execute the script (assuming you have java installed) and the service runs.

I'm not going to try and convince anyone that they should use Zookeeper or not. I just want to put concrete details on some of these concerns.

[0] http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=70034&rra_id=all
[1] https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-7d&to=now
[2] https://hub.docker.com/_/zookeeper
[3] https://zookeeper.apache.org/releases.html
participants (3)
- Christian Rohmann
- Clark Boylan
- Tobias Urdin