[oslo][tooz][openstack-ansible] Discussion about coordination (tooz), too many backend options, their state and deployment implications
Christian Rohmann
christian.rohmann at inovex.de
Mon Oct 31 11:29:11 UTC 2022
Hello openstack-discuss,
apologies for this being quite a long message - I tried my best to
collect my thoughts on the matter.
1) The role of deployment tooling in fulfilling the requirement for a
coordination backend
I am writing this, honestly, triggered by the openstack-ansible plans to
add coordination via the Zookeeper backend
(https://lists.openstack.org/pipermail/openstack-discuss/2022-October/031013.html).
On 27/10/2022 13:10, Dmitriy Rabotyagov wrote:
> * Add Zookeeper cluster deployment as coordination service.
> Coordination is required if you want to have active/active
> cinder-volume setup and also actively used by other projects, like
> Octavia or Designate. Zookeeper will be deployed in a separate set of
> containers for LXC path
First of all, I believe it's essential for any OpenStack deployment
tooling to handle the deployment of a coordination backend, as many OS
projects simply rely, in their design and code, on having one in place.
But I am also convinced that there are too many options, and that some
stronger guidance should be given to people designing and then deploying
OS for their platform.
This guidance can certainly come in the form of a comparison table - but
when it comes to using deployment tooling like openstack-ansible,
the provided "default" component or option for something might just be
worth more than written text explaining all of the possible approaches.
This holds especially true for me, as you can get quite far with no
coordination configured at all - which then results in frustration and
invalid bugs being raised.
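To illustrate what deployment tooling would have to render, here is a
minimal sketch of coordination configured for e.g. cinder - the hostname,
cluster name and backend choice below are placeholders, not a
recommendation:

    # cinder.conf (sketch)
    [DEFAULT]
    # naming a cluster is what enables active/active cinder-volume
    cluster = my-cluster

    [coordination]
    # tooz backend, selected purely via the URL scheme
    backend_url = zookeeper://coord1.example.com:2181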
And it's not just openstack-ansible thinking about coordination
deployment / configuration. To point to just a few:
* Kolla-Ansible:
https://lists.openstack.org/pipermail/openstack-discuss/2020-November/018838.html
* Charms: https://bugs.launchpad.net/charm-designate/+bug/1759597
* Puppet: https://review.opendev.org/c/openstack/puppet-oslo/+/791628/
* ...
2) Choosing the "right" backend driver
I've recently been looking into the question of what would be the "best"
tooz driver to cover all the coordination use cases
the various OS projects require. Yes, the dependency on and use of
coordination within the OS projects (cinder, designate, gnocchi, ...)
are very different.
I don't want to sound unfair, but most don't communicate which of the
tooz services they actually require. In any case, I would just like
something that covers all possible
requirements - "set it (up) and forget it", no matter what OS projects
run on the platform.
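For context, here is a rough sketch of what those tooz services (locks,
groups, leader election) look like from a consumer's perspective - the
backend URL, member and group names are made up, and the zookeeper
driver additionally needs the kazoo package:

    # minimal tooz sketch; requires the tooz package
    from tooz import coordination

    # the backend is selected purely via the URL scheme
    coord = coordination.get_coordinator(
        'zookeeper://coord1.example.com:2181', b'member-1')
    coord.start(start_heart=True)

    # 1) distributed locking
    with coord.get_lock(b'my-lock'):
        pass  # only one member at a time gets here

    # 2) group membership
    try:
        coord.create_group(b'my-group').get()
    except coordination.GroupAlreadyExist:
        pass
    coord.join_group(b'my-group').get()

    # 3) leader election - the callback fires via coord.run_watchers()
    coord.watch_elected_as_leader(b'my-group', lambda event: None)

    coord.stop()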
Apart from basic compatibility, there are qualities I would expect (in
no particular order) from a coordination backend:
* no "best-effort" coordination, but allowing for actual reliance on
it (CP if talking CAP)
* HA - this needs to be working just as reliably as my database as
otherwise the cloud cannot function
* efficient in getting the job done (e.g. support for events / watches
to reduce latency)
* lightweight (deployment), no additional components, readily packaged
* very little maintenance operations
* easy monitoring
I started by reading up on the tooz drivers
(https://docs.openstack.org/tooz/latest/user/drivers.html),
of which there are more than enough to require some research. Here are
my rough thoughts:
a) I ruled out the IPC, file and RDBMS (MySQL, PostgreSQL) backend
options, as they all come with strong caveats (issues when doing
replication, or no HA at all).
Additionally, they usually are not partition tolerant and do not
support watches.
b) Redis seems quite capable, but there are many caveats about HA,
and it also requires setting up and maintaining Sentinel.
c) Memcached supports all three services (locks, groups,
leader election) tooz provides and is usually already part of an
OpenStack infrastructure, so it looked appealing.
But its non-replicating architecture and lack of any strong consistency
guarantees make it less of a good "standard". I was even wondering how
tooz would try its best to work with multiple memcached nodes
(https://bugs.launchpad.net/python-tooz/+bug/1970659).
d) Then there is only Zookeeper left, which also ticks all the
(feature) boxes
(https://docs.openstack.org/tooz/latest/user/compatibility.html) and is
quite a proven tool for coordination, also outside of the OpenStack
ecosystem.
On the downside it's not really that well known and common (anymore)
outside the "data processing" context (see
https://cwiki.apache.org/confluence/display/ZOOKEEPER/PoweredBy).
Being a Java application it requires a JVM and its dependencies, and it
is quite memory-heavy for storing just a few megabytes of config data.
Looking at more and more people putting their OS control plane into
something like Kubernetes, it also seems even less suitable to be
"moved around" a lot. Another issue might be the lack of a recent and
non-EoL version packaged in Ubuntu - see
https://bugs.launchpad.net/ubuntu/+source/zookeeper/+bug/1854331. Maybe
(!) this could be an indication of how commonly it is (not) used
outside of e.g. the data processing ecosystem. Also, support for TLS
was only added in 3.5.5
(https://zookeeper.apache.org/doc/r3.5.5/zookeeperAdmin.html#Quorum+TLS).
e) Consul - While also well known and loved, it has, like Zookeeper,
quite a big footprint and is way more than just a CP-focused database.
It's more of an application with many use cases.
f) A last "strong" candidate is etcd. It did not surprise me to see
it on the list of possible drivers and certainly is a tool known to many
from running e.g. Kubernetes. It's actually already part of
openstack-ansible deployment code as a role
(https://github.com/openstack/openstack-ansible/commit/2f240dd485b123763442aa94130c6ddd3109ce34)
as it is required when using Calico as SDN. While etcd is also something
one must know how to monitor and operate, I allow me to say it might
just be more common to find this operational knowledge. Also etcd has a
smaller footprint than Zookeeper and it beeing "just a Golang binary"
comes with (no) less dependencies. But I noticed that it does not even
support "grouping", according to the feature matrix. But apparently this
is just a documentation delay,
seehttps://bugs.launchpad.net/python-tooz/+bug/1995125. What's left to
implement would be leader-election, but there seems to be no technical
reason why this cannot be done.
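A nice property of tooz in this regard: switching backends is (at least
in theory) just a change of the backend URL. For etcd that would be the
etcd3gw driver, which needs the etcd3gw package - the endpoint below is
a placeholder:

    # same consumer code as with any other tooz backend, e.g. for locks
    from tooz import coordination

    coord = coordination.get_coordinator(
        'etcd3+http://coord1.example.com:2379', b'member-1')
    coord.start(start_heart=True)
    with coord.get_lock(b'my-lock'):
        pass  # critical section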
This by no means is a comparison with a clear winner. I just want to
stress how confusing having lots of options with no
real guidance is. The requirement to choose and deploy coordination
might not be a focus when looking into designing an OS cloud.
3) Stronger guidance / "common default", setup via OS deployment
tooling and also used for DevStack and tested via CI
To summarize, there are just too many options and implications in the
compatibility list to quickly choose the "right" one for one's own
deployment.
Large-scale deployments might not mind coordination having a bigger
footprint and requiring more attention in general.
But for smaller and even mid-size deployments, it's just convenient to
offload the configuration of coordination and the selection of the
backend driver to the deployment tooling.
Making it way too easy for such installations to not use coordination
at all and run into issues, or having every other installation use a
different backend, creates a very fragmented landscape.
Add different operating system distributions and versions, different
deployment tooling, and different sets and versions of OS projects
used, and there will be a huge number of combinations.
This will likely just cause OS projects to receive more, and
non-reproducible, bugs. Also, not having a (somewhat common)
coordination backend used within CI and DevStack does not expose
the relevant code paths to enough testing.
I'd like to make the analogy to having "just" MySQL as the default
database engine, while still allowing other engines to be used
(https://governance.openstack.org/tc/resolutions/20170613-postgresql-status.html).
Or labeling certain options as "experimental", as Neutron just did with
"linuxbridge"
(https://docs.openstack.org/neutron/latest//admin/config-experimental-framework.html),
or as cinder does by naming drivers unsupported
(https://docs.openstack.org/cinder/ussuri/drivers-all-about.html#unsupported-drivers).
My point is that just having all those backends and no active guidance
might make tooz a very open and flexible component, but it leaves the
hard choice to every single deployer.
I myself would wish for a little less confusion around this topic and
a little less to think about myself.
Maybe the "selection" of Zookeeper by openstack-ansible is just that?
I would love to hear your thoughts on coordination, and why and how you
ended up using what you use.
And certainly your opinion on the matter of a more strongly communicated
"default".
Thanks for your time and thoughts!
Christian