Re: [cinder] Ceph, active-active and no coordination

18 Nov 2020

On Tue, 17 Nov 2020 at 20:27, Radosław Piliszek
<radoslaw.piliszek@gmail.com> wrote:
...
Dear Cinder Masters,
I have a question for you. (or two, or several; well, actually the
whole Kolla team has :-) )
Thanks for kicking off this thread, Radek.
...
The background is that Kolla has been happily deploying multinode
cinder-volume with Ceph RBD backend, with no coordination configured,
cluster parameter unset, host properly set per host and backend_host
normalised (as well as any other relevant config) between the
cinder-volume hosts.
The first question is: do we correctly understand that this was an
active-active deployment? Or really something else?
Now, there have been no reports that it misbehaved for anyone. It
certainly has not for any Kolla core. It was brought to our attention
because, once Kolla-deployed Ceph was dropped, the recommendation to
set backend_host was no longer present and users tripped over
non-uniform backend_host. And this is expected, of course.
Here is the bug report [1]. It relates to using an externally deployed
Ceph cluster, rather than one deployed via Kolla Ansible.

To provide a little more background, in Train and earlier releases our
documentation said to set backend_host. From Ussuri, we automated more
of the Ceph configuration, and in the process dropped backend_host. It's not
clear why.

Users upgrading to Ussuri from Train, and dropping their custom Cinder
config in favour of the Kolla automation would lose backend_host, and
therefore volumes would become unmanageable. A manual step is required
to move them to one of the cinder-volume hosts.
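
To make the "uniform backend_host" condition concrete, here is a minimal
sketch of a check across cinder-volume nodes. It is only an illustration:
the node names, the "rbd-1" section name and the "rbd:volumes" value are
assumptions chosen for the example, and it parses with Python's
configparser rather than oslo.config, which is what Cinder itself uses.

```python
import configparser


def backend_hosts(conf_texts, backend="rbd-1"):
    """Map each node name to the backend_host found in its cinder.conf text."""
    result = {}
    for node, text in conf_texts.items():
        parser = configparser.ConfigParser(interpolation=None)
        parser.read_string(text)
        # None when the option (or the whole section) is missing; the
        # backend then falls back to the per-node "host" value.
        result[node] = parser.get(backend, "backend_host", fallback=None)
    return result


def is_uniform(hosts):
    """True only if every node sets the same, explicit backend_host."""
    values = set(hosts.values())
    return len(values) == 1 and None not in values


# Hypothetical nodes with backend_host set identically (Train-style docs).
TRAIN_STYLE = {
    "controller1": "[DEFAULT]\nhost = controller1\n[rbd-1]\nbackend_host = rbd:volumes\n",
    "controller2": "[DEFAULT]\nhost = controller2\n[rbd-1]\nbackend_host = rbd:volumes\n",
}

# The same hypothetical nodes after backend_host was dropped.
USSURI_STYLE = {
    "controller1": "[DEFAULT]\nhost = controller1\n[rbd-1]\n",
    "controller2": "[DEFAULT]\nhost = controller2\n[rbd-1]\n",
}

if __name__ == "__main__":
    print(is_uniform(backend_hosts(TRAIN_STYLE)))
    print(is_uniform(backend_hosts(USSURI_STYLE)))
```

In the second case each service would register under its own host name, so
volumes created under the old shared name no longer match any running
cinder-volume service.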

That bug caused us to question the active/active setup, especially
after finding a related OSA bug [2].

I can't find any Cinder admin guide for active/active configuration,
although there is a high level spec [3] (with linked sub-specs) and
some contributor docs [4] that outline the various problems.

[1] https://bugs.launchpad.net/kolla-ansible/+bug/1904062
[2] https://bugs.launchpad.net/openstack-ansible/+bug/1837403
[3] https://specs.openstack.org/openstack/cinder-specs/specs/mitaka/cinder-volum...
[4] https://docs.openstack.org/cinder/latest/contributor/high_availability.html
...
The second and final question, building on the first one, is: were
we doing it wrong all along?
(plus extras: Why did it work? Were there any quirks? What should we do?)
PS: Please let me know if this thought process is actually
Ceph-independent as well.
-yoctozepto

Mark Goddard