On Tue, 17 Nov 2020 at 20:27, Radosław Piliszek <radoslaw.piliszek@gmail.com> wrote:
Dear Cinder Masters,
I have a question for you. (or two, or several; well, actually the whole Kolla team has :-) )
Thanks for kicking off this thread, Radek.
The background is that Kolla has been happily deploying multinode cinder-volume with the Ceph RBD backend, with no coordination configured, the cluster parameter unset, host set properly per host, and backend_host normalised (as well as any other relevant config) across the cinder-volume hosts.
The first question is: do we correctly understand that this was an active-active deployment? Or really something else?
Now, there have been no reports that it misbehaved for anyone. It certainly has not for any Kolla core. It was brought to our attention because, after Kolla-deployed Ceph was dropped, the recommendation to set backend_host was no longer present, and users tripped over non-uniform backend_host values. And this is expected, of course.
Here is the bug report [1]. It relates to using an externally deployed Ceph cluster, rather than one deployed via Kolla Ansible.

To provide a little more background: in Train and earlier releases, we documented setting backend_host. From Ussuri, we automated more of the Ceph configuration, and in the process dropped backend_host. It's not clear why. Users upgrading from Train to Ussuri, and dropping their custom Cinder config in favour of the Kolla automation, would lose backend_host, and their volumes would therefore become unmanageable. A manual step is required to move them to one of the cinder-volume hosts.

That bug caused us to question the active/active setup, especially after finding a related OSA bug [2]. I can't find any Cinder admin guide for active/active configuration, although there is a high-level spec [3] (with linked sub-specs) and some contributor docs [4] that outline the various problems.

[1] https://bugs.launchpad.net/kolla-ansible/+bug/1904062
[2] https://bugs.launchpad.net/openstack-ansible/+bug/1837403
[3] https://specs.openstack.org/openstack/cinder-specs/specs/mitaka/cinder-volum...
[4] https://docs.openstack.org/cinder/latest/contributor/high_availability.html
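To make the backend_host point concrete, here is a minimal Python sketch. It is not taken from Kolla or Cinder code. It assumes that the service name for a backend is built as "<backend_host or host>@<backend>", and it only checks whether that name comes out the same on every cinder-volume host. The host names (ctl1, ctl2), the backend name (rbd-1) and the value rbd:volumes are illustrative assumptions.

```python
import configparser


def effective_service_hosts(conf_text):
    """Map each enabled backend to its assumed service name.

    Assumption: the name is "<backend_host or host>@<backend>".
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    default_host = parser.get("DEFAULT", "host", fallback=None)
    enabled = parser.get("DEFAULT", "enabled_backends", fallback="")
    result = {}
    for name in (b.strip() for b in enabled.split(",") if b.strip()):
        backend_host = None
        if parser.has_section(name):
            backend_host = parser.get(name, "backend_host", fallback=None)
        result[name] = f"{backend_host or default_host}@{name}"
    return result


def is_uniform(per_host_conf):
    """True if every host would register the same service names."""
    seen = {
        tuple(sorted(effective_service_hosts(text).items()))
        for text in per_host_conf.values()
    }
    return len(seen) <= 1


# Illustrative cinder.conf fragments (hypothetical hosts and backend name).
WITH_BACKEND_HOST = """
[DEFAULT]
host = {host}
enabled_backends = rbd-1

[rbd-1]
backend_host = rbd:volumes
"""

WITHOUT_BACKEND_HOST = """
[DEFAULT]
host = {host}
enabled_backends = rbd-1

[rbd-1]
"""

train_style = {h: WITH_BACKEND_HOST.format(host=h) for h in ("ctl1", "ctl2")}
ussuri_style = {h: WITHOUT_BACKEND_HOST.format(host=h) for h in ("ctl1", "ctl2")}

print(is_uniform(train_style))   # same name on both hosts
print(is_uniform(ussuri_style))  # one name per host
```

Under these assumptions, volumes created while every service registered as rbd:volumes@rbd-1 would no longer match any running service once each host registers under its own name, which would be consistent with the symptom in [1].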
The second and final question, building on the first one: were we doing it wrong all along? (plus extras: Why did it work? Were there any quirks? What should we do?)
PS: Please let me know if this thought process is actually Ceph-independent as well.
-yoctozepto