On Thu, 14 Dec 2023 at 10:28, Eugen Block <eblock@nde.ag> wrote:
Interesting, I have a kolla-ansible one-node cluster with Antelope and
there I see what you describe as well. So the behavior did indeed
change. I guess the docs should be updated and contain read-only rbd
profile for glance.
This sounds like a regression to me.
Maybe, see below.
We debated this a lot when Ceph broke their backwards compatibility on deletes, and I'm pretty sure, if my memory serves me right, that we found a solution in the Ceph store driver that avoids needing permissions on other pools. There really is no excuse for Glance to have read access to volume data or Nova ephemeral data.
In preparation for writing up some PRs for the documentation, I dug a little deeper and made the following observations:
a) Updating the ceph auth caps of the Glance user to
ceph auth caps client.glance mon 'profile rbd' mgr 'profile rbd
pool=images' osd 'allow class-read object_prefix rbd_children,
profile rbd pool=images'
as is used by ceph-ansible [1] and other deployers does NOT fix
the issue with listing children for images:
--- cut ---
# rbd -n client.glance -k /etc/ceph/ceph.client.glance.keyring -p
images children 85ffc293-6f9f-4cba-b75a-38d9b26eb0e3
rbd: listing children failed: (1) Operation not permitted
2023-12-20T14:49:52.845+0000 7ff93ea544c0 -1 librbd::api::Image:
list_images_v2: error listing image in directory: (1) Operation
not permitted
2023-12-20T14:49:52.845+0000 7ff93ea544c0 -1 librbd::api::Image:
list_descendants: error listing v2 images: (1) Operation not
permitted
--- cut ---
b) While Ceph does indeed document rados objects with the prefix
"rbd_children" at [2] with regard to parent-child relationships of
images, access to them seems not to be enough to satisfy the rados
operations the list_children method requires.
Granting class-read on rbd_directory and rbd_trash instead, via
--- cut ---
# ceph auth caps client.glance mon 'profile rbd' mgr 'profile rbd
pool=images' osd 'allow class-read object_prefix rbd_directory,
allow class-read object_prefix rbd_trash, profile rbd pool=images'
updated caps for client.glance
--- cut ---
does fix this, though:
--- cut ---
# rbd -n client.glance -k /etc/ceph/ceph.client.glance.keyring -p
images children 85ffc293-6f9f-4cba-b75a-38d9b26eb0e3
volumes/volume-46481ff8-1b9e-4215-8b63-62f8d996fecc
--- cut ---
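For the documentation PRs it may help to compare the variants side by
side: the two caps strings tried above differ only in the object
prefixes that get class-read. Below is a minimal Python sketch that
composes the OSD caps and the matching "ceph auth caps" command line.
The helper names build_osd_caps and build_auth_caps_command are
hypothetical and not part of any Ceph tooling; the code only builds
strings and never calls ceph.

```python
import shlex


def build_osd_caps(prefixes, pool):
    """Compose an OSD caps string: one class-read grant per object
    prefix, followed by the rbd profile restricted to a single pool."""
    grants = [f"allow class-read object_prefix {p}" for p in prefixes]
    grants.append(f"profile rbd pool={pool}")
    return ", ".join(grants)


def build_auth_caps_command(entity, prefixes, pool):
    """Return the `ceph auth caps` invocation as a shell-quoted string.
    Hypothetical helper: nothing is executed here."""
    argv = [
        "ceph", "auth", "caps", entity,
        "mon", "profile rbd",
        "mgr", f"profile rbd pool={pool}",
        "osd", build_osd_caps(prefixes, pool),
    ]
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Variant from a), which did not fix listing children.
    print(build_auth_caps_command("client.glance", ["rbd_children"], "images"))
    # Variant from b), which did.
    print(build_auth_caps_command(
        "client.glance", ["rbd_directory", "rbd_trash"], "images"))
```

Printing the command instead of running it keeps the sketch safe to try
on a machine without a cluster; the output can be pasted into a shell
once reviewed.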
c) As far as the regression goes, I believe this could be due to
the list_children method having been updated over the releases, so
that it now fetches and returns more info on the children.
See [3] and [4] for those changes.
d) But any OpenStack deployment where the Glance user still has
read access on the volumes pool will not observe this issue. Also,
the Glance API responding to the image delete request with a 500
instead of a 400 error is not really a big issue for most users
(the deletion is rejected either way), so it's hard to say when
this became a bug.
e) Instead of trial and error on which "rbd_*"-prefixed rados
objects are required, maybe it makes sense to have someone from
Ceph look into this and define which caps are actually required to
allow list_children on RBD images with children in other pools?
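Until someone from Ceph can answer that, the trial and error could at
least be made systematic. Below is a minimal Python sketch that looks
for the smallest sets of object prefixes that pass a given check. The
function name minimal_prefix_sets and the "works" callable are
hypothetical: a real check would have to apply the caps and run
"rbd children" as the Glance user. The stand-in predicate in the demo
encodes an invented outcome, not a measurement.

```python
from itertools import combinations


def minimal_prefix_sets(candidates, works):
    """Return every smallest subset of `candidates` for which
    `works(subset)` is true. Subsets are tried in order of increasing
    size, so the search stops at the first size that yields a hit."""
    for size in range(len(candidates) + 1):
        hits = [set(c) for c in combinations(candidates, size)
                if works(set(c))]
        if hits:
            return hits
    return []


if __name__ == "__main__":
    # Stand-in check for illustration only; this outcome is assumed,
    # not observed on a cluster.
    def pretend_check(prefixes):
        return {"rbd_directory", "rbd_trash"} <= prefixes

    candidates = ["rbd_children", "rbd_directory", "rbd_trash"]
    print(minimal_prefix_sets(candidates, pretend_check))
```

This only narrows down what happens to work on one release; it does not
replace a statement from Ceph on which caps are intended to be required.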
Regards
Christian
[1] https://github.com/ceph/ceph-ansible/blob/b6102975549d8f870b0c20a01edda59d6ceac422/group_vars/all.yml.sample#L642
[2] https://docs.ceph.com/en/latest/dev/rbd-layering/#parent-child-relationships
[3] https://github.com/ceph/ceph/blame/main/src/librbd/librbd.cc#L2177
[4] https://github.com/ceph/ceph/commit/3d5f055a0796c4e059c22b46f6f1b840bb9d10ef