Dear all,
Recently, one of our clients stored 300k files in a Manila CephFS share and then deleted the share in Manila. This made the driver unresponsive for several hours, until all the data had been removed from the cluster.
We had a quick look at the code in Manila [1]: the deletion is done by two calls into the Ceph bindings, delete_volume [2] followed by purge_volume [3]. The first call moves the share's directory into a volumes_deleted directory; the second recursively deletes all of that directory's contents. It is this last operation that triggers the issue.
We had a similar issue in Cinder in the past. There, Arne proposed doing a deferred deletion of volumes. I think we could do the same in Manila for the CephFS driver.
The idea is to keep calling delete_volume as today, and then have a periodic task in the driver asynchronously list the contents of that directory and trigger the purge.
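To make the idea concrete, here is a rough sketch of what I have in mind. The class, method names and directory layout below are purely illustrative stand-ins (using a local filesystem instead of the real ceph_volume_client), not the actual Manila driver code:

```python
import os
import shutil

# Illustrative sketch of deferred deletion. delete_share() only does the
# cheap rename (as delete_volume does today), while purge_deleted_shares()
# would run later from a periodic task and do the expensive recursive
# removal (as purge_volume does today). All names here are hypothetical.

class DeferredDeletionDriver:
    def __init__(self, root):
        self.volumes = os.path.join(root, "volumes")
        self.trash = os.path.join(root, "volumes_deleted")
        os.makedirs(self.volumes, exist_ok=True)
        os.makedirs(self.trash, exist_ok=True)

    def delete_share(self, name):
        # Fast path, called synchronously from the API: just move the
        # share into the trash directory and return immediately.
        os.rename(os.path.join(self.volumes, name),
                  os.path.join(self.trash, name))

    def purge_deleted_shares(self):
        # Slow path, meant to run from a periodic task: recursively
        # remove everything queued in the trash directory.
        for entry in os.listdir(self.trash):
            shutil.rmtree(os.path.join(self.trash, entry))
```

This way delete_share returns as soon as the rename completes, no matter how many files the share contains, and the heavy recursive removal happens in the background without blocking the driver.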
I can propose the change and contribute the code, but before going too deep I would like to know whether there is a reason for having a singleton for the volume_client connection. By comparison, in the Cinder code the connection is established and closed on each operation against the backend.
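For clarity, the per-operation pattern I am referring to looks roughly like the following. CephConnection is a dummy stand-in for the real binding; this is only meant to illustrate the pattern, not any actual Cinder or Ceph API:

```python
import contextlib

# Hypothetical illustration of the per-operation connection pattern,
# as opposed to a long-lived singleton client held by the driver.
# CephConnection is a placeholder, not a real Ceph binding.

class CephConnection:
    def __init__(self):
        self.connected = False

    def connect(self):
        self.connected = True

    def disconnect(self):
        self.connected = False

@contextlib.contextmanager
def ceph_connection():
    # Open a fresh connection for a single backend operation and
    # always close it afterwards, even if the operation raises.
    conn = CephConnection()
    conn.connect()
    try:
        yield conn
    finally:
        conn.disconnect()

# Usage: each operation gets its own short-lived connection.
# with ceph_connection() as conn:
#     ...perform one backend operation...
```

With a singleton, a long-running purge ties up the shared client; with per-operation connections, other requests can proceed independently, which is why I am asking whether the singleton is a deliberate choice.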
If you are not the maintainer, could you please point me to the right person? I can post this on the mailing list if you prefer.
Cheers,
Jose Castro Leon
CERN Cloud Infrastructure
[1] https://github.com/openstack/manila/blob/master/manila/share/drivers/cephfs/...
[2] https://github.com/ceph/ceph/blob/master/src/pybind/ceph_volume_client.py#L7...
[3] https://github.com/ceph/ceph/blob/master/src/pybind/ceph_volume_client.py#L7...
PS: The issue was triggered by one of our clients on Kubernetes using the Manila CSI driver.