Dear all,

Lately, one of our clients stored 300k files in a Manila CephFS share and then deleted the share in Manila. This made the driver unresponsive for several hours, until all the data had been removed from the cluster.

We had a quick look at the code in manila [1]: the deletion is done by calling the following APIs in the Ceph bindings, delete_volume [2] and then purge_volume [3]. The first call moves the directory to a volumes_deleted directory. The second call recursively deletes all the contents of that directory. The last operation is the one that triggers the issue.

We had a similar issue in the past in Cinder. There, Arne proposed deferred deletion of volumes. I think we could do the same in Manila for the CephFS driver. The idea is to keep calling delete_volume, and then have a periodic task in the driver asynchronously list the contents of that directory and trigger the purge command.

I can propose the change and contribute the code, but before going too deep I would like to know whether there is a reason for having a singleton for the volume_client connection. In the Cinder code, by comparison, the connection is established and closed in each operation with the backend.

If you are not the maintainer, could you please point me to them? I can post this on the mailing list if you prefer.

Cheers,
Jose Castro Leon
CERN Cloud Infrastructure

[1] https://github.com/openstack/manila/blob/master/manila/share/drivers/cephfs/...
[2] https://github.com/ceph/ceph/blob/master/src/pybind/ceph_volume_client.py#L7...
[3] https://github.com/ceph/ceph/blob/master/src/pybind/ceph_volume_client.py#L7...

PS: The issue was triggered by one of our clients in Kubernetes using the Manila CSI driver.
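To make the proposal concrete, here is a minimal sketch of the deferred-deletion idea. The DeferredPurger helper and its wiring are hypothetical (not the actual driver change); a real patch would hook into the driver's periodic-task machinery and re-scan the trash directory on restart, but it would use the same ceph_volume_client calls the driver already makes:

    import queue
    import threading

    from ceph_volume_client import VolumePath


    class DeferredPurger(object):
        """Purge deleted shares in the background, off the request path."""

        def __init__(self, volume_client):
            self._volume_client = volume_client
            self._pending = queue.Queue()
            self._worker = threading.Thread(target=self._run, daemon=True)
            self._worker.start()

        def delete_share(self, group_id, volume_id):
            # Fast path: delete_volume() only moves the directory aside,
            # so this returns quickly to the API caller.
            path = VolumePath(group_id, volume_id)
            self._volume_client.delete_volume(path)
            self._pending.put(path)

        def _run(self):
            # Slow path: the recursive purge of the trashed directory
            # happens here, one share at a time.
            while True:
                path = self._pending.get()
                self._volume_client.purge_volume(path)
                self._pending.task_done()

With something like this, the share-delete call returns as soon as the directory has been moved aside, and the multi-hour recursive delete no longer blocks other driver operations.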
On 12/07/19 13:03 +0000, Jose Castro Leon wrote:
Dear all,
Lately, one of our clients stored 300k files in a Manila CephFS share and then deleted the share in Manila. This made the driver unresponsive for several hours, until all the data had been removed from the cluster.
We had a quick look at the code in manila [1]: the deletion is done by calling the following APIs in the Ceph bindings, delete_volume [2] and then purge_volume [3]. The first call moves the directory to a volumes_deleted directory. The second call recursively deletes all the contents of that directory.
The last operation is the one that triggers the issue.
We had a similar issue in the past in Cinder. There, Arne proposed deferred deletion of volumes. I think we could do the same in Manila for the CephFS driver.
The idea is to keep calling delete_volume, and then have a periodic task in the driver asynchronously list the contents of that directory and trigger the purge command.
I can propose the change and contribute the code, but before going too deep I would like to know whether there is a reason for having a singleton for the volume_client connection. In the Cinder code, by comparison, the connection is established and closed in each operation with the backend.
If you are not the maintainer, could you please point me to them? I can post this on the mailing list if you prefer.
Cheers,
Jose Castro Leon
CERN Cloud Infrastructure
[1] https://github.com/openstack/manila/blob/master/manila/share/drivers/cephfs/...
[2] https://github.com/ceph/ceph/blob/master/src/pybind/ceph_volume_client.py#L7...
[3] https://github.com/ceph/ceph/blob/master/src/pybind/ceph_volume_client.py#L7...
PS: The issue was triggered by one of our clients in Kubernetes using the Manila CSI driver.
Hi Jose,

Let's get this fixed since there's a lot of interest in the Manila CSI driver and I think we can expect more batched deletes with it than we have had historically.

I've copied Ramana Raja and Patrick Donnelly since they will be able to answer your question about the singleton volume_client connection more authoritatively than I can.

Thanks for volunteering to propose a review to deal with this issue!

-- Tom Barron
On Fri, Jul 12, 2019 at 6:45 PM Tom Barron <tpb@dyncloud.net> wrote:
On 12/07/19 13:03 +0000, Jose Castro Leon wrote:
Dear all,
Lately, one of our clients stored 300k files in a Manila CephFS share and then deleted the share in Manila. This made the driver unresponsive for several hours, until all the data had been removed from the cluster.
We had a quick look at the code in manila [1]: the deletion is done by calling the following APIs in the Ceph bindings, delete_volume [2] and then purge_volume [3]. The first call moves the directory to a volumes_deleted directory. The second call recursively deletes all the contents of that directory.
The last operation is the one that triggers the issue.
We had a similar issue in the past in Cinder. There, Arne proposed deferred deletion of volumes. I think we could do the same in Manila for the CephFS driver.
The idea is to keep calling delete_volume, and then have a periodic task in the driver asynchronously list the contents of that directory and trigger the purge command.
I can propose the change and contribute the code, but before going too deep I would like to know whether there is a reason for having a singleton for the volume_client connection. In the Cinder code, by comparison, the connection is established and closed in each operation with the backend.
If you are not the maintainer, could you please point me to them? I can post this on the mailing list if you prefer.
Cheers,
Jose Castro Leon
CERN Cloud Infrastructure
[1] https://github.com/openstack/manila/blob/master/manila/share/drivers/cephfs/...
[2] https://github.com/ceph/ceph/blob/master/src/pybind/ceph_volume_client.py#L7...
[3] https://github.com/ceph/ceph/blob/master/src/pybind/ceph_volume_client.py#L7...
PS: The issue was triggered by one of our clients in Kubernetes using the Manila CSI driver.
Hi Jose,
Let's get this fixed since there's a lot of interest in the Manila CSI driver and I think we can expect more batched deletes with it than we have had historically.
The plan is to have Manila's CephFS driver use the ceph-mgr's new volumes module,
https://github.com/ceph/ceph/blob/master/src/pybind/mgr/volumes/module.py
to create/delete Manila groups/shares/snapshots and to authorize/de-authorize access to the shares. In the ceph-mgr volumes module, Manila shares (essentially CephFS subdirectories with a specific data layout and quota) are referred to as FS subvolumes, and Ceph filesystems as FS volumes.

The ceph-mgr volumes module is under active development. The latest Ceph CSI release (v1.1.0) is the first consumer of this module; Ceph CSI issues CLI calls to the ceph-mgr to manage the lifecycle of FS subvolumes,
https://github.com/ceph/ceph-csi/pull/400

We're implementing the asynchronous purge of FS subvolumes in the ceph-mgr module. The PR is close to being merged:
https://github.com/ceph/ceph/pull/28003/
https://github.com/ceph/ceph/pull/28003/commits/483a2141fe8c9a58bc25a544412c...
http://tracker.ceph.com/issues/40036
Additional reviews would be great.

Issuing the `ceph fs subvolume rm` command in the Ceph CSI driver (and later in the Manila driver) will move the FS subvolume to a trash directory, whose contents will then be asynchronously purged by a set of worker threads.
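For readers not yet familiar with the new interface, here is a rough sketch of what a consumer of the volumes module does. The FS volume name "cephfs" and subvolume name "share-1234" are placeholders, and a Nautilus-or-newer cluster with the volumes module enabled is assumed:

    import subprocess

    def ceph(*args):
        # Thin wrapper around the ceph CLI; the `fs subvolume` commands are
        # served by the ceph-mgr volumes module described above.
        result = subprocess.run(("ceph",) + args, check=True,
                                capture_output=True, text=True)
        return result.stdout.strip()

    ceph("fs", "subvolume", "create", "cephfs", "share-1234")
    path = ceph("fs", "subvolume", "getpath", "cephfs", "share-1234")

    # rm returns once the subvolume has been moved to the trash directory;
    # the data purge itself happens asynchronously inside the ceph-mgr.
    ceph("fs", "subvolume", "rm", "cephfs", "share-1234")

The key point for the problem Jose reported is the last call: the expensive recursive delete is no longer in the caller's request path.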
I've copied Ramana Raja and Patrick Donnelly since they will be able to answer your question about the singleton volume_client connection more authoritatively than I can.
Currently, in the mgr-volumes module we establish and close a connection to an FS volume (a Ceph filesystem) for each FS subvolume (a CephFS subdirectory within the filesystem) operation:
https://github.com/ceph/ceph/pull/28082/commits/8d29816f0f3db6c7d287bbb7469d...

Instead, we want to maintain a connection to an FS volume and perform operations on its subvolumes until the FS volume is deleted. This would reduce the time taken to perform subvolume operations, which matters in CSI workloads (and in OpenStack workloads?). The code is in review:
https://github.com/ceph/ceph/pull/28003/commits/5c41e949af9acabd612b0644de06...

Thanks,
Ramana
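A minimal sketch of that connection-caching pattern is below; connect_fn and disconnect_fn are placeholders for whatever actually mounts and unmounts libcephfs, and the real change is the one in the PR linked above:

    import threading


    class ConnectionPool(object):
        """Cache one open handle per FS volume, reused across subvolume ops."""

        def __init__(self, connect_fn, disconnect_fn):
            self._connect = connect_fn
            self._disconnect = disconnect_fn
            self._handles = {}
            self._lock = threading.Lock()

        def get(self, volume_name):
            # Reuse an existing handle instead of mounting per operation.
            with self._lock:
                handle = self._handles.get(volume_name)
                if handle is None:
                    handle = self._connect(volume_name)
                    self._handles[volume_name] = handle
                return handle

        def drop(self, volume_name):
            # Called when the FS volume itself is deleted.
            with self._lock:
                handle = self._handles.pop(volume_name, None)
            if handle is not None:
                self._disconnect(handle)

The trade-off is that a cached mount can go stale or be evicted, so a real implementation also needs a way to drop and re-establish handles on error.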
Thanks for volunteering to propose a review to deal with this issue!
-- Tom Barron
On 15/07/19 16:49 +0530, Ramana Raja wrote:
On Fri, Jul 12, 2019 at 6:45 PM Tom Barron <tpb@dyncloud.net> wrote:
On 12/07/19 13:03 +0000, Jose Castro Leon wrote:
Dear all,
Lately, one of our clients stored 300k files in a Manila CephFS share and then deleted the share in Manila. This made the driver unresponsive for several hours, until all the data had been removed from the cluster.
We had a quick look at the code in manila [1]: the deletion is done by calling the following APIs in the Ceph bindings, delete_volume [2] and then purge_volume [3]. The first call moves the directory to a volumes_deleted directory. The second call recursively deletes all the contents of that directory.
The last operation is the one that triggers the issue.
We had a similar issue in the past in Cinder. There, Arne proposed deferred deletion of volumes. I think we could do the same in Manila for the CephFS driver.
The idea is to keep calling delete_volume, and then have a periodic task in the driver asynchronously list the contents of that directory and trigger the purge command.
I can propose the change and contribute the code, but before going too deep I would like to know whether there is a reason for having a singleton for the volume_client connection. In the Cinder code, by comparison, the connection is established and closed in each operation with the backend.
If you are not the maintainer, could you please point me to them? I can post this on the mailing list if you prefer.
Cheers,
Jose Castro Leon
CERN Cloud Infrastructure
[1] https://github.com/openstack/manila/blob/master/manila/share/drivers/cephfs/...
[2] https://github.com/ceph/ceph/blob/master/src/pybind/ceph_volume_client.py#L7...
[3] https://github.com/ceph/ceph/blob/master/src/pybind/ceph_volume_client.py#L7...
PS: The issue was triggered by one of our clients in Kubernetes using the Manila CSI driver.
Hi Jose,
Let's get this fixed since there's a lot of interest in the Manila CSI driver and I think we can expect more batched deletes with it than we have had historically.
The plan is to have Manila's CephFS driver use the ceph-mgr's new volumes module,
https://github.com/ceph/ceph/blob/master/src/pybind/mgr/volumes/module.py
to create/delete Manila groups/shares/snapshots and to authorize/de-authorize access to the shares. In the ceph-mgr volumes module, Manila shares (essentially CephFS subdirectories with a specific data layout and quota) are referred to as FS subvolumes, and Ceph filesystems as FS volumes.
The ceph-mgr volumes module is under active development. The latest Ceph CSI release (v1.1.0) is the first consumer of this module; Ceph CSI issues CLI calls to the ceph-mgr to manage the lifecycle of FS subvolumes,
https://github.com/ceph/ceph-csi/pull/400
We're implementing the asynchronous purge of FS subvolumes in the ceph-mgr module. The PR is close to being merged:
https://github.com/ceph/ceph/pull/28003/
https://github.com/ceph/ceph/pull/28003/commits/483a2141fe8c9a58bc25a544412c...
http://tracker.ceph.com/issues/40036
Additional reviews would be great. Issuing the `ceph fs subvolume rm` command in the Ceph CSI driver (and later in the Manila driver) will move the FS subvolume to a trash directory, whose contents will then be asynchronously purged by a set of worker threads.
I've copied Ramana Raja and Patrick Donnelly since they will be able to answer your question about the singleton volume_client connection more authoritatively than I can.
Currently, in the mgr-volumes module we establish and close a connection to an FS volume (a Ceph filesystem) for each FS subvolume (a CephFS subdirectory within the filesystem) operation:
https://github.com/ceph/ceph/pull/28082/commits/8d29816f0f3db6c7d287bbb7469d...
Instead, we want to maintain a connection to an FS volume and perform operations on its subvolumes until the FS volume is deleted. This would reduce the time taken to perform subvolume operations, which matters in CSI workloads (and in OpenStack workloads?). The code is in review:
https://github.com/ceph/ceph/pull/28003/commits/5c41e949af9acabd612b0644de06...
Thanks, Ramana
Thanks for volunteering to propose a review to deal with this issue!
-- Tom Barron
Jose,

I think it will be better to have the async expunge managed by CephFS itself rather than by a periodic task in Manila. For one thing, non-Manila CephFS clients like Ceph-CSI have the same issue, so they will benefit from the approach Ramana describes. Also, other storage backends that need to reclaim space for deleted shares/volumes in the background do this themselves rather than relying on the client (Manila in this case) to manage expunge or its equivalents. Do you agree?

Victoria Martinez de la Cruz will be working more generally to adapt the Manila CephFS driver to use 'ceph fs subvolume' rather than the current ceph_volume_client library calls, so perhaps you can propose the modification of the share deletion code in that context.

I understand from Ramana that the new CephFS interface we need for all this will only be available in Nautilus, so we'll need to think through compatibility issues for deployers running earlier versions of Ceph. Does needing Ceph Nautilus to get a proper solution for volume expunges pose any significant issues for CERN?

Thanks,

-- Tom Barron
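One possible shape for that compatibility handling, purely as an illustration (a real driver change might instead key off a configuration option or query the cluster's daemon versions rather than the local CLI, as the comments note):

    import subprocess

    def local_ceph_version():
        # `ceph version` prints e.g. "ceph version 14.2.2 (...) nautilus (stable)".
        # Note this is the version of the local ceph CLI, used here only as a
        # stand-in; checking the cluster's mon/mgr version would be more robust.
        out = subprocess.run(["ceph", "version"], check=True,
                             capture_output=True, text=True).stdout
        release = out.split()[2]
        return tuple(int(part) for part in release.split(".")[:2])

    # The `fs subvolume` interface is only expected on Nautilus (14.x) or newer.
    HAS_SUBVOLUME_INTERFACE = local_ceph_version() >= (14, 0)

Drivers could then route share deletion through the mgr volumes interface when it is available and fall back to the existing ceph_volume_client path (possibly with a Manila-side deferred purge) otherwise.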