Thanks Ben for your feedback. I already tried to follow the `remove_external_lock_file` approach a few months ago, but unfortunately I don't think we can go that route with Cinder... As Gorka explained to me a few months ago:
Those are not the only types of locks we use in Cinder. Those are the ones we call "Global locks"; they use Tooz so the DLM can be configured for Cinder Active-Active.
We also use Oslo's synchronized locks.
More information is available in the Cinder HA dev ref I wrote last year. It has a section dedicated to the topic of mutual exclusion and the 4 types we currently have in Cinder [1]:
- Database locking using resource states.
- Process locks.
- Node locks.
- Global locks.
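For context on why these files linger: node locks in this family are file-based, in the style of fasteners' `InterProcessLock`, which creates the lock file on acquire and deliberately never deletes it on release. The following is a minimal stdlib sketch of that mechanic, purely illustrative and not Cinder's or fasteners' real code (the class name and path are made up):

```python
import fcntl
import os

# Illustrative sketch of a file-based "node lock" (not the real
# fasteners.InterProcessLock implementation).
class FileLock:
    def __init__(self, path):
        self.path = path
        self.fd = None

    def acquire(self):
        # Create the lock file if missing and take an exclusive flock on it.
        self.fd = os.open(self.path, os.O_CREAT | os.O_RDWR)
        fcntl.flock(self.fd, fcntl.LOCK_EX)

    def release(self):
        # Release the flock but do NOT delete the file: deleting it while
        # another process still holds a descriptor to it would silently break
        # mutual exclusion. This is why the files accumulate on disk.
        fcntl.flock(self.fd, fcntl.LOCK_UN)
        os.close(self.fd)
        self.fd = None

lock = FileLock("/tmp/demo-delete_volume.lock")
lock.acquire()
lock.release()
print(os.path.exists("/tmp/demo-delete_volume.lock"))  # the file persists
```

The key design point is in `release()`: safety of the lock protocol requires leaving the file behind, which is exactly the leftover-file symptom this thread is about.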
As for calling `remove_external_lock_file_with_prefix` directly on delete, I don't think that's something we can do, as the locks may still be in use. Example:
- Start deleting volume -> get lock
- Try to clone volume -> wait for lock
- Finish deleting volume -> release and delete lock
- Cloning recreates the lock when acquiring it
- Cloning fails because the volume no longer exists but leaves the lock
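The sequence above can be simulated with a toy stdlib script, assuming a fasteners-style lock that (re)creates its file on acquire; the path, volume id, and in-memory "DB" are all illustrative:

```python
import fcntl
import os
import tempfile

# Toy simulation of the delete/clone race (illustrative names throughout).
lock_path = os.path.join(tempfile.gettempdir(), "volume-1234-delete_volume")
volumes = {"1234": "available"}  # stand-in for the Cinder DB

def acquire(path):
    fd = os.open(path, os.O_CREAT | os.O_RDWR)  # (re)creates the lock file
    fcntl.flock(fd, fcntl.LOCK_EX)
    return fd

def release(fd):
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)

# 1. Deletion takes the lock, deletes the volume, then releases AND removes
#    the lock file (the behavior being proposed).
fd = acquire(lock_path)
del volumes["1234"]
release(fd)
os.remove(lock_path)

# 2. A concurrent clone then acquires the same named lock, which recreates
#    the file...
fd = acquire(lock_path)
clone_ok = "1234" in volumes  # ...but the volume is gone, so cloning fails
release(fd)

print(clone_ok, os.path.exists(lock_path))  # the orphan lock file remains
```

In a single process the two steps run back to back, but the end state is the same as in the concurrent case: the clone fails and leaves a fresh lock file behind, so deleting on release gains nothing.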
So the Cinder workflow and mechanisms seem to definitively rule out using the remove features of oslo.concurrency... Also, as discussed on the review (https://review.opendev.org/#/c/688413), this issue can't be fixed in the underlying libraries. I think that if we want to fix it on stable branches, then Cinder needs to address it directly, by adding code that is triggered only when needed and in a safe manner. In other words, only Cinder can really address this and remove these files safely. See the discussion extract on the review (https://review.opendev.org/#/c/688413):
Thanks Gorka for your feedback; in view of all the discussions about this topic, I suppose only Cinder can really address it safely on stable branches.
It is not a safe assumption that *-delete_volume file locks can be removed just because they have not been used in a couple of days. A new volume clone could come in that would use it and then we could have a race condition if the cron job was running.
The only way to be sure that it can be removed is checking in the Cinder DB and making sure that the volume has been deleted or it doesn't even exist (DB has been purged).
Same thing with detach_volume, delete_snapshot, and those that are directly volume ids locks.
I definitely think it can't be fixed in the underlying libraries as Eric suggested [1]; indeed, as you explained, only Cinder can know whether a lock file can be removed safely.
In my opinion the fix should be done in fasteners, or we should add code in Cinder that cleans up all locks related to a volume or snapshot when this one is deleted.
I agree the best solution is to fix the root cause, i.e. fix fasteners, but I don't think that can be backported to stable branches, because we would need to bump a requirement version on a stable branch and because it would introduce new features. So I guess Cinder needs to add some code to remove these files, and possibly backport it to stable branches.
[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-September/009563...
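The removal rule Gorka describes above can be sketched as a small helper: a lock file named after a volume id may be removed only once the DB confirms the volume is deleted or purged. This is a hedged sketch only; `db_volume_status`, the dict-based "DB", and the file paths are stand-ins, not Cinder's real API:

```python
import os

def db_volume_status(volume_id, db):
    # Stand-in for a real Cinder DB lookup; None means the row was purged.
    return db.get(volume_id)

def try_remove_lock_file(lock_path, volume_id, db):
    """Remove a per-volume lock file only when it can no longer be needed."""
    status = db_volume_status(volume_id, db)
    if status in (None, "deleted"):
        try:
            os.remove(lock_path)
        except FileNotFoundError:
            pass  # already cleaned up, nothing to do
        return True
    # Volume still exists: a clone/attach could legitimately take this lock,
    # so removing the file here would reintroduce the race.
    return False

# Demo: vol-a is still live, vol-b has been purged from the DB.
db = {"vol-a": "available"}
open("/tmp/vol-a-delete_volume", "w").close()
open("/tmp/vol-b-delete_volume", "w").close()

removed_a = try_remove_lock_file("/tmp/vol-a-delete_volume", "vol-a", db)
removed_b = try_remove_lock_file("/tmp/vol-b-delete_volume", "vol-b", db)
print(removed_a, removed_b)  # only vol-b's lock file is removed
```

The same gate would apply to `detach_volume`, `delete_snapshot`, and the bare volume-id locks mentioned above, which is why the check has to live in Cinder rather than in a cron job or in oslo.concurrency.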
The fasteners fix, IMHO, can only be used by future versions of OpenStack, due to the version bump and the new features it adds. I think it could be available only from Ussuri or a later cycle like V. The main goal of the cron approach was to avoid having to unearth this topic every 6 months, to try to address it on stable branches, and to take care of the file system usage even if it's a theoretical issue; but given the feedback and warnings from the Cinder team, I don't think that track is still viable. Definitely, this is not an oslo.concurrency bug.

Anyway, your proposed "Administrator Guide" is a must-have, to track things in one place, inform users, and avoid spending time explaining the same things again and again on this topic... so it's worth it. I'll review it and contribute my knowledge on the subject. oslo.concurrency can't address this safely because we would risk introducing race conditions and worse situations than the leftover lock files. So, given all these elements, only Cinder can address it for the moment, and fix it on stable branches too.

On Wed, Oct 16, 2019 at 00:15, Ben Nemec <openstack@nemebean.com> wrote:
In the interest of not having to start this discussion from scratch every time, I've done a bit of a brain dump into https://review.opendev.org/#/c/688825/ that covers why things are the way they are and what we recommend people do about it. Please take a look and let me know if you see any issues with it.
Thanks.
-Ben
--
Hervé Beraud
Senior Software Engineer
Red Hat - Openstack Oslo
irc: hberaud