Thanks Ben for your feedback. I already tried to follow the `remove_external_lock_file` approach a few months ago, but unfortunately I don't think we can go that route with Cinder... As Gorka explained to me a few months ago:
Those are not the only types of locks we use in Cinder. Those are the ones we call "Global locks"; they use Tooz so the DLM can be configured for Cinder Active-Active.
We also use Oslo's synchronized locks.
More information is available in the Cinder HA dev ref I wrote last year. It has a section dedicated to the topic of mutual exclusion and the 4 types we currently have in Cinder [1]:
- Database locking using resource states.
- Process locks.
- Node locks.
- Global locks.
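For context on why these files linger: node locks in this family are file-based, in the style of fasteners' `InterProcessLock`, which creates the lock file on acquire and deliberately never deletes it on release. The following is a minimal stdlib sketch of that mechanic, purely illustrative and not Cinder's or fasteners' real code (the class name and path are made up):

```python
import fcntl
import os

# Illustrative sketch of a file-based "node lock" (not the real
# fasteners.InterProcessLock implementation).
class FileLock:
    def __init__(self, path):
        self.path = path
        self.fd = None

    def acquire(self):
        # Create the lock file if missing and take an exclusive flock on it.
        self.fd = os.open(self.path, os.O_CREAT | os.O_RDWR)
        fcntl.flock(self.fd, fcntl.LOCK_EX)

    def release(self):
        # Release the flock but do NOT delete the file: deleting it while
        # another process still holds a descriptor to it would silently break
        # mutual exclusion. This is why the files accumulate on disk.
        fcntl.flock(self.fd, fcntl.LOCK_UN)
        os.close(self.fd)
        self.fd = None

lock = FileLock("/tmp/demo-delete_volume.lock")
lock.acquire()
lock.release()
print(os.path.exists("/tmp/demo-delete_volume.lock"))  # the file persists
```

The key design point is in `release()`: safety of the lock protocol requires leaving the file behind, which is exactly the leftover-file symptom this thread is about.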
As for calling `remove_external_lock_file_with_prefix` directly on delete, I don't think that's something we can do, as the locks may still be in use. Example:
- Start deleting volume -> get lock
- Try to clone volume -> wait for lock
- Finish deleting volume -> release and delete lock
- Cloning recreates the lock when acquiring it
- Cloning fails because the volume no longer exists but leaves the lock
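The sequence above can be simulated with a toy stdlib script, assuming a fasteners-style lock that (re)creates its file on acquire; the path, volume id, and in-memory "DB" are all illustrative:

```python
import fcntl
import os
import tempfile

# Toy simulation of the delete/clone race (illustrative names throughout).
lock_path = os.path.join(tempfile.gettempdir(), "volume-1234-delete_volume")
volumes = {"1234": "available"}  # stand-in for the Cinder DB

def acquire(path):
    fd = os.open(path, os.O_CREAT | os.O_RDWR)  # (re)creates the lock file
    fcntl.flock(fd, fcntl.LOCK_EX)
    return fd

def release(fd):
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)

# 1. Deletion takes the lock, deletes the volume, then releases AND removes
#    the lock file (the behavior being proposed).
fd = acquire(lock_path)
del volumes["1234"]
release(fd)
os.remove(lock_path)

# 2. A concurrent clone then acquires the same named lock, which recreates
#    the file...
fd = acquire(lock_path)
clone_ok = "1234" in volumes  # ...but the volume is gone, so cloning fails
release(fd)

print(clone_ok, os.path.exists(lock_path))  # the orphan lock file remains
```

In a single process the two steps run back to back, but the end state is the same as in the concurrent case: the clone fails and leaves a fresh lock file behind, so deleting on release gains nothing.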
So the Cinder workflow and mechanisms seem to definitively rule out using the remove features of oslo.concurrency... Also, as discussed on the review (https://review.opendev.org/#/c/688413), this issue can't be fixed in the underlying libraries. I think that if we want to fix it on stable branches, then Cinder needs to address it directly, by adding code that is triggered only when needed and in a safe manner. In other words, only Cinder can really address this and remove these files safely. See the discussion extract on the review (https://review.opendev.org/#/c/688413):
Thanks Gorka for your feedback; in view of all the discussions about this topic, I suppose only Cinder can really address it safely on stable branches.
It is not a safe assumption that *-delete_volume file locks can be removed just because they have not been used in a couple of days. A new volume clone could come in that would use it and then we could have a race condition if the cron job was running.
The only way to be sure that it can be removed is checking in the Cinder DB and making sure that the volume has been deleted or it doesn't even exist (DB has been purged).
Same thing with detach_volume, delete_snapshot, and those that are directly volume ids locks.
I definitely think it can't be fixed in the underlying libraries as Eric suggested [1]; indeed, as you explained, only Cinder can know whether a lock file can be removed safely.
In my opinion the fix should be done in fasteners, or we should add code in Cinder that cleans up all locks related to a volume or snapshot when this one is deleted.
I agree the best solution is to fix the root cause, i.e. fix fasteners, but I don't think that can be backported to stable branches, because we would need to bump a requirement version on a stable branch and because it would introduce new features. So I guess Cinder needs to add some code to remove these files, and possibly backport it to stable branches.
[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-September/009563...
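The removal rule Gorka describes above can be sketched as a small helper: a lock file named after a volume id may be removed only once the DB confirms the volume is deleted or purged. This is a hedged sketch only; `db_volume_status`, the dict-based "DB", and the file paths are stand-ins, not Cinder's real API:

```python
import os

def db_volume_status(volume_id, db):
    # Stand-in for a real Cinder DB lookup; None means the row was purged.
    return db.get(volume_id)

def try_remove_lock_file(lock_path, volume_id, db):
    """Remove a per-volume lock file only when it can no longer be needed."""
    status = db_volume_status(volume_id, db)
    if status in (None, "deleted"):
        try:
            os.remove(lock_path)
        except FileNotFoundError:
            pass  # already cleaned up, nothing to do
        return True
    # Volume still exists: a clone/attach could legitimately take this lock,
    # so removing the file here would reintroduce the race.
    return False

# Demo: vol-a is still live, vol-b has been purged from the DB.
db = {"vol-a": "available"}
open("/tmp/vol-a-delete_volume", "w").close()
open("/tmp/vol-b-delete_volume", "w").close()

removed_a = try_remove_lock_file("/tmp/vol-a-delete_volume", "vol-a", db)
removed_b = try_remove_lock_file("/tmp/vol-b-delete_volume", "vol-b", db)
print(removed_a, removed_b)  # only vol-b's lock file is removed
```

The same gate would apply to `detach_volume`, `delete_snapshot`, and the bare volume-id locks mentioned above, which is why the check has to live in Cinder rather than in a cron job or in oslo.concurrency.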
The fasteners fix, IMHO, can only be used by future versions of OpenStack, due to the version bump and the new features it adds. I think it could be available only from Ussuri or a later cycle like V. The main goal of the cron approach was to avoid having to unearth this topic every 6 months, to try to address it on stable branches, and to take care of the file system usage even if it's a theoretical issue; but given the feedback and warnings from the Cinder team, I don't think that track is still viable. Definitely, this is not an oslo.concurrency bug.

Anyway, your proposed "Administrator Guide" is a must-have, to track things in one place, inform users, and avoid spending time explaining the same things again and again on this topic... so it's worth it. I'll review it and contribute my knowledge on the subject. oslo.concurrency can't address this safely because we would risk introducing race conditions and worse situations than the leftover lock files. So, given all these elements, only Cinder can address it for the moment, and fix it on stable branches too.

On Wed, Oct 16, 2019 at 00:15, Ben Nemec <openstack@nemebean.com> wrote:
In the interest of not having to start this discussion from scratch every time, I've done a bit of a brain dump into https://review.opendev.org/#/c/688825/ that covers why things are the way they are and what we recommend people do about it. Please take a look and let me know if you see any issues with it.
Thanks.
-Ben
--
Hervé Beraud
Senior Software Engineer
Red Hat - Openstack Oslo
irc: hberaud