[oslo][cinder] Lots of leftover files in /var/lib/cinder
Hello,

By starting this thread I want to discuss a known issue that impacts several OpenStack projects. Projects that use oslo.concurrency lockutils to lock processes end up with several leftover files that are not automatically removed. You can find a related issue on the Red Hat Bugzilla[1].

It's not really an oslo.concurrency issue; it's a known fasteners issue[2] that is not yet fixed on the fasteners side, but some related changes[3] are currently under review. oslo.concurrency already provides a workaround[4] that all projects can use to address this temporarily while waiting for the official fasteners fix to be released.

I volunteer to help people and projects use the oslo.concurrency cleaning method, but I'm not sure where the changes need to go (refer to [1]) outside the Oslo scope. Also, I guess other projects (nova, etc.) have the same issue. I need help from the experts of these projects to know where we need to put the changes (using oslo.concurrency remove_external_lock_file_with_prefix). Otherwise, if projects want to introduce these changes themselves, I can help by double-checking with my Oslo hat on.

Also, I guess some projects reimplement the same approach as the oslo.concurrency module to lock processes by using fasteners directly; in that case I think they need to use oslo.concurrency to avoid the problem too.

Do not hesitate to reply on this thread to track useful information, and to add me on project reviews if you decide to introduce these changes on your side.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1647469
[2] https://github.com/harlowja/fasteners/issues/26
[3] https://github.com/harlowja/fasteners/pull/10
[4] https://docs.openstack.org/oslo.concurrency/latest/reference/lockutils.html#...

Thank you for your attention.
--
Hervé Beraud
Senior Software Engineer
Red Hat - Openstack Oslo
irc: hberaud
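(For readers who want to see what the workaround in [4] looks like in practice, here is a minimal sketch; the lock name and prefix below are invented for illustration, and the exact lockutils signatures should be checked against the linked docs.)

    from oslo_concurrency import lockutils

    LOCK_NAME = 'volume-123-delete'    # illustrative lock name
    LOCK_PREFIX = 'myservice-'         # illustrative per-project prefix

    @lockutils.synchronized(LOCK_NAME, lock_file_prefix=LOCK_PREFIX,
                            external=True)
    def delete_the_resource():
        ...  # work done while holding the external (file-based) lock

    delete_the_resource()

    # Once the guarded resource is gone for good, the matching lock file
    # can be removed so it does not accumulate under the configured
    # lock_path directory.
    lockutils.remove_external_lock_file(LOCK_NAME,
                                        lock_file_prefix=LOCK_PREFIX)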
On 3/26/19 11:27 AM, Herve Beraud wrote:
Hello,
By starting this thread I want to discuss a known issue that impacts several OpenStack projects.
Projects that use oslo.concurrency lockutils to lock processes end up with several leftover files that are not automatically removed.
You can find a related issue on the Red Hat Bugzilla[1].
This bug doesn't seem to be public. Fortunately it is a pretty well-known thing so I doubt the interested parties need it. :-)
It's not really an oslo.concurrency issue; it's a known fasteners issue[2] that is not yet fixed on the fasteners side, but some related changes[3] are currently under review.
Here's the thing: Somebody reports this "bug" about once every six months, but I have yet to see a report where anything is actually breaking. In my experience it is exclusively a cosmetic thing.

Furthermore, past attempts to fix this behavior have always resulted in actual problems because it turns out that interprocess locking on Linux is a bit of a disaster. I've become rather hesitant to mess with this code over the years because of all the edge cases we keep running across. For example, looking at the proposed fixes in fasteners, I can tell you the lack of Windows support for offset locks is an issue. We can obviously fall back to file locks there, but it's one more code path to maintain, and one that is untested in the gate at that. I'm also curious if Victor's O_TMPFILE option works on NFS because I know we ran into an issue with that in the past too.

So I guess what I'm saying is that "fixing" this "problem" is trickier than it might appear and I'm dubious of the value.
oslo.concurrency already provides a workaround[4] that all projects can use to address this temporarily while waiting for the official fasteners fix to be released.
I volunteer to help people and projects use the oslo.concurrency cleaning method, but I'm not sure where the changes need to go (refer to [1]) outside the Oslo scope.
Also I guess other projects (nova, etc...) have the same issue.
I believe Nova was actually the original consumer of the remove_external_lock API: https://review.openstack.org/#/c/144891/1/nova/virt/libvirt/imagecache.py It's possible we could do something similar for Cinder, but I have to admit that at first glance the locking strategy there doesn't quite make sense to me. Apparently each operation on a volume gets its own lock? That seems like it opens up the possibility of one process trying to update a volume while another deletes it.
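For reference, the pattern in that review looks roughly like the sketch below, reconstructed from memory; the lock name is invented and Nova's actual helper names may differ slightly:

    from oslo_concurrency import lockutils

    # Bind the project prefix once, the way nova.utils does, so all callers
    # produce consistently named lock files under the configured lock_path.
    synchronized = lockutils.synchronized_with_prefix('nova-')
    remove_lock = lockutils.remove_external_lock_file_with_prefix('nova-')

    @synchronized('image-cache-somehash', external=True)
    def remove_base_file():
        ...  # delete the cached image while holding the external lock

    remove_base_file()
    # The cached image no longer exists, so its lock file can go too.
    remove_lock('image-cache-somehash')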
I need help from the experts of these projects to know where we need to put the changes (using oslo.concurrency remove_external_lock_file_with_prefix).
This is part of the problem. You have to be very careful about removing lock files (note that this would apply to offset locks too) because if there's any chance another process would still try to use it you may create a race condition. Some lock files just can't be removed safely. I know at one point we had looked into deleting lock files instead of unlocking them, on the assumption that then any waiting locks would just fight over who gets to re-create the file. I don't think it actually worked though. Maybe the waiting locks didn't recognize that the file had gone away and waited indefinitely? My memory is pretty hazy though so it might be something to investigate again (and write down the results this time ;-).
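To make the race concrete, here is a rough standalone sketch (plain fcntl/flock, not oslo.concurrency or fasteners) of what can go wrong when the holder unlinks the lock file on release: a process already waiting on the old inode and a newcomer that recreates the path can both end up "holding" the lock at the same time.

    import fcntl
    import os
    import time
    from multiprocessing import Process

    PATH = '/tmp/demo.lock'

    def take_lock(tag, delay_open=0.0, unlink_after=False):
        time.sleep(delay_open)
        fd = os.open(PATH, os.O_CREAT | os.O_WRONLY)
        fcntl.flock(fd, fcntl.LOCK_EX)   # blocks until the lock is free
        print(tag, 'acquired the lock')
        time.sleep(1)
        if unlink_after:
            os.unlink(PATH)              # the "cleanup" step
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

    if __name__ == '__main__':
        a = Process(target=take_lock, args=('A',),
                    kwargs={'unlink_after': True})
        b = Process(target=take_lock, args=('B', 0.2))  # waits on old inode
        c = Process(target=take_lock, args=('C', 1.5))  # opens recreated file
        for p in (a, b, c):
            p.start()
        for p in (a, b, c):
            p.join()
        # B and C can both report holding the lock at the same time, because
        # they are locking two different inodes that share the same path.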
Otherwise, if projects want to introduce these changes, I can help by double-checking with my Oslo hat on.
Also, I guess some projects reimplement the same approach as the oslo.concurrency module to lock processes by using fasteners directly; in that case I think they need to use oslo.concurrency to avoid the problem too.
Do not hesitate to reply on this thread to track useful information, and to add me on project reviews if you decide to introduce these changes on your side.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1647469 [2] https://github.com/harlowja/fasteners/issues/26 [3] https://github.com/harlowja/fasteners/pull/10 [4] https://docs.openstack.org/oslo.concurrency/latest/reference/lockutils.html#...
Thank you for your attention. -- Hervé Beraud Senior Software Engineer Red Hat - Openstack Oslo irc: hberaud
On 26/03, Ben Nemec wrote:
On 3/26/19 11:27 AM, Herve Beraud wrote:
Hello,
By starting this thread I want to discuss a known issue that impacts several OpenStack projects.
Projects that use oslo.concurrency lockutils to lock processes end up with several leftover files that are not automatically removed.
You can find a related issue on the Red Hat Bugzilla[1].
This bug doesn't seem to be public. Fortunately it is a pretty well-known thing so I doubt the interested parties need it. :-)
It's not really an oslo.concurrency issue; it's a known fasteners issue[2] that is not yet fixed on the fasteners side, but some related changes[3] are currently under review.
Here's the thing: Somebody reports this "bug" about once every six months, but I have yet to see a report where anything is actually breaking. In my experience it is exclusively a cosmetic thing.
Furthermore, past attempts to fix this behavior have always resulted in actual problems because it turns out that interprocess locking on Linux is a bit of a disaster. I've become rather hesitant to mess with this code over the years because of all the edge cases we keep running across. For example, looking at the proposed fixes in fasteners, I can tell you the lack of Windows support for offset locks is an issue. We can obviously fall back to file locks there, but it's one more code path to maintain, and one that is untested in the gate at that. I'm also curious if Victor's O_TMPFILE option works on NFS because I know we ran into an issue with that in the past too.
So I guess what I'm saying is that "fixing" this "problem" is trickier than it might appear and I'm dubious of the value.
oslo.concurrency already provides a workaround[4] that all projects can use to address this temporarily while waiting for the official fasteners fix to be released.
I volunteer to help people and projects use the oslo.concurrency cleaning method, but I'm not sure where the changes need to go (refer to [1]) outside the Oslo scope.
Also I guess other projects (nova, etc...) have the same issue.
I believe Nova was actually the original consumer of the remove_external_lock API: https://review.openstack.org/#/c/144891/1/nova/virt/libvirt/imagecache.py
It's possible we could do something similar for Cinder, but I have to admit that at first glance the locking strategy there doesn't quite make sense to me. Apparently each operation on a volume gets its own lock? That seems like it opens up the possibility of one process trying to update a volume while another deletes it.
Hi,

Minor clarification about Cinder and locks. In Cinder we prevent undesired concurrent access using locks, and volume states using conditional DB changes.

In the case of locks, like you say, we have a lock per volume, and we use the same lock on the appropriate methods. For example, in delete_volume we have:

    @coordination.synchronized('{volume.id}-{f_name}')

And in the clone operation we construct the same lock for the source volume:

    locked_action = "%s-%s" % (source_volid, 'delete_volume')

And use it to run the creation flow:

    with coordination.COORDINATOR.get_lock(locked_action):
        _run_flow()

Cheers,
Gorka.
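Put together, the pattern Gorka describes reads roughly like the sketch below (method signatures abbreviated and bodies elided; only the lock handling is shown, so treat it as an illustration rather than actual Cinder code):

    from cinder import coordination

    @coordination.synchronized('{volume.id}-{f_name}')
    def delete_volume(self, context, volume):
        # f_name resolves to 'delete_volume', so the lock name is
        # '<volume id>-delete_volume'
        ...

    def create_cloned_volume(self, context, volume, source_volid):
        # Build the same lock name the delete path would take for the
        # source volume, so cloning it and deleting it are serialized.
        locked_action = "%s-%s" % (source_volid, 'delete_volume')
        with coordination.COORDINATOR.get_lock(locked_action):
            _run_flow()  # run the volume creation flow under the lock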
I need help from the experts of these projects to know where we need to put the changes (using oslo.concurrency remove_external_lock_file_with_prefix).
This is part of the problem. You have to be very careful about removing lock files (note that this would apply to offset locks too) because if there's any chance another process would still try to use it you may create a race condition. Some lock files just can't be removed safely.
I know at one point we had looked into deleting lock files instead of unlocking them, on the assumption that then any waiting locks would just fight over who gets to re-create the file. I don't think it actually worked though. Maybe the waiting locks didn't recognize that the file had gone away and waited indefinitely?
My memory is pretty hazy though so it might be something to investigate again (and write down the results this time ;-).
Otherwise, if projects want to introduce these changes, I can help by double-checking with my Oslo hat on.
Also, I guess some projects reimplement the same approach as the oslo.concurrency module to lock processes by using fasteners directly; in that case I think they need to use oslo.concurrency to avoid the problem too.
Do not hesitate to reply on this thread to track useful information, and to add me on project reviews if you decide to introduce these changes on your side.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1647469 [2] https://github.com/harlowja/fasteners/issues/26 [3] https://github.com/harlowja/fasteners/pull/10 [4] https://docs.openstack.org/oslo.concurrency/latest/reference/lockutils.html#...
Thank you for your attention. -- Hervé Beraud Senior Software Engineer Red Hat - Openstack Oslo irc: hberaud
participants (3)
- Ben Nemec
- Gorka Eguileor
- Herve Beraud