[openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

Joshua Harlow harlowja at fastmail.com
Mon Nov 30 20:22:28 UTC 2015


Joshua Harlow wrote:
> Ben Nemec wrote:
>> On 11/30/2015 12:42 PM, Joshua Harlow wrote:
>>> Hi all,
>>>
>>> I just wanted to bring up an issue, possible solution and get feedback
>>> on it from folks because it seems to be an on-going problem that shows
>>> up not when an application is initially deployed, but as on-going
>>> operation of that application proceeds (i.e. after running for
>>> a period of time).
>>>
>>> The gist of the problem is the following:
>>>
>>> A <<pick your favorite openstack project>> has a need to ensure that no
>>> application on the same machine can manipulate a given resource on that
>>> same machine, so it uses the lock file pattern (acquire a *local* lock
>>> file for that resource, manipulate that resource, release that lock
>>> file) to do actions on that resource in a safe manner (note this does
>>> not ensure safety outside of that machine, lock files are *not*
>>> distributed locks).
>>>
>>> The api that we expose from oslo is typically accessed via the
>>> following:
>>>
>>> oslo_concurrency.lockutils.synchronized(name, lock_file_prefix=None,
>>> external=False, lock_path=None, semaphores=None, delay=0.01)
>>>
>>> or via its underlying library (that I extracted from oslo.concurrency
>>> and have improved to add more usefulness) @
>>> http://fasteners.readthedocs.org/
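For anyone not familiar with what external=True actually does under the hood, the pattern boils down to roughly the following (a minimal stdlib-only sketch using fcntl; this is *not* the actual oslo.concurrency implementation, and the helper name here is made up for illustration):

```python
import fcntl
import os

def with_file_lock(lock_path, resource_action):
    """Acquire an exclusive lock on lock_path, run the action, release.

    This only serializes processes on the *same* machine; it is not a
    distributed lock.
    """
    # O_CREAT means the lock file is created on first use and (by
    # default) never removed -- which is the cleanup problem this
    # thread is about.
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)   # blocks until the lock is free
        return resource_action()
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)

# Two sequential callers both succeed; the lock file stays behind after.
result = with_file_lock('/tmp/example-resource.lock', lambda: 'did work')
```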
>>>
>>> The issue though for <<your favorite openstack project>> is that each of
>>> these projects now typically has a large amount of lock files that exist
>>> or have existed and no easy way to determine when those lock files can
>>> be deleted (afaik no periodic task exists in said projects to clean up
>>> lock files, or to delete them when they are no longer in use...) so what
>>> happens is bugs like https://bugs.launchpad.net/cinder/+bug/1432387
>>> appear and there is no simple solution to clean lock files up (since
>>> oslo.concurrency is really not the right layer to know when a lock can
>>> or can not be deleted, only the application knows that...)
>>>
>>> So then we get a few creative solutions like the following:
>>>
>>> - https://review.openstack.org/#/c/241663/
>>> - https://review.openstack.org/#/c/239678/
>>> - (and others?)
>>>
>>> So I wanted to ask the question, how are people involved in <<your
>>> favorite openstack project>> cleaning up these files (are they at all?)

 From some simple greps using:

$ echo "Removal usage in" $(basename `pwd`); grep -R remove_external_lock_file *

Removal usage in cinder
<none>

Removal usage in nova
nova/virt/libvirt/imagecache.py: lockutils.remove_external_lock_file(lock_file,

Removal usage in glance
<none>

Removal usage in neutron
<none>

So me thinks people aren't cleaning any of these up :-/

>>>
>>> Another idea that I have been proposing also is to use offset locks.
>>>
>>> This would allow for not creating X lock files, but create a *single*
>>> lock file per project and use offsets into it as the way to lock. For
>>> example nova could/would create a 1MB (or larger/smaller) *empty* file
>>> for locks, that would allow for 1,048,576 locks to be used at the same
>>> time, which honestly should be way more than enough, and then there
>>> would not need to be any lock cleanup at all... Is there any reason this
>>> wasn't done back when this lock file code was initially created?
>>> (https://github.com/harlowja/fasteners/pull/10 adds this functionality
>>> to the underlying library if people want to look it over)
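To make the offset idea concrete, here's roughly what it looks like at the syscall level, using plain fcntl byte-range locks on a single shared file (an illustrative sketch only, not the actual code in that pull request; file path and names are made up):

```python
import fcntl
import os
import zlib

LOCK_FILE = '/tmp/example-offset.lock'   # one shared file per project
NUM_SLOTS = 1024 * 1024                  # 1 MiB -> 1,048,576 one-byte slots

def lock_offset(name):
    """Map a lock name to a byte offset in the shared file."""
    # Any stable hash works; note two names *can* map to the same slot,
    # which is the collision concern discussed in this thread.
    return zlib.crc32(name.encode()) % NUM_SLOTS

def acquire(fd, name):
    """Take an exclusive byte-range lock on this name's one-byte slot."""
    fcntl.lockf(fd, fcntl.LOCK_EX, 1, lock_offset(name))

def release(fd, name):
    fcntl.lockf(fd, fcntl.LOCK_UN, 1, lock_offset(name))

fd = os.open(LOCK_FILE, os.O_CREAT | os.O_RDWR, 0o644)
acquire(fd, 'instance-0001')
# ... manipulate the resource ...
release(fd, 'instance-0001')
os.close(fd)
```

The nice property being that the single file never needs per-lock cleanup; the cost is the hashing step.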
>>
>> I think the main reason was that even with a million locks available,
>> you'd have to find a way to hash the lock names to offsets in the file,
>> and a million isn't a very large collision space for that. Having two
>> differently named locks that hashed to the same offset would lead to
>> incredibly confusing bugs.
>>
>> We could switch to requiring the projects to provide the offsets instead
>> of hashing a string value, but that's just pushing the collision problem
>> off onto every project that uses us.
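(As a quick sanity check on that collision concern: by the standard birthday-bound estimate, with ~1M slots a collision becomes more likely than not after only about 1,200 distinct lock names, which a busy service could plausibly reach.)

```python
import math

# Birthday bound: with N equally likely slots, collision probability
# reaches ~50% after about sqrt(2 * N * ln 2) distinct names.
N = 1024 * 1024
names_for_50_percent = math.sqrt(2 * N * math.log(2))
print(round(names_for_50_percent))   # -> 1206
```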
>>
>> So that's the problem as I understand it, but where does that leave us
>> for solutions? First, there's
>> https://github.com/openstack/oslo.concurrency/blob/master/oslo_concurrency/lockutils.py#L151
>>
>> which allows consumers to delete lock files when they're done with them.
>> Of course, in that case the onus is on the caller to make sure the lock
>> couldn't possibly be in use anymore.
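For anyone wanting to go that route, the caller-side pattern is roughly: grab the lock non-blocking, and only unlink while holding it. A stdlib sketch (the helper name is made up; note even this has a subtle race if a waiter already open()ed the old path, which is exactly why the onus stays on the application to know the lock is dead):

```python
import errno
import fcntl
import os

def try_delete_lock_file(path):
    """Delete a lock file only if we can grab it without blocking.

    Caveat: a waiter that already open()ed the old path can still end
    up locking the unlinked inode, so this is only safe when the
    application knows no one will use this lock name again.
    """
    try:
        fd = os.open(path, os.O_RDWR)
    except OSError as e:
        if e.errno == errno.ENOENT:
            return True          # already gone
        raise
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return False             # someone holds it; leave it alone
    os.unlink(path)
    fcntl.lockf(fd, fcntl.LOCK_UN)
    os.close(fd)
    return True
```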
>
> Ya, I wonder how many folks are actually doing this, because the exposed
> API of @synchronized doesn't seem to tell you what file to even delete in
> the first place :-/ perhaps we should make that more accessible so that
> people/consumers of that code could know what to delete...
>
>>
>> Second, is this actually a problem? Modern filesystems have absurdly
>> large limits on the number of files in a directory, so it's highly
>> unlikely we would ever exhaust that, and we're creating all zero byte
>> files so there shouldn't be a significant space impact either. In the
>> past I believe our recommendation has been to simply create a cleanup
>> job that runs on boot, before any of the OpenStack services start, that
>> deletes all of the lock files. At that point you know it's safe to
>> delete them, and it prevents your lock file directory from growing
>> forever.
>
> Except as we move to never shutting an app down (always online and live
> upgrades and all that jazz), it will have to run more than just on boot,
> but point taken.
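(For the record, the boot-time cleanup Ben describes can be a one-liner run before any service starts, e.g. from an init script or a systemd ExecStartPre=; the directory and glob here are illustrative, whatever lock_path is configured to is what matters:)

```shell
#!/bin/sh
# Delete leftover external lock files before any OpenStack service starts.
# Only safe at this point in boot because nothing can be holding them yet.
LOCK_DIR="${LOCK_DIR:-/var/lib/nova/tmp}"   # whatever lock_path points at
if [ -d "$LOCK_DIR" ]; then
    find "$LOCK_DIR" -maxdepth 1 -type f -name '*lock*' -delete
fi
```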
>
>>
>> I know we've had this discussion in the past, but I don't think anyone
>> has ever told me that having lock files hang around was a functional
>> problem for them. It seems to be largely cosmetic complaints about not
>> cleaning up the old files (which, as you noted, Oslo can't really solve
>> because we have no idea when consumers are finished with locks) and
>> given the amount of trouble we've had with interprocess locking in the
>> past I've never felt that a cosmetic issue was sufficient reason to
>> reopen that can of worms. I'll just note again that every time we've
>> started messing with this stuff we run into a bunch of sticky problems
>> and edge cases, so it would take a pretty compelling argument to
>> convince me that we should do so again.
>>
>> Of course, if someone wants to take another stab at changing this stuff
>> again I guess more power to them, but to my knowledge we've finally had
>> our interprocess locking in a good state for a while now so I'm not in
>> favor of messing with it. That's my two cents -- don't spend it all in
>> one place. ;-)
>
> Fair enough, I can't speak to how much of a priority consuming projects
> place on this, https://bugs.launchpad.net/cinder/+bug/1432387 got marked
> as 'high' so I assume at least one project doesn't like them but is it
> tolerable? (unsure?)
>
>>
>>> In general I would like to hear people's thoughts/ideas/complaints/other,
>>>
>>> -Josh
>>>
>>> __________________________________________________________________________
>>>
>>> OpenStack Development Mailing List (not for usage questions)
>>> Unsubscribe:
>>> OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>>
