[openstack-dev] [oslo][all] The lock files saga (and where we can go from here)

Clint Byrum clint at fewbar.com
Mon Nov 30 20:15:47 UTC 2015


Excerpts from Joshua Harlow's message of 2015-11-30 10:42:53 -0800:
> Hi all,
> 
> I just wanted to bring up an issue, a possible solution, and get
> feedback on it from folks, because it seems to be an ongoing problem:
> one that shows up not when an application is initially deployed, but
> as operation of that application proceeds (i.e. after it has been
> running for a period of time).
> 
> The gist of the problem is the following:
> 
> A <<pick your favorite openstack project>> needs to ensure that no
> other application on the same machine can manipulate a given resource
> on that machine, so it uses the lock file pattern (acquire a *local*
> lock file for that resource, manipulate the resource, release the lock
> file) to act on that resource safely (note this does not ensure safety
> outside of that machine; lock files are *not* distributed locks).
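> 
> With the fasteners library mentioned below, for instance, the pattern
> looks roughly like this (the lock path here is illustrative):
> 
>     import fasteners
> 
>     # Hypothetical per-resource lock file under the project's lock dir.
>     lock = fasteners.InterProcessLock('/var/lib/cinder/locks/volume-123')
>     with lock:
>         pass  # manipulate the resource; peers on this host must wait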
> 
> The API that we expose from oslo is typically accessed via the following:
> 
>    oslo_concurrency.lockutils.synchronized(name, lock_file_prefix=None, 
> external=False, lock_path=None, semaphores=None, delay=0.01)
> 
> or via its underlying library (which I extracted from oslo.concurrency 
> and have improved to make more useful) @ 
> http://fasteners.readthedocs.org/
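> 
> For example, using the signature above (the lock name and path are
> made up):
> 
>     from oslo_concurrency import lockutils
> 
>     @lockutils.synchronized('volume-123', external=True,
>                             lock_path='/var/lib/cinder/locks')
>     def resize_volume():
>         pass  # only one process on this host runs this at a time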
> 
> The issue, though, for <<your favorite openstack project>> is that
> each of these projects now typically has a large number of lock files
> that exist or have existed, and no easy way to determine when those
> lock files can be deleted (AFAIK no periodic task exists in said
> projects to clean up lock files, or to delete them when they are no
> longer in use...). So bugs like
> https://bugs.launchpad.net/cinder/+bug/1432387 appear, and there is no
> simple solution for cleaning lock files up (since oslo.concurrency is
> really not the right layer to know when a lock can or cannot be
> deleted; only the application knows that...).
> 
> So then we get a few creative solutions like the following:
> 
> - https://review.openstack.org/#/c/241663/
> - https://review.openstack.org/#/c/239678/
> - (and others?)
> 
> So I wanted to ask the question: how are people involved in <<your 
> favorite openstack project>> cleaning up these files (are they at all)?
> 
> Another idea that I have been proposing is to use offset locks.
> 
> Instead of creating X lock files, this would create a *single* lock 
> file per project and use byte offsets into it as the locks. For 
> example, nova could create a 1MB (or larger/smaller) *empty* file for 
> locks, which would allow 1,048,576 locks to be used at the same time, 
> which honestly should be way more than enough, and then there would be 
> no need for any lock cleanup at all... Is there any reason this wasn't 
> done back when this lock file code was first created? 
> (https://github.com/harlowja/fasteners/pull/10 adds this functionality 
> to the underlying library if people want to look it over)
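> 
> A minimal sketch of the idea (the path and the hashing scheme here are
> illustrative, not necessarily what the pull request implements):
> 
>     import fcntl
>     import zlib
> 
>     LOCK_FILE = '/var/lib/nova/locks'  # a single file for the project
>     SLOTS = 1024 * 1024                # 1MB file -> ~1M one-byte slots
> 
>     def acquire(f, name):
>         # Hash the lock name to a byte offset, lock just that one byte.
>         offset = zlib.crc32(name.encode('utf-8')) % SLOTS
>         fcntl.lockf(f, fcntl.LOCK_EX, 1, offset)  # len=1, start=offset
>         return offset
> 
>     f = open(LOCK_FILE, 'a+b')
>     offset = acquire(f, 'volume-123')
>     try:
>         pass  # manipulate the resource
>     finally:
>         fcntl.lockf(f, fcntl.LOCK_UN, 1, offset)
> 
> (Two names can hash to the same offset, which needlessly serializes
> unrelated work but remains safe.)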

This is really complicated, and basically just makes the directory of
lock files _look_ clean. But each offset still goes stale, and still
has to be cleaned up anyway.

Fasteners already has process locks that use fcntl/flock.

These locks provide enough to allow you to infer things about the owner
of the lock file. If there's no process still holding the exclusive lock
when you try to lock it, then YOU own it, and thus control the resource.

A cron job which tries to flock anything older than ${REASONABLE_TIME}
and deletes the files it manages to lock seems fine. Whatever process
was trying to interact with the resource is gone at that point.
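
Roughly, as a sketch (the lock directory and the age threshold are
placeholders, not anything a real deployment necessarily uses):

    import fcntl
    import os
    import time

    LOCK_DIR = '/var/lib/openstack/locks'  # hypothetical lock_path
    MAX_AGE = 24 * 3600                    # stand-in for ${REASONABLE_TIME}

    now = time.time()
    for name in os.listdir(LOCK_DIR):
        path = os.path.join(LOCK_DIR, name)
        if now - os.path.getmtime(path) < MAX_AGE:
            continue  # touched recently; assume it may still be wanted
        with open(path, 'a') as f:
            try:
                # Non-blocking: raises if a live process holds the lock.
                fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except IOError:
                continue  # still held; leave it alone
            os.unlink(path)  # we won the lock, nothing live cares now

(Only a sketch; a real job would also want to think about the race
between unlinking a path and another process opening the old file.)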

Now, anything that needs to safely manage a resource beyond the
lifetime of a live process will need to keep track of its own state and
be idempotent anyway. IMO this isn't something lock files alone solve
well. I believe
you're familiar with a library named taskflow that is supposed to help
write code that does this better ;). Even without taskflow, if you are
trying to do something exclusive without a single process that stays
alive, you need to do _something_ to keep track of state and restart
or revert that flow. That is a state management problem, not a locking
problem.
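
To sketch what I mean (the class and flow here are invented, not a real
nova/cinder flow):

    from taskflow import engines
    from taskflow import task
    from taskflow.patterns import linear_flow

    class ResizeVolume(task.Task):
        def execute(self):
            # Do the exclusive work; written so a re-run is harmless.
            pass

        def revert(self, **kwargs):
            # Undo partial work if the flow fails part way through.
            pass

    flow = linear_flow.Flow('resize').add(ResizeVolume())
    engines.run(flow)

The flow's state can be persisted, so a crashed run can be resumed or
reverted by a new process, instead of leaning on a lock file to tell
you what happened.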


