[Openstack] [Swift] tmp directory causing Swift slowdown

Shrinand Javadekar shrinand at maginatics.com
Thu Apr 30 18:27:14 UTC 2015


I was able to make the code change to create the tmp directory in the
3-byte hash directory and fix the unit tests to get this to work. I
will file a bug to get a discussion started on this, in case there are
people not following this thread.
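
A minimal sketch of the idea, not the actual change to Swift's object
server. It assumes Python 3 and the on-disk layout from the example path
quoted below (objects/<partition>/<last 3 hash characters>/<hash>/). The
function names suffix_tmp_dir and put_object are hypothetical. The temp
file is created in a tmp directory next to the object's hash directory,
so its inode is allocated near its final parent rather than under the
single top-level tmp directory.

```python
import os
import tempfile


def suffix_tmp_dir(device_path, partition, obj_hash):
    """Return a tmp directory under the object's 3-character hash directory.

    Assumption: the directory name is the last three characters of the
    object hash, as in .../objects/312/eef/deadbeef/ from the example.
    """
    suffix = obj_hash[-3:]
    return os.path.join(device_path, "objects", str(partition), suffix, "tmp")


def put_object(device_path, partition, obj_hash, timestamp, data):
    """Write data to a temp file, then rename it to its final .data path."""
    tmp_dir = suffix_tmp_dir(device_path, partition, obj_hash)
    final_dir = os.path.join(os.path.dirname(tmp_dir), obj_hash)
    os.makedirs(tmp_dir, exist_ok=True)
    os.makedirs(final_dir, exist_ok=True)

    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make the data durable before the rename
        final_path = os.path.join(final_dir, timestamp + ".data")
        # Same filesystem, so this only updates directory metadata.
        os.rename(tmp_path, final_path)
    except BaseException:
        # Do not leave stray temp files behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

Whether the replicator or auditor would need to skip these per-directory
tmp directories is the open question raised in the original message; the
sketch does not address it.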

On Wed, Apr 29, 2015 at 4:08 PM, Shrinand Javadekar
<shrinand at maginatics.com> wrote:
> Hi,
>
> I have been investigating a pretty serious Swift performance problem
> for a while now. I have a single node Swift instance with 16 cores,
> 64GB memory and 8 MDs of 3TB each. I only write 256KB objects into
> this Swift instance with high concurrency; 256 parallel object PUTs.
> Also, I was sharding the objects equally across 32 containers.
>
> On a completely clean system, we were getting ~375 object PUTs per
> second. But this dropped quickly, and by the time we had 600GB of
> data in Swift, the throughput was ~100 objects per second.
>
> We used sysdig to get a trace of what's happening in the system and
> found that the open() system calls were taking far longer than
> expected: several hundred milliseconds, sometimes even a full second.
>
> Investigating this further revealed a problem in the way Swift writes
> the objects on XFS. Swift's object server creates a temp directory
> under the mount point /srv/node/r0. It creates a file under this temp
> directory first (say /srv/node/r0/tmp/tmpASDF) and eventually renames
> this file to its final destination.
>
> rename /srv/node/r0/tmp/tmpASDF ->
> /srv/node/r0/objects/312/eef/deadbeef/33453453454323424.data.
>
> XFS creates an inode in the same allocation group as its parent. So,
> when the temp file tmpASDF is created, it goes in the same allocation
> group as "tmp". When the rename happens, only the filesystem metadata
> gets modified. The allocation groups of the inodes don't change.
>
> Since all object PUTs start off in the tmp directory, all inodes get
> created in the same allocation group. The B-tree used for keeping
> track of these inodes in the allocation group grows bigger and bigger
> as more files are written, and searching this tree for existence
> checks or for creating new inodes becomes more and more expensive.
>
> See this discussion [1] I had on the XFS mailing list where this issue
> was brought to light. And this other slightly old thread where the
> problem was identical [2].
>
> I validated this theory by periodically deleting the temp directory.
> With that in place, throughput degraded far more slowly than before:
> starting at ~375 obj/s, after 600GB of data in Swift, I was still
> getting ~340 obj/s.
>
> Now, how do we fix this?
>
> One option would be to make the temp directory somewhere deeper in the
> filesystem rather than immediately under the mount point. E.g. create
> one temp directory under each of the 3-byte hash directories. And use
> the temp directory corresponding to the object's hash.
>
> But it's unclear what other repercussions this will have. Will the
> replicator start replicating this temp directory?
>
> Another option is to actually delete the tmp directory periodically.
> Problem is that we don't know when. And whenever we decide to do it,
> the temp directory may have some file in it making it impossible to
> delete the directory.
>
> Any other options?
>
> Thanks in advance.
> -Shri
>
> [1] http://www.spinics.net/lists/xfs/msg32868.html
> [2] http://xfs.9218.n7.nabble.com/Performance-degradation-over-time-td28514.html
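
One way to check the allocation group theory on a live system is to map
inode numbers to allocation groups. In XFS the allocation group number
sits in the high bits of the inode number, above agblklog + inopblog
bits. The sketch below assumes those two values have been read from the
filesystem's superblock (for example with xfs_db); the helper names are
hypothetical, and the values used in the usage note are illustrative
only.

```python
import os


def inode_to_agno(ino, agblklog, inopblog):
    """Return the XFS allocation group number encoded in an inode number.

    agblklog and inopblog are superblock fields; the caller must supply
    the values for the filesystem in question.
    """
    return ino >> (agblklog + inopblog)


def agno_of_path(path, agblklog, inopblog):
    """Return the allocation group of the inode backing the given path."""
    return inode_to_agno(os.stat(path).st_ino, agblklog, inopblog)
```

Calling agno_of_path() on /srv/node/r0/tmp and on a sample of .data
files would show whether the object inodes share the allocation group of
"tmp". For instance, with illustrative values agblklog=22 and
inopblog=4, inode_to_agno((3 << 26) | 12345, 22, 4) returns 3.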



