Open Stack

Wed Apr 29 23:08:36 UTC 2015

Hi,

I have been investigating a pretty serious Swift performance problem
for a while now. I have a single node Swift instance with 16 cores,
64GB memory and 8 MDs of 3TB each. I only write 256KB objects into
this Swift instance with high concurrency; 256 parallel object PUTs.
Also, I was sharding the objects equally across 32 containers.

On a completely clean system, we were getting ~375 object PUTs per
second. But this dropped quickly, and by the time we had 600GB of
data in Swift, the throughput was down to ~100 objects per second.

We used sysdig to get a trace of what's happening in the system and
found that the open system calls were taking far longer than they
should: several hundred milliseconds, sometimes even 1 second.

Investigating this further revealed a problem in the way Swift writes
the objects on XFS. Swift's object server creates a temp directory
under the mount point /srv/node/r0. It creates a file under this temp
directory first (say /srv/node/r0/tmp/tmpASDF) and eventually renames
this file to its final destination.

rename /srv/node/r0/tmp/tmpASDF ->
/srv/node/r0/objects/312/eef/deadbeef/33453453454323424.data.
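
For illustration, the write-then-rename pattern described above can be
sketched roughly as follows. This is not Swift's actual object server
code; the function name put_object and its arguments are made up for
this example.

```python
import os
import tempfile


def put_object(device_root, final_path, data):
    """Write data to a temp file under <device_root>/tmp, then rename it.

    Hypothetical helper for illustration only; not Swift's implementation.
    Returns the inode number of the final file.
    """
    tmp_dir = os.path.join(device_root, "tmp")
    os.makedirs(tmp_dir, exist_ok=True)
    # The inode is allocated here, while the file's parent is tmp_dir.
    fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.makedirs(os.path.dirname(final_path), exist_ok=True)
        # rename only updates directory entries; the inode stays the same.
        os.rename(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return os.stat(final_path).st_ino
```

The point to notice is that the inode is allocated at mkstemp time,
with "tmp" as the parent, and the later rename does not change it.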

XFS creates an inode in the same allocation group as its parent. So,
when the temp file tmpASDF is created, it goes in the same allocation
group as "tmp". When the rename happens, only the filesystem metadata
gets modified. The allocation groups of the inodes don't change.

Since all object PUTs start off in the tmp directory, all inodes get
created in the same allocation group. The B-tree used for keeping
track of these inodes in the allocation group grows bigger and bigger
as more files are written, and searching this tree for existence
checks or for creating new inodes becomes more and more expensive.
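
One way to check this on a live system is to look at inode numbers. As
far as I understand the XFS on-disk format, the allocation group number
is stored in the high bits of the inode number, above agblklog +
inopblog bits. Those two values come from the superblock (xfs_db can
print them), so in this sketch they are parameters you have to supply;
the function names are made up for the example.

```python
import os


def inode_allocation_group(ino, agblklog, inopblog):
    """Return the allocation group encoded in an XFS inode number.

    agblklog and inopblog must be read from the filesystem's superblock;
    they are assumptions supplied by the caller, not detected here.
    """
    return ino >> (agblklog + inopblog)


def allocation_group_of(path, agblklog, inopblog):
    """Allocation group of the inode behind path (XFS only)."""
    return inode_allocation_group(os.stat(path).st_ino, agblklog, inopblog)
```

If the theory holds, calling allocation_group_of on a sample of .data
files should return the same value as for /srv/node/r0/tmp.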

See this discussion [1] I had on the XFS mailing list where this issue
was brought to light, and this slightly older thread [2], which
describes an identical problem.

I validated this theory by periodically deleting the temp directory. I
observed that the throughput no longer dropped at the same rate as
earlier. Starting at ~375 obj/s, after 600GB data in Swift, I was
still getting ~340 obj/s.

Now, how do we fix this?

One option would be to make the temp directory somewhere deeper in the
filesystem rather than immediately under the mount point. E.g. create
one temp directory under each of the 3-byte hash directories. And use
the temp directory corresponding to the object's hash.
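
A minimal sketch of how such a per-hash temp directory could be
derived from the final object path, assuming the layout shown in the
rename example above. The helper name and the "tmp" directory name are
hypothetical, not anything Swift does today.

```python
import os


def per_suffix_tmp_dir(final_path, tmp_name="tmp"):
    """Return a temp directory next to the object's hash directory.

    Assumes final_path looks like
    <device>/objects/<partition>/<3-char dir>/<hash>/<timestamp>.data
    Hypothetical helper for illustration only.
    """
    hash_dir = os.path.dirname(final_path)
    suffix_dir = os.path.dirname(hash_dir)
    return os.path.join(suffix_dir, tmp_name)
```

For the example path above this gives /srv/node/r0/objects/312/eef/tmp,
so the temp file's inode would be allocated relative to a directory
close to its final location instead of the single top-level tmp.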

But it's unclear what other repercussions this will have. Will the
replicator start replicating this temp directory?

Another option is to actually delete the tmp directory periodically.
The problem is that we don't know when to do it. And whenever we
decide to do it, the temp directory may have some file in it, making
it impossible to delete the directory.
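
As a rough sketch of what a periodic cleanup attempt might look like:
try to remove the directory and simply give up if it is not empty. The
function name is made up, and this ignores the question of how writers
would recreate the directory afterwards.

```python
import errno
import os


def try_remove_tmp_dir(tmp_dir):
    """Try to remove an empty temp directory.

    Returns True if the directory is gone afterwards, False if it still
    holds in-flight temp files. Hypothetical helper for illustration.
    """
    try:
        os.rmdir(tmp_dir)
    except FileNotFoundError:
        return True  # already removed by someone else
    except OSError as err:
        # rmdir refuses to remove a directory that still has entries.
        if err.errno in (errno.ENOTEMPTY, errno.EEXIST):
            return False
        raise
    return True
```

With 256 parallel PUTs the directory is rarely empty, so this would
return False most of the time under load, which is exactly the problem
described above.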

Any other options?

Thanks in advance.
-Shri

[1] http://www.spinics.net/lists/xfs/msg32868.html
[2] http://xfs.9218.n7.nabble.com/Performance-degradation-over-time-td28514.html

[Openstack] [Swift] tmp directory causing Swift slowdown