[Openstack] [Swift] tmp directory causing Swift slowdown

John Dickinson me at not.mn
Thu Apr 30 18:33:13 UTC 2015


Great, thanks. Sounds like a pretty interesting performance improvement.

--John


> On Apr 30, 2015, at 11:27 AM, Shrinand Javadekar <shrinand at maginatics.com> wrote:
> 
> I was able to make the code change to create the tmp directory in the
> 3-byte hash directory and fix the unit tests to get this to work. I
> will file a bug to get a discussion started on this, in case there are
> people not following this thread.
> 
> On Wed, Apr 29, 2015 at 4:08 PM, Shrinand Javadekar
> <shrinand at maginatics.com> wrote:
>> Hi,
>> 
>> I have been investigating a pretty serious Swift performance problem
>> for a while now. I have a single node Swift instance with 16 cores,
>> 64GB memory and 8 MDs of 3TB each. I only write 256KB objects into
>> this Swift instance with high concurrency; 256 parallel object PUTs.
>> Also, I was sharding the objects equally across 32 containers.
>> 
>> On a completely clean system, we were getting ~375 object PUTs per
>> second. But this rate dropped quickly: by the time we had 600GB of
>> data in Swift, throughput was down to ~100 objects per second.
>> 
>> We used sysdig to trace what was happening in the system and found
>> that open() system calls were taking far longer than expected: several
>> hundred milliseconds, sometimes even a full second.
>> 
>> Investigating further revealed a problem in the way Swift writes
>> objects on XFS. Swift's object server creates a temp directory
>> under the mount point /srv/node/r0. It first creates a file under this
>> temp directory (say /srv/node/r0/tmp/tmpASDF) and eventually renames
>> this file to its final destination:
>> 
>> rename /srv/node/r0/tmp/tmpASDF ->
>> /srv/node/r0/objects/312/eef/deadbeef/33453453454323424.data.
>> 
>> XFS creates an inode in the same allocation group as its parent. So,
>> when the temp file tmpASDF is created, it goes into the same allocation
>> group as "tmp". When the rename happens, only the filesystem metadata
>> gets modified; the allocation groups of the inodes don't change.
>> 
>> Since all object PUTs start off in the tmp directory, all inodes get
>> created in the same allocation group. The B-tree used for keeping
>> track of these inodes in the allocation group grows bigger and bigger
>> as more files are written and parsing this tree for existence checks
>> or for creating new inodes becomes more and more expensive.
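As an illustration of why "same allocation group" follows from the inode numbers: XFS encodes the allocation group in the high bits of the inode number, roughly agno = ino >> (agblklog + inopblog). The geometry constants below are made-up example values, not read from a real filesystem (xfs_info or xfs_db would report the real ones).

```python
# Hypothetical XFS geometry for illustration only:
AGBLKLOG = 22   # log2 of blocks per allocation group (assumed)
INOPBLOG = 3    # log2 of inodes per block: 4096B blocks / 512B inodes (assumed)

def agno_of(ino):
    """Allocation group number encoded in the high bits of an XFS inode number."""
    return ino >> (AGBLKLOG + INOPBLOG)

# Two inodes allocated in AG 3 share the same high bits, i.e. the same AG.
a = (3 << (AGBLKLOG + INOPBLOG)) | 12345
b = (3 << (AGBLKLOG + INOPBLOG)) | 67890
assert agno_of(a) == agno_of(b) == 3
```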
>> 
>> See the discussion [1] I had on the XFS mailing list, where this issue
>> was brought to light, and this older thread where the problem was
>> identical [2].
>> 
>> I validated this theory by periodically deleting the temp directory and
>> observed that the objects-per-second rate no longer dropped at the same
>> pace as before: starting at ~375 obj/s, after 600GB of data in Swift, I
>> was still getting ~340 obj/s.
>> 
>> Now, how do we fix this?
>> 
>> One option would be to make the temp directory somewhere deeper in the
>> filesystem rather than immediately under the mount point. E.g. create
>> one temp directory under each of the 3-byte hash directories. And use
>> the temp directory corresponding to the object's hash.
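A sketch of that proposed layout (the paths and names are hypothetical, not Swift's actual datadir helpers): the tmp directory lives inside the 3-byte suffix directory, so the temp file's inode is allocated near its final destination and the rename stays within the same subtree.

```python
import os
import tempfile

# Stand-in for /srv/node/r0 (hypothetical layout for illustration).
root = tempfile.mkdtemp()
suffix_dir = os.path.join(root, "objects", "312", "eef")

# Proposed change: a per-suffix tmp directory instead of one shared
# tmp directory under the mount point.
tmp_dir = os.path.join(suffix_dir, "tmp")
os.makedirs(tmp_dir)

fd, tmp_path = tempfile.mkstemp(dir=tmp_dir)
os.write(fd, b"object data")
os.close(fd)

# The rename now stays within the same suffix subtree, so the inode was
# allocated in (roughly) the same region as its neighbors.
final_dir = os.path.join(suffix_dir, "deadbeef")
os.makedirs(final_dir)
final_path = os.path.join(final_dir, "33453453454323424.data")
os.rename(tmp_path, final_path)
```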
>> 
>> But it's unclear what other repercussions this would have. Will the
>> replicator start replicating these temp directories?
>> 
>> Another option is to actually delete the tmp directory periodically.
>> The problem is that we don't know when; whenever we decide to do it,
>> the temp directory may have files in it, making it impossible to
>> delete.
>> 
>> Any other options?
>> 
>> Thanks in advance.
>> -Shri
>> 
>> [1] http://www.spinics.net/lists/xfs/msg32868.html
>> [2] http://xfs.9218.n7.nabble.com/Performance-degradation-over-time-td28514.html
> 
> _______________________________________________
> Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
> Post to     : openstack at lists.openstack.org
> Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
