[Openstack-operators] XFS documentation seems to conflict with recommendations in Swift

Gordon Irving gordon.irving at sophos.com
Thu Oct 13 22:11:59 UTC 2011

If you are on a RAID controller with a Battery Backed Unit (BBU), then it's generally safe to disable barriers for journalled filesystems.  If you're doing soft RAID, JBOD, single-disk arrays, or cheaped out and did not get a BBU, then you may want to enable barriers for filesystem consistency.

For RAID cards with a BBU, set your I/O scheduler to noop and disable barriers.  The RAID card does its own re-ordering of I/O operations; the OS has an incomplete picture of the true drive geometry.  The RAID card emulates a single disk geometry which could actually be an array of 2 to 100+ disks.  The OS simply cannot make good judgment calls on how best to schedule I/O to different parts of the disk, because its scheduler is built around the assumption of a single spinning disk.  The same applies to knowing whether a write has made it only to a non-persistent cache (i.e. the disk cache), to a persistent cache (i.e. the battery-backed RAID card cache), or to persistent storage (that array of disks).  This is a failure of the RAID card <-> OS interface: there simply is not the richness to signal "the write is OK if it is on the platter or in persistent cache, but not OK if it is only in the disk cache".
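A minimal sketch of applying the scheduler change on a Linux host (the device name sda is an assumption; substitute the block device your RAID card exposes):

```shell
#!/bin/sh
# Sketch: switch the I/O scheduler to noop for a device behind a BBU RAID card.
# DEV is an assumption -- substitute the block device your RAID card presents.
DEV=sda
SYSFS_PATH="/sys/block/$DEV/queue/scheduler"

# On a real host you would run, as root:
#   echo noop > /sys/block/sda/queue/scheduler
# Here we only build and print the command so the sketch is harmless to run.
CMD="echo noop > $SYSFS_PATH"
echo "$CMD"
```

The current scheduler can be inspected with `cat /sys/block/sda/queue/scheduler`; the active one is shown in square brackets.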

Enabling barriers effectively turns all writes into Write-Through operations, so the write goes straight to the disk platter and you get little performance benefit from the RAID card (which hurts a lot in terms of lost IOPS).  If the BBU loses charge or fails, the RAID controller downgrades to Write-Through (vs Write-Back) operation anyway.

BBU RAID controllers disable the disks' own caches, as these are not safe in the event of power loss and do not provide any benefit over the RAID card cache.
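Where a controller does not do this automatically, the on-disk write cache can be turned off by hand with hdparm's -W flag; a sketch (the device name is an assumption, and hdparm only works on disks the OS can address directly):

```shell
#!/bin/sh
# Sketch: disable the on-disk write cache for a drive.
# DEV is an assumption; a BBU RAID card usually disables disk caches itself.
DEV=sda

# On a real host you would run, as root:
#   hdparm -W0 /dev/sda
# Here we only build and print the command so the sketch is harmless to run.
CMD="hdparm -W0 /dev/$DEV"
echo "$CMD"
```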

In the context of Swift, HDFS, and other highly replicated datastores, I run them in JBOD or RAID-0 with nobarrier, noatime, nodiratime, and a filesystem aligned to the geometry of the underlying storage* etc. to squeeze as much performance as possible out of the raw storage.  Let the application layer deal with redundancy of data across the network; if a machine or disk dies ... so what, you have N other copies of that data elsewhere on the network, and only a bit of storage is lost.  Do consider how many nodes can be down at any time when operating these sorts of clusters: big boxen with lots of storage may seem attractive from a density perspective until you lose one, and 25% of your storage capacity with it ... many smaller baskets ...
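The mount options above can be sketched as follows (the device and the Swift-style mount point are assumptions; adjust for your layout):

```shell
#!/bin/sh
# Sketch: mount options for an object-store data disk behind a BBU RAID card.
# DEV and MNT are assumptions -- substitute your device and mount point.
DEV=/dev/sdb1
MNT=/srv/node/sdb1
OPTS="noatime,nodiratime,nobarrier"

# On a real host you would run, as root:
#   mount -o noatime,nodiratime,nobarrier /dev/sdb1 /srv/node/sdb1
# Here we only build and print the command so the sketch is harmless to run.
CMD="mount -o $OPTS $DEV $MNT"
echo "$CMD"
```

The same options belong in /etc/fstab so they survive a reboot.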

For network-level data consistency, Swift should have a data scrubber (a periodic process that reads replicated blocks and compares their checksums).  I have not checked whether this is implemented or on the roadmap, but I would be very surprised if it were not part of Swift.
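The core of such a scrubber is just "checksum each replica, compare, repair on mismatch"; a self-contained sketch using throwaway temp files (not Swift's real on-disk layout or tooling):

```shell
#!/bin/sh
# Sketch of a scrubber's core check: compare checksums of two replicas of the
# same object.  The files here are throwaway temp files, not Swift internals.
REPLICA_A=$(mktemp)
REPLICA_B=$(mktemp)
printf 'object payload' > "$REPLICA_A"
printf 'object payload' > "$REPLICA_B"

SUM_A=$(md5sum "$REPLICA_A" | awk '{print $1}')
SUM_B=$(md5sum "$REPLICA_B" | awk '{print $1}')

if [ "$SUM_A" = "$SUM_B" ]; then
    RESULT=OK
else
    RESULT=MISMATCH   # a real scrubber would re-replicate from a known-good copy
fi
echo "$RESULT"
rm -f "$REPLICA_A" "$REPLICA_B"
```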

*You can hint to the filesystem layer how to align block writes by specifying a stride (the RAID chunk size divided by the filesystem block size) and a stripe width (the stride multiplied by the number of data-carrying disks in the array); the chunk size typically defaults to 64k for RAID arrays.
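Worked through with assumed example values (64k chunk, 4k filesystem blocks, 8 data disks), the arithmetic and the resulting ext4 format command look like:

```shell
#!/bin/sh
# Sketch: compute ext4 stride and stripe-width for a RAID array.
# The values below are assumptions for illustration, not a recommendation.
CHUNK_KB=64       # RAID chunk size per disk
BLOCK_KB=4        # filesystem block size
DATA_DISKS=8      # data-carrying disks (excludes parity disks)

STRIDE=$((CHUNK_KB / BLOCK_KB))          # filesystem blocks per RAID chunk
STRIPE_WIDTH=$((STRIDE * DATA_DISKS))    # filesystem blocks per full stripe

# On a real host you would format with, as root:
#   mkfs.ext4 -E stride=16,stripe-width=128 /dev/sdb1
echo "mkfs.ext4 -E stride=$STRIDE,stripe-width=$STRIPE_WIDTH /dev/sdb1"
```

XFS takes the equivalent hints as su/sw options to mkfs.xfs.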

From: openstack-operators-bounces at lists.openstack.org [mailto:openstack-operators-bounces at lists.openstack.org] On Behalf Of Cole Crawford
Sent: 13 October 2011 13:51
To: openstack-operators at lists.openstack.org
Subject: Re: [Openstack-operators] XFS documentation seems to conflict with recommendations in Swift

Generally, mounting with -o nobarrier is a bad idea (ext4 or XFS) unless you have disks that do not have write caches.  Don't follow that recommendation; for example, fsync won't work reliably, which is something Swift relies upon.

On Thu, Oct 13, 2011 at 9:18 AM, Marcelo Martins <btorch-os at zeroaccess.org<mailto:btorch-os at zeroaccess.org>> wrote:
Hi Jonathan,

I guess that will depend on how your storage nodes are configured (hardware-wise).  The reason it's recommended is that the storage drives are actually attached to a controller that has its write cache enabled.

Q. Should barriers be enabled with storage which has a persistent write cache?
Many hardware RAID controllers have a persistent write cache which is preserved across power failure, interface resets, system crashes, etc. Using write barriers in this instance is not recommended and will in fact lower performance. Therefore, it is recommended to turn off barrier support and mount the filesystem with "nobarrier". But take care that the hard disk write cache is off.

Marcelo Martins
btorch-os at zeroaccess.org<mailto:btorch-os at zeroaccess.org>

"Knowledge is the wings on which our aspirations take flight and soar. When it comes to surfing and life if you know what to do you can do it. If you desire anything become educated about it and succeed. "

On Oct 12, 2011, at 10:08 AM, Jonathan Simms wrote:

Hello all,

I'm in the middle of a 120T Swift deployment, and I've had some
concerns about the backing filesystem. I formatted everything with
ext4 with 1024b inodes (for storing xattrs), but the process took so
long that I'm now looking at XFS again. In particular, this concerns
me http://xfs.org/index.php/XFS_FAQ#Write_barrier_support.

In the swift documentation, it's recommended to mount the filesystems
w/ 'nobarrier', but it would seem to me that this would leave the data
open to corruption in the case of a crash. AFAIK, swift doesn't do
checksumming (and checksum checking) of stored data (after it is
written), which would mean that any data corruption would silently get
passed back to the users.

Now, I haven't had operational experience running XFS in production,
I've mainly used ZFS, JFS, and ext{3,4}. Are there any recommendations
for using XFS safely in production?
Openstack-operators mailing list
Openstack-operators at lists.openstack.org<mailto:Openstack-operators at lists.openstack.org>


