[Openstack-operators] XFS documentation seems to conflict with recommendations in Swift

Jonathan Simms slyphon at gmail.com
Mon Oct 24 14:13:51 UTC 2011


Thanks all for the information! I'm going to use this advice as part
of the next round of hardware purchasing we're doing.


On Thu, Oct 13, 2011 at 6:11 PM, Gordon Irving <gordon.irving at sophos.com> wrote:
>
>
> If you are on a raid controller with a battery backup unit (BBU), then it's
> generally safe to disable barriers for journalling filesystems.  If you're
> doing soft raid, jbod or single disk arrays, or cheaped out and did not get
> a BBU, then you may want to enable barriers for filesystem consistency.
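>
> A minimal sketch of the two choices as mount options (the device and mount
> point are placeholders):
>
>   # BBU-backed raid controller: barriers off
>   mount -o nobarrier /dev/sdb1 /srv/node/sdb1
>
>   # soft raid / jbod / no BBU: barriers on (the xfs default)
>   mount -o barrier /dev/sdb1 /srv/node/sdb1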
>
>
>
> For raid cards with a BBU, set your io scheduler to noop and disable
> barriers.  The raid card does its own re-ordering of io operations, and the
> OS has an incomplete picture of the true drive geometry.  The raid card is
> emulating one disk geometry which could really be an array of 2 to 100+
> disks.  The OS simply cannot make good judgment calls on how best to
> schedule io to different parts of the disk, because it is built around the
> assumption of a single spinning disk.  The same applies to knowing whether a
> write has only reached non-persistent cache (ie the disk cache), has reached
> persistent cache (ie the battery-backed cache on the raid card), or has
> reached persistent storage (that array of disks).  This is a failure of the
> raid card <-> OS interface: there simply is not the richness to signal
> "this write is safe once it is on the platter or in persistent cache, but
> not while it only sits in the disk cache".
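>
> For example, the scheduler can be switched per block device at runtime
> (/dev/sdb here is a placeholder for the raid volume):
>
>   # show the available schedulers; the active one is in brackets
>   cat /sys/block/sdb/queue/scheduler
>   # hand ordering decisions over to the raid card
>   echo noop > /sys/block/sdb/queue/scheduler
>   # or set it for every device at boot with the kernel parameter elevator=noop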
>
>
>
> Enabling barriers effectively turns all writes into Write-Through
> operations, so the write goes straight to the disk platter and you get
> little performance benefit from the raid card (which hurts a lot in terms
> of lost iops).  If the BBU loses charge or fails, the raid controller
> downgrades to Write-Through (vs Write-Back) operation.
>
>
>
> BBU raid controllers disable the disks' own caches, as these are not safe
> in the event of power loss and do not provide any benefit over the raid
> card's cache.
>
>
>
> In the context of swift, hdfs and other highly replicated datastores, I run
> them in jbod or raid-0 with nobarrier, noatime, nodiratime and a filesystem
> aligned to the geometry of the underlying storage*, etc., to squeeze as
> much performance as possible out of the raw storage.  Let the application
> layer deal with redundancy of data across the network: if a machine or disk
> dies, so what, you have N other copies of that data elsewhere on the
> network, and only a bit of storage is lost.  Do consider how many nodes can
> be down at any time when operating these sorts of clusters.  Big boxen with
> lots of storage may seem attractive from a density perspective until you
> lose one and 25% of your storage capacity with it … many smaller baskets …
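>
> A sketch of an /etc/fstab entry with those options for one object disk (the
> label, mount point and logbufs value are illustrative):
>
>   LABEL=swift-d1  /srv/node/d1  xfs  noatime,nodiratime,nobarrier,logbufs=8  0 0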
>
>
>
> For network level data consistency, swift should have a data scrubber (a
> periodic process to read back replicated blocks and compare their
> checksums); I have not checked whether this is implemented or on the
> roadmap.  I would be very surprised if it were not part of swift.
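>
> (Swift's object auditor appears to play exactly this role; a hedged sketch
> of what its configuration might look like in object-server.conf, with the
> section and option names assumed from the sample config and the values
> purely illustrative:)
>
>   [object-auditor]
>   # background walk over the object files, verifying their stored checksums
>   files_per_second = 20
>   bytes_per_second = 10000000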
>
>
>
> * You can hint to the fs layer how to align block writes by giving it the
> array geometry: the stripe (chunk) size, which typically defaults to 64k
> for raid arrays, and the stripe width, which is the number of data-carrying
> disks in the array.
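>
> A sketch of passing that geometry to mkfs, assuming a raid-0 of 10 data
> disks with a 64k chunk (disk count, chunk size and device are placeholders):
>
>   # xfs: stripe unit (chunk size) and stripe width (number of data disks)
>   mkfs.xfs -d su=64k,sw=10 /dev/sdb
>   # ext4 takes the same geometry in filesystem blocks:
>   # 64k chunk / 4k block = stride 16; 16 * 10 data disks = stripe-width 160
>   mkfs.ext4 -E stride=16,stripe-width=160 /dev/sdb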
>
>
>
> From: openstack-operators-bounces at lists.openstack.org
> [mailto:openstack-operators-bounces at lists.openstack.org] On Behalf Of Cole
> Crawford
> Sent: 13 October 2011 13:51
> To: openstack-operators at lists.openstack.org
> Subject: Re: [Openstack-operators] XFS documentation seems to conflict with
> recommendations in Swift
>
>
>
> Generally, mounting with -o nobarrier is a bad idea (ext4 or xfs) unless
> you have disks that do not have write caches.  Don't follow that
> recommendation, or, for example, fsync won't work as expected, and fsync is
> something swift relies upon.
>
>
>
>
>
> On Thu, Oct 13, 2011 at 9:18 AM, Marcelo Martins <btorch-os at zeroaccess.org>
> wrote:
>
> Hi Jonathan,
>
>
>
>
>
> I guess that will depend on how your storage nodes are configured
> (hardware-wise).  The reason it's recommended is that the storage drives
> are actually attached to a controller that has its R/W cache enabled.
>
>
>
>
>
>
>
> Q. Should barriers be enabled with storage which has a persistent write
> cache?
>
> Many hardware RAID have a persistent write cache which preserves it across
> power failure, interface resets, system crashes, etc. Using write barriers
> in this instance is not recommended and will in fact lower performance.
> Therefore, it is recommended to turn off the barrier support and mount the
> filesystem with "nobarrier". But take care about the hard disk write cache,
> which should be off.
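>
> For directly attached drives, that cache can be checked and switched off
> with hdparm (/dev/sdb is a placeholder; disks hidden behind a raid
> controller usually need the vendor's own CLI instead):
>
>   # query the drive's volatile write cache
>   hdparm -W /dev/sdb
>   # disable it, as the FAQ advises when mounting with nobarrier
>   hdparm -W0 /dev/sdb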
>
>
>
>
>
> Marcelo Martins
>
> Openstack-swift
>
> btorch-os at zeroaccess.org
>
>
>
> “Knowledge is the wings on which our aspirations take flight and soar. When
> it comes to surfing and life if you know what to do you can do it. If you
> desire anything become educated about it and succeed. “
>
>
>
>
>
>
>
> On Oct 12, 2011, at 10:08 AM, Jonathan Simms wrote:
>
> Hello all,
>
> I'm in the middle of a 120T Swift deployment, and I've had some
> concerns about the backing filesystem. I formatted everything with
> ext4 with 1024b inodes (for storing xattrs), but the process took so
> long that I'm now looking at XFS again. In particular, this concerns
> me http://xfs.org/index.php/XFS_FAQ#Write_barrier_support.
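>
> For reference, a sketch of the two formatting choices being compared here
> (the device name is a placeholder; the 1024-byte inode size keeps xattrs
> inline in the inode):
>
>   # xfs with larger inodes, as the swift docs suggest
>   mkfs.xfs -i size=1024 /dev/sdb1
>   # the ext4 route described above, with 1024-byte inodes
>   mkfs.ext4 -I 1024 /dev/sdb1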
>
> In the swift documentation, it's recommended to mount the filesystems
> w/ 'nobarrier', but it would seem to me that this would leave the data
> open to corruption in the case of a crash. AFAIK, swift doesn't do
> checksumming (and checksum checking) of stored data (after it is
> written), which would mean that any data corruption would silently get
> passed back to the users.
>
> Now, I haven't had operational experience running XFS in production,
> I've mainly used ZFS, JFS, and ext{3,4}. Are there any recommendations
> for using XFS safely in production?
> _______________________________________________
> Openstack-operators mailing list
> Openstack-operators at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
>
>


More information about the Openstack-operators mailing list