[openstack-dev] [Nova] sponsor some LVM development

James Bottomley James.Bottomley at HansenPartnership.com
Fri Jan 22 16:43:28 UTC 2016


On Fri, 2016-01-22 at 14:58 +0100, Premysl Kouril wrote:
> Hi Matt, James,
> 
> any thoughts on the below notes?

To be honest, not really.  You've repeated stage two of the Oracle
argument: wheel out benchmarks and attack alleged "complexity".  I
don't really have a great interest in repeating a historical argument. 
 Oracle didn't get it either until they released the feature and ran
into the huge management complexity of raw devices in the field, so if
you have the resources to repeat the experiment and see if you get
different results, be my guest.

The lesson I took from the Oracle affair all those years ago is that
it's far harder to replace well understood and functional file
interfaces with new ones (mainly because of the tooling and historical
understanding that comes with the old ones) than it is to gain
performance in existing interfaces.

The 3x difference in the benchmarks would seem to indicate a local
tuning or configuration problem, because it's not what most people see.
 What the current benchmarks seem to show is about a 1-5% difference
between the direct I/O and direct-to-block paths, depending on fstype,
how it's tuned, the I/O scheduler and the underlying device (a
reproducible comparison is sketched below).
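
For anyone who wants to rerun that comparison, a pair of fio jobs along
these lines (the file path and LV name are placeholders) exercises the
two paths with an otherwise identical workload:

  # direct I/O through the filesystem
  fio --name=file-directio --filename=/mnt/ext4/testfile --size=20G \
      --ioengine=libaio --direct=1 --bs=4k --iodepth=256 --readwrite=randwrite
  # the same workload straight to a block device (destroys data on the LV)
  fio --name=raw-block --filename=/dev/vg0/testlv \
      --ioengine=libaio --direct=1 --bs=4k --iodepth=256 --readwrite=randwrite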

James

> Best Regards,
> Prema
> On 19 Jan 2016 20:47, "Premysl Kouril" <premysl.kouril at gmail.com>
> wrote:
> 
> > Hi James,
> > 
> > 
> > > 
> > > You still haven't answered Anita's question: when you say
> > > "sponsor", do you mean provide resources to existing developers to
> > > work on your feature, or provide new developers?
> > > 
> > 
> > I did; I am copy-pasting my response to Anita here again:
> > 
> > Both. We are first trying the former ("Are you asking for current
> > Nova developers to work on this feature?") and if we don't find
> > anybody we will move on to asking whether your company is interested
> > in having your developers interact with Nova developers.
> > 
> > 
> > > 
> > > Heh, this is history repeating itself from over a decade ago when
> > > Oracle would have confidently told you that Linux had to have raw
> > > devices because that's the only way a database will perform. 
> > >  Fast
> > > forward to today and all Oracle databases use file backends.
> > > 
> > > Simplicity is also in the eye of the beholder.  LVM has a very
> > > simple
> > > naming structure whereas filesystems have complex hierarchical
> > > ones.
> > >  Once you start trying to scale to millions of instances, you'll
> > > find
> > > there's quite a management penalty for the LVM simplicity.
> > 
> > We definitely won't have millions of instances on our hypervisors,
> > but we can certainly have applications demanding a million IOPS (in
> > sum) from a hypervisor in the near future.
> > 
> > > 
> > > >  It seems from our benchmarks that LVM behavior when
> > > > processing many IOPS (tens of thousands) is more stable than if
> > > > a filesystem is used as the backend.
> > > 
> > > It sounds like you haven't enabled directio here ... that was the
> > > solution to the Oracle issue.
> > 
> > 
> > If you mean O_DIRECT mode, then yes, we had that enabled during our
> > benchmarks.
> > Here is our benchmark setup and results:
> > 
> > testing box configuration:
> > 
> >   CPU: 4x E7-8867 v3 (total of 64 physical cores)
> >   RAM: 1TB
> >   Storage: 12x enterprise-class SSD disks (each disk 140,000/120,000
> > IOPS read/write), connected via 12Gb/s SAS3 lanes
> > 
> >   So we are using big boxes which can run quite a lot of VMs.
> > 
> >   Out of the disks we create a Linux md RAID array (we tested RAID5
> > and RAID10) and do some fine-tuning (the steps are consolidated as a
> > shell snippet after this list):
> > 
> > 1) echo 8 > /sys/block/md127/md/group_thread_cnt - this increases
> > parallelism for RAID5
> > 2) we boot the kernel with scsi_mod.use_blk_mq=Y to activate block
> > I/O multi-queueing
> > 3) we increase the size of the cache (for RAID5)
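> > 
> > For reference, the tuning above amounts to roughly the following
> > shell session (the md device name and the stripe cache value below
> > are illustrative assumptions, not exact production settings):
> > 
> >   # 1) more parallelism for RAID5 writes
> >   echo 8 > /sys/block/md127/md/group_thread_cnt
> >   # 2) enable SCSI multi-queue; set on the kernel command line at boot:
> >   #      scsi_mod.use_blk_mq=Y
> >   # 3) enlarge the RAID5 stripe cache (the value is an assumption)
> >   echo 8192 > /sys/block/md127/md/stripe_cache_size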
> > 
> >  On that RAID array we create either an LVM volume group or a
> > filesystem, depending on whether we are testing the LVM Nova backend
> > or the file-based Nova backend.
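> > 
> > (For concreteness, selecting between the two backends is a nova.conf
> > setting on the compute node; the volume group name below is a
> > placeholder:
> > 
> >   [libvirt]
> >   images_type = lvm
> >   images_volume_group = nova-vg
> > 
> > versus leaving images_type at its file-based default.)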
> > 
> > 
> > On this hypervisor we run Nova/KVM, provision 10-20 VMs, and run
> > benchmark tests from inside these VMs, trying to saturate the I/O on
> > the hypervisor.
> > 
> > We run the following command inside the VMs:
> > 
> > fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
> > --name=test1 --bs=4k --iodepth=256 --size=20G --numjobs=1
> > --readwrite=randwrite
> > 
> > So you can see that in the guest OS we use --direct=1, which causes
> > the test file to be opened with O_DIRECT. Actually, I am now not
> > sure, but when using the file-based backend I would hope that the
> > virtual disk itself is also opened with O_DIRECT on the host, and
> > that libvirt/qemu does this by default without any explicit
> > configuration (one way to check is sketched below).
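> > 
> > One way to check is to look at the cache mode libvirt configured for
> > the disk (the domain name below is a placeholder; as far as I know,
> > cache='none' is what makes QEMU open the image with O_DIRECT, while
> > the default writeback mode goes through the host page cache):
> > 
> >   virsh dumpxml myvm | grep "driver name"
> >   # e.g. <driver name='qemu' type='raw' cache='none'/>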
> > 
> > Anyway, with this we have following results:
> > 
> > If we use the file-based backend in Nova, an ext4 filesystem and
> > RAID5, then across 8 parallel VMs we were able to achieve ~3000 IOPS
> > per machine, which means about 32000 IOPS in total.
> > 
> > If we use the LVM-based backend, RAID5 and 8 parallel VMs, we
> > achieve ~11000 IOPS per machine, about 90000 IOPS in total.
> > 
> > This is a significant difference.
> > 
> > This test was done about half a year ago by one of our engineers who
> > no longer works for us, but we still have the box and everything, so
> > if the community is interested I can re-run the tests, validate the
> > results again, do any reconfiguration, etc.
> > 
> > 
> > 
> > > And this was precisely the Oracle argument.  The reason it
> > > foundered is
> > > that most FS complexity goes to manage the data structures ...
> > > the I/O
> > > path can still be made short and fast, as DirectIO demonstrates. 
> > >  Then
> > > the management penalty you pay (having to manage all the data
> > > structures that the filesystem would have managed for you) starts
> > > to
> > > outweigh any minor performance advantages.
> > 
> > The only thing O_DIRECT does is instruct the kernel to skip the
> > filesystem cache for the file opened in this mode; the rest of the
> > filesystem complexity remains in the I/O datapath (a quick way to see
> > the cache-skipping part in isolation is sketched below). Note, for
> > example, that we did a test of the file-based backend with BTRFS and
> > the results were absolutely horrible: there is just too much work the
> > filesystem has to do when processing I/Os, and we believe a lot of it
> > is simply not necessary when the storage is only used to store
> > virtual disks.
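> > 
> > As a quick illustration of the cache-skipping part (the path is a
> > placeholder; dd's oflag=direct opens the output file with O_DIRECT):
> > 
> >   # bypasses the page cache, but still goes through the filesystem's
> >   # extent and journal machinery
> >   dd if=/dev/zero of=/mnt/ext4/testfile bs=1M count=100 oflag=direct
> >   # the same write through the page cache
> >   dd if=/dev/zero of=/mnt/ext4/testfile bs=1M count=100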
> > 
> > Anyway, I am really glad that you brought up these views. We are
> > happy to reconsider our decisions, so let's have a discussion; I am
> > sure we missed many things when we were evaluating both backends.
> > 
> > One more question: what about Cinder? I think they are using LVM for
> > storing volumes, right? Why don't they use files?
> > 
> > Thanks,
> > Prema
> > 



