[openstack-dev] [Nova] sponsor some LVM development

Premysl Kouril premysl.kouril at gmail.com
Fri Jan 22 13:58:35 UTC 2016

Hi Matt, James,

Any thoughts on the notes below?

Best Regards,
On 19 Jan 2016 20:47, "Premysl Kouril" <premysl.kouril at gmail.com> wrote:

> Hi James,
> >
> > You still haven't answered Anita's question: when you say "sponsor" do
> > you mean provide resources to existing developers to work on your
> > feature or provide new developers.
> >
> I did; I am copy-pasting my response to Anita here again:
> Both. We are first trying the former ("Are you asking for current Nova
> developers to work on this feature?") and if we don't find anybody we
> will move on to the latter ("is your company interested in having your
> developers interact with Nova developers?")
> >
> > Heh, this is history repeating itself from over a decade ago when
> > Oracle would have confidently told you that Linux had to have raw
> > devices because that's the only way a database will perform.  Fast
> > forward to today and all oracle databases use file backends.
> >
> > Simplicity is also in the eye of the beholder.  LVM has a very simple
> > naming structure whereas filesystems have complex hierarchical ones.
> >  Once you start trying to scale to millions of instances, you'll find
> > there's quite a management penalty for the LVM simplicity.
> We definitely won't have millions of instances on our hypervisors, but
> we can certainly have applications demanding a million IOPS (in sum)
> from a hypervisor in the near future.
> >
> >>  It seems from our benchmarks that LVM behavior when
> >> processing many IOPs (10s of thousands) is more stable than if
> >> filesystem is used as backend.
> >
> > It sounds like you haven't enabled directio here ... that was the
> > solution to the oracle issue.
> If you mean O_DIRECT mode, then we had that enabled during our benchmarks.
> Here is our benchmark setup and results:
> testing box configuration:
>   CPU: 4x E7-8867 v3 (total of 64 physical cores)
>   RAM: 1TB
>   Storage: 12x enterprise-class SSD disks (each disk 140 000/120 000
> IOPS read/write)
>                 disks connected via 12Gb/s SAS3 lanes
>   So we are using big boxes which can run quite a lot of VMs.
>   Out of the disks we create a Linux md RAID (we did RAID5 and RAID10)
> and do some fine tuning:
> 1) echo 8 > /sys/block/md127/md/group_thread_cnt - this increases
> parallelism for RAID5
> 2) we boot the kernel with scsi_mod.use_blk_mq=Y to activate block I/O
> multi-queueing
> 3) we increase the size of the cache (for RAID5)
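> The tuning steps above, as a consolidated sketch (md127 is the device
> name from our setup; the exact knob for step 3 and its value are my
> assumption, so adjust for your system):

```shell
# Sketch of the md RAID tuning above; the device name md127 and the
# values are from our setup / illustrative - adjust for your system.

# 1) more worker threads for RAID5 parity work
echo 8 > /sys/block/md127/md/group_thread_cnt

# 2) block-layer multi-queueing for SCSI: a kernel boot parameter,
#    appended to the kernel command line in the bootloader config:
#    scsi_mod.use_blk_mq=Y

# 3) enlarge the RAID5 stripe cache (assumed knob; value illustrative)
echo 8192 > /sys/block/md127/md/stripe_cache_size
```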
>  On that RAID we either create an LVM volume group or a filesystem,
> depending on whether we are testing the LVM Nova backend or the
> file-based Nova backend.
> On this hypervisor we run nova/kvm, provision 10-20 VMs, run the
> benchmark from these VMs, and try to saturate the I/O on the
> hypervisor.
> We use following command running inside the VMs:
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
> --name=test1 --bs=4k --iodepth=256 --size=20G --numjobs=1
> --readwrite=randwrite
> So you can see that in the guest OS we use --direct=1, which causes the
> test file to be opened with O_DIRECT. Actually, I am not sure, but
> when using the file-based backend I would hope that the virtual disk
> is automatically opened with O_DIRECT by libvirt/qemu by default,
> without any explicit configuration.
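> (For reference: with the libvirt driver, qemu opens a file-backed disk
> with O_DIRECT when the disk cache mode is "none"; in Nova this is
> controlled by the disk_cachemodes option. A sketch, not a complete
> config:)

```ini
# nova.conf sketch: request cache=none (host page cache bypassed,
# i.e. O_DIRECT) for file-backed disks.
[libvirt]
disk_cachemodes = file=none
```

> which should result in guest disk XML along the lines of
> <driver name='qemu' type='qcow2' cache='none'/>.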
> Anyway, with this we have following results:
> If we use the file-based backend in Nova, an ext4 filesystem and RAID5,
> then with 8 parallel VMs we were able to achieve ~3000 IOPS per machine,
> which means in total about 32000 IOPS.
> If we use the LVM-based backend, RAID5 and 8 parallel VMs, we achieve
> ~11000 IOPS per machine, in total about 90000 IOPS.
> This is a significant difference.
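> For anyone reproducing the aggregation, summing the per-VM fio numbers
> can be done with a small script like this (a sketch; the JSON shape
> follows fio's --output-format=json, and the inline sample data is
> hypothetical):

```python
import json

def total_write_iops(result_blobs):
    """Sum write IOPS across per-VM fio JSON results (--output-format=json)."""
    total = 0.0
    for blob in result_blobs:
        data = json.loads(blob)
        # each fio run has a "jobs" list; each job reports write stats
        for job in data["jobs"]:
            total += job["write"]["iops"]
    return total

# Hypothetical per-VM results shaped like fio's JSON output:
vm_results = [json.dumps({"jobs": [{"write": {"iops": 11000.0}}]})
              for _ in range(8)]

print(total_write_iops(vm_results))  # -> 88000.0
```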
> This test was done about half a year ago by one of our engineers who
> no longer works for us, but we still have the box and everything, so
> if the community is interested I can re-run the tests, validate the
> results again, do any reconfiguration, etc.
> > And this was precisely the Oracle argument.  The reason it foundered is
> > that most FS complexity goes to manage the data structures ... the I/O
> > path can still be made short and fast, as DirectIO demonstrates.  Then
> > the management penalty you pay (having to manage all the data
> > structures that the filesystem would have managed for you) starts to
> > outweigh any minor performance advantages.
> The only thing O_DIRECT does is instruct the kernel to skip the
> page cache for the file opened in this mode. The rest of the
> filesystem complexity remains in the I/O datapath. Note, for example,
> that we did a test of the file-based backend with BTRFS - the results
> were absolutely horrible - there is just too much work the filesystem
> has to do when processing I/Os, and we believe a lot of it is simply
> unnecessary when the storage is only used to store virtual disks.
> Anyway, I am really glad that you brought up these views; we are happy
> to reconsider our decisions, so let's have a discussion - I am sure we
> missed many things when we were evaluating both backends.
> One more question: what about Cinder? I think they are using LVM
> for storing volumes, right? Why don't they use files?
> Thanks,
> Prema