Open Stack

Tue Jan 19 19:47:07 UTC 2016

Hi James,

>
> You still haven't answered Anita's question: when you say "sponsor" do
> you mean provide resources to existing developers to work on your
> feature or provide new developers.
>

I did, I am copy-pasting my response to Anita here again:

Both. We are first trying the "Are you asking for current Nova
developers to work on this feature?" route, and if we don't find
anybody we will move on to "your company interested in having your
developers interact with Nova developers".

>
> Heh, this is history repeating itself from over a decade ago when
> Oracle would have confidently told you that Linux had to have raw
> devices because that's the only way a database will perform.  Fast
> forward to today and all oracle databases use file backends.
>
> Simplicity is also in the eye of the beholder.  LVM has a very simple
> naming structure whereas filesystems have complex hierarchical ones.
>  Once you start trying to scale to millions of instances, you'll find
> there's quite a management penalty for the LVM simplicity.

We definitely won't have millions of instances on a hypervisor, but we
can certainly have applications demanding a million IOPS (in total)
from a hypervisor in the near future.

>
>>  It seems from our benchmarks that LVM behavior when
>> processing many IOPs (10s of thousands) is more stable than if
>> filesystem is used as backend.
>
> It sounds like you haven't enabled directio here ... that was the
> solution to the oracle issue.

If you mean O_DIRECT mode, then yes, we had that enabled during our
benchmarks. Here is our benchmark setup and results:

testing box configuration:

  CPU: 4x E7-8867 v3 (total of 64 physical cores)
  RAM: 1TB
  Storage: 12x enterprise-class SSD disks (each disk 140 000/120 000
           IOPS read/write), connected via 12Gb/s SAS3 lanes

  So we are using big boxes which can run quite a lot of VMs.

  Out of these disks we create a Linux md RAID array (we tried RAID5
and RAID10) and do some fine tuning:

1) echo 8 > /sys/block/md127/md/group_thread_cnt - this increases
parallelism for raid5
2) we boot the kernel with scsi_mod.use_blk_mq=Y to activate block I/O
multi-queueing
3) we increase the size of the cache (for raid5)
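
For anyone who wants to try the same tuning, the md steps above could be
scripted roughly as below. This is only a sketch under assumptions: I am
assuming that the "cache" in step 3 maps to md's stripe_cache_size knob,
the value 8192 is a placeholder and not the value from our tests, and the
helper names write_tunable and tune_md_raid5 are made up for illustration.
By default it only prints what it would do.

```shell
# Sketch only: print (dry run) or apply md RAID5 tunables.
# Device name and stripe cache value are illustrative placeholders.

# write_tunable VALUE PATH
# Prints the command when DRY_RUN=1 (the default), otherwise writes VALUE.
write_tunable() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'echo %s > %s\n' "$1" "$2"
    else
        printf '%s\n' "$1" > "$2"
    fi
}

# tune_md_raid5 [MD_DEVICE] [STRIPE_CACHE]
tune_md_raid5() {
    md_dev=${1:-md127}
    stripe_cache=${2:-8192}   # hypothetical value, not from our benchmark

    # Step 1: more worker threads for RAID5 stripe handling.
    write_tunable 8 "/sys/block/$md_dev/md/group_thread_cnt"

    # Step 3 (assumption): the cache being enlarged is stripe_cache_size.
    write_tunable "$stripe_cache" "/sys/block/$md_dev/md/stripe_cache_size"
}
```

Running `tune_md_raid5 md127` prints the two writes; with `DRY_RUN=0` and
root privileges it applies them. Step 2 is a boot parameter, so it can only
be checked here, for example with
`grep -o 'scsi_mod.use_blk_mq=[^ ]*' /proc/cmdline`.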

 On that RAID we create either an LVM volume group or a filesystem,
depending on whether we are testing the LVM Nova backend or the
file-based Nova backend.

On this hypervisor we run nova/kvm and provision 10-20 VMs. We run the
benchmark from inside these VMs, trying to saturate I/O on the
hypervisor.

We use the following command inside the VMs:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
    --name=test1 --bs=4k --iodepth=256 --size=20G --numjobs=1 \
    --readwrite=randwrite

So you can see that in the guest OS we use --direct=1, which causes the
test file to be opened with O_DIRECT. I am actually not sure about the
host side now: with the file-based backend I hope that libvirt/qemu
opens the virtual disk with O_DIRECT by default, without any explicit
configuration.
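
One way to check this on the hypervisor would be to look at the cache
attribute of the disk <driver> element in the libvirt domain XML. As far
as I know, cache='none' (and 'directsync') make qemu open the image with
O_DIRECT. A minimal sketch; the helper name disk_cache_modes is made up
for illustration:

```shell
# disk_cache_modes: read libvirt domain XML on stdin and print the value of
# the cache= attribute of every <driver> element, one per line.
# Prints nothing for disks that do not set cache= explicitly.
disk_cache_modes() {
    grep -o "<driver[^>]*>" \
        | grep -o "cache=['\"][a-z]*['\"]" \
        | tr -d "'\"" \
        | cut -d= -f2
}
```

Usage would be something like `virsh dumpxml <domain> | disk_cache_modes`.
If nothing is printed, the disk has no explicit cache mode in its XML and
the hypervisor default applies, so that case would need a closer look.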

Anyway, with this setup we got the following results:

With the file-based backend in Nova, an ext4 filesystem and RAID5, with
8 parallel VMs we were able to achieve ~3000 IOPS per machine, which
means about 32000 IOPS in total.

With the LVM-based backend, RAID5 and 8 parallel VMs, we achieve ~11000
IOPS per machine, about 90000 IOPS in total.

This is a significant difference.

This test was done about half a year ago by one of our engineers who
no longer works for us, but we still have the box and everything, so
if the community is interested I can re-run the tests, validate the
results again, do any reconfiguration, etc.

> And this was precisely the Oracle argument.  The reason it foundered is
> that most FS complexity goes to manage the data structures ... the I/O
> path can still be made short and fast, as DirectIO demonstrates.  Then
> the management penalty you pay (having to manage all the data
> structures that the filesystem would have managed for you) starts to
> outweigh any minor performance advantages.

The only thing O_DIRECT does is instruct the kernel to skip the
filesystem cache for the file opened in this mode. The rest of the
filesystem complexity remains in the I/O datapath. For example, we ran
a test on the file-based backend with BTRFS and the results were
absolutely horrible: there is just too much the filesystem has to do
when processing I/Os, and we believe a lot of it is simply not
necessary when the storage is only used to store virtual disks.

Anyway, I am really glad that you brought up these views. We are happy
to reconsider our decisions, so let's have a discussion; I am sure we
missed many things when we were evaluating both backends.

One more question: what about Cinder? I think they are using LVM for
storing volumes, right? Why don't they use files?

Thanks,
Prema

[openstack-dev] [Nova] sponsor some LVM development
