<p dir="ltr">Hi Matt, James,</p>
<p dir="ltr">any thoughts on the below notes?</p>
<p dir="ltr">Best Regards,<br>
Prema</p>
<div class="gmail_quote">On 19 Jan 2016 20:47, "Premysl Kouril" <<a href="mailto:premysl.kouril@gmail.com">premysl.kouril@gmail.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi James,<br>
<br>
<br>
><br>
> You still haven't answered Anita's question: when you say "sponsor" do<br>
> you mean provide resources to existing developers to work on your<br>
> feature or provide new developers.<br>
><br>
<br>
I did; I am copy-pasting my response to Anita here again:<br>
<br>
Both. We are first trying the former ("Are you asking for current<br>
Nova developers to work on this feature?"), and if we don't find<br>
anybody we will move on to the latter ("Is your company interested in<br>
having your developers interact with Nova developers?").<br>
<br>
<br>
><br>
> Heh, this is history repeating itself from over a decade ago when<br>
> Oracle would have confidently told you that Linux had to have raw<br>
> devices because that's the only way a database will perform. Fast<br>
> forward to today and all oracle databases use file backends.<br>
><br>
> Simplicity is also in the eye of the beholder. LVM has a very simple<br>
> naming structure whereas filesystems have complex hierarchical ones.<br>
> Once you start trying to scale to millions of instances, you'll find<br>
> there's quite a management penalty for the LVM simplicity.<br>
<br>
We definitely won't have millions of instances on our hypervisors,<br>
but we can certainly have applications demanding a million IOPS (in<br>
sum) from a hypervisor in the near future.<br>
<br>
><br>
>> It seems from our benchmarks that LVM behavior when<br>
>> processing many IOPs (10s of thousands) is more stable than if<br>
>> filesystem is used as backend.<br>
><br>
> It sounds like you haven't enabled directio here ... that was the<br>
> solution to the oracle issue.<br>
<br>
<br>
If you mean O_DIRECT mode, then we did have it enabled during our benchmarks.<br>
Here is our benchmark setup and results:<br>
<br>
Testing box configuration:<br>
<br>
CPU: 4x E7-8867 v3 (total of 64 physical cores)<br>
RAM: 1TB<br>
Storage: 12x enterprise-class SSD disks (each rated at<br>
140 000/120 000 read/write IOPS), connected via 12Gb/s SAS3 lanes<br>
<br>
So we are using big boxes which can run quite a lot of VMs.<br>
<br>
Out of the disks we create a Linux md RAID array (we tested RAID5 and<br>
RAID10) and do some fine tuning (see the sketch below):<br>
<br>
1) echo 8 > /sys/block/md127/md/group_thread_cnt - this increases<br>
parallelism for RAID5<br>
2) we boot the kernel with scsi_mod.use_blk_mq=Y to activate block<br>
I/O multi-queuing<br>
3) we increase the cache size (for RAID5)<br>
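<br>
Concretely, the tuning looks roughly like this (the tunable name and<br>
the value in step 3 are from memory, so treat them as approximate):<br>
<br>
# 1) more threads processing RAID5 stripes:<br>
echo 8 > /sys/block/md127/md/group_thread_cnt<br>
# 2) kernel command line parameter (set in the bootloader config):<br>
#    scsi_mod.use_blk_mq=Y<br>
# 3) bigger RAID5 cache; I believe this was the stripe cache:<br>
echo 8192 > /sys/block/md127/md/stripe_cache_size<br>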
<br>
On that RAID we create either an LVM volume group or a filesystem,<br>
depending on whether we are testing the LVM Nova backend or the<br>
file-based Nova backend.<br>
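<br>
For illustration, the setup is roughly the following (device names<br>
and the volume group name are only examples):<br>
<br>
mdadm --create /dev/md127 --level=5 --raid-devices=12 /dev/sd[b-m]<br>
# LVM backend: Nova carves logical volumes out of this group<br>
# (images_type=lvm and images_volume_group in nova.conf)<br>
pvcreate /dev/md127<br>
vgcreate nova-vg /dev/md127<br>
# file backend: Nova keeps disk image files on this filesystem<br>
mkfs.ext4 /dev/md127<br>
mount /dev/md127 /var/lib/nova/instances<br>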
<br>
<br>
On this hypervisor we run Nova/KVM, provision 10-20 VMs, run the<br>
benchmark from inside those VMs, and try to saturate the I/O on the<br>
hypervisor.<br>
<br>
We use the following command inside the VMs:<br>
<br>
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1<br>
--name=test1 --bs=4k --iodepth=256 --size=20G --numjobs=1<br>
--readwrite=randwrite<br>
<br>
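To get the aggregate numbers we run this concurrently in all the VMs<br>
and sum the per-VM results, roughly like this (hostnames are just<br>
illustrative):<br>
<br>
for vm in vm0{1..8}; do<br>
  ssh $vm "fio --randrepeat=1 --ioengine=libaio --direct=1 ..." &<br>
done<br>
wait<br>
<br>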
So you can see that in the guest OS we use --direct=1, which causes<br>
the test file to be opened with O_DIRECT. Actually, I am not sure<br>
now, but when the file-based backend is used I hope that the virtual<br>
disk file itself is also opened with O_DIRECT, and that libvirt/qemu<br>
does this by default without any explicit configuration.<br>
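<br>
If it is not the default, I believe the cache mode can be set<br>
explicitly in nova.conf (cache=none is the mode that maps to<br>
O_DIRECT):<br>
<br>
[libvirt]<br>
disk_cachemodes = file=none,block=none<br>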
<br>
Anyway, with this setup we get the following results:<br>
<br>
If we use the file-based backend in Nova, an ext4 filesystem and<br>
RAID5, then across 8 parallel VMs we were able to achieve ~3000 IOPS<br>
per machine, which means about 32000 IOPS in total.<br>
<br>
If we use the LVM-based backend, RAID5 and 8 parallel VMs, we achieve<br>
~11000 IOPS per machine, about 90000 IOPS in total.<br>
<br>
This is a significant difference.<br>
<br>
This test was done about half a year ago by one of our engineers who<br>
no longer works for us, but we still have the box and the whole<br>
setup, so if the community is interested I can re-run the tests,<br>
validate the results again, do any reconfiguration, etc.<br>
<br>
<br>
<br>
> And this was precisely the Oracle argument. The reason it foundered is<br>
> that most FS complexity goes to manage the data structures ... the I/O<br>
> path can still be made short and fast, as DirectIO demonstrates. Then<br>
> the management penalty you pay (having to manage all the data<br>
> structures that the filesystem would have managed for you) starts to<br>
> outweigh any minor performance advantages.<br>
<br>
The only thing O_DIRECT does is instruct the kernel to skip the page<br>
cache for a file opened in this mode; the rest of the filesystem<br>
complexity (extent mapping, block allocation, metadata journaling)<br>
remains in the I/O datapath. Note, for example, that we did a test of<br>
the file-based backend with BTRFS, and the results were absolutely<br>
horrible: there is just too much work a filesystem has to do when<br>
processing I/Os, and we believe a lot of it is simply unnecessary<br>
when the storage is only used to store virtual disks.<br>
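<br>
A quick way to see the distinction (the mount point is illustrative):<br>
the direct variant below skips the page cache, yet the filesystem<br>
still allocates blocks, maps extents and journals metadata for it:<br>
<br>
# buffered write: goes through the page cache<br>
dd if=/dev/zero of=/mnt/ext4/testfile bs=4k count=100000<br>
# direct write: O_DIRECT, bypasses only the page cache<br>
dd if=/dev/zero of=/mnt/ext4/testfile bs=4k count=100000 oflag=direct<br>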
<br>
Anyway, I am really glad that you brought up these points; we are<br>
happy to reconsider our decisions, so let's have a discussion. I am<br>
sure we missed many things when we were evaluating both backends.<br>
<br>
One more question: what about Cinder? I think its reference driver<br>
uses LVM for storing volumes, right? Why don't they use files?<br>
<br>
Thanks,<br>
Prema<br>
</blockquote></div>