<p dir="ltr">Hi Matt, James,</p>
<p dir="ltr">any thoughts on the below notes?</p>
<p dir="ltr">Best Regards,<br>
Prema</p>
<div class="gmail_quote">On 19 Jan 2016 20:47, "Premysl Kouril" <<a href="mailto:premysl.kouril@gmail.com">premysl.kouril@gmail.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi James,<br>
<br>
<br>
><br>
> You still haven't answered Anita's question: when you say "sponsor" do<br>
> you mean provide resources to existing developers to work on your<br>
> feature or provide new developers.<br>
><br>
<br>
I did; I am copy-pasting my response to Anita here again:<br>
<br>
Both. We are first trying the former ("Are you asking for current<br>
Nova developers to work on this feature?"), and if we don't find<br>
anybody we will move on to the latter ("Is your company interested in<br>
having your developers interact with Nova developers?").<br>
<br>
<br>
><br>
> Heh, this is history repeating itself from over a decade ago when<br>
> Oracle would have confidently told you that Linux had to have raw<br>
> devices because that's the only way a database will perform. Fast<br>
> forward to today and all oracle databases use file backends.<br>
><br>
> Simplicity is also in the eye of the beholder. LVM has a very simple<br>
> naming structure whereas filesystems have complex hierarchical ones.<br>
> Once you start trying to scale to millions of instances, you'll find<br>
> there's quite a management penalty for the LVM simplicity.<br>
<br>
We definitely won't have millions of instances on our hypervisors,<br>
but we can certainly have applications demanding a million IOPS (in<br>
sum) from a hypervisor in the near future.<br>
<br>
><br>
>> It seems from our benchmarks that LVM behavior when<br>
>> processing many IOPs (10s of thousands) is more stable than if<br>
>> filesystem is used as backend.<br>
><br>
> It sounds like you haven't enabled directio here ... that was the<br>
> solution to the oracle issue.<br>
<br>
<br>
If you mean O_DIRECT mode, then we did have it enabled during our benchmarks.<br>
Here is our benchmark setup and results:<br>
<br>
Testing box configuration:<br>
<br>
CPU: 4x E7-8867 v3 (total of 64 physical cores)<br>
RAM: 1TB<br>
Storage: 12x enterprise-class SSD disks (each rated at<br>
140 000/120 000 read/write IOPS), connected via 12Gb/s SAS3 lanes<br>
<br>
So we are using big boxes which can run quite a lot of VMs.<br>
<br>
Out of the disks we create a Linux md RAID array (we tested RAID5 and<br>
RAID10) and do some fine tuning (see the sketch below):<br>
<br>
1) echo 8 > /sys/block/md127/md/group_thread_cnt - this increases<br>
parallelism for RAID5<br>
2) we boot the kernel with scsi_mod.use_blk_mq=Y to activate block<br>
I/O multi-queuing<br>
3) we increase the cache size (for RAID5)<br>
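<br>
Concretely, the tuning looks roughly like this (the tunable name and<br>
the value in step 3 are from memory, so treat them as approximate):<br>
<br>
# 1) more threads processing RAID5 stripes:<br>
echo 8 > /sys/block/md127/md/group_thread_cnt<br>
# 2) kernel command line parameter (set in the bootloader config):<br>
#    scsi_mod.use_blk_mq=Y<br>
# 3) bigger RAID5 cache; I believe this was the stripe cache:<br>
echo 8192 > /sys/block/md127/md/stripe_cache_size<br>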
<br>
On that RAID we create either an LVM volume group or a filesystem,<br>
depending on whether we are testing the LVM Nova backend or the<br>
file-based Nova backend.<br>
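<br>
For illustration, the setup is roughly the following (device names<br>
and the volume group name are only examples):<br>
<br>
mdadm --create /dev/md127 --level=5 --raid-devices=12 /dev/sd[b-m]<br>
# LVM backend: Nova carves logical volumes out of this group<br>
# (images_type=lvm and images_volume_group in nova.conf)<br>
pvcreate /dev/md127<br>
vgcreate nova-vg /dev/md127<br>
# file backend: Nova keeps disk image files on this filesystem<br>
mkfs.ext4 /dev/md127<br>
mount /dev/md127 /var/lib/nova/instances<br>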
<br>
<br>
On this hypervisor we run Nova/KVM, provision 10-20 VMs, run the<br>
benchmark from inside those VMs, and try to saturate the I/O on the<br>
hypervisor.<br>
<br>
We use the following command inside the VMs:<br>
<br>
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1<br>
--name=test1 --bs=4k --iodepth=256 --size=20G --numjobs=1<br>
--readwrite=randwrite<br>
<br>
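To get the aggregate numbers we run this concurrently in all the VMs<br>
and sum the per-VM results, roughly like this (hostnames are just<br>
illustrative):<br>
<br>
for vm in vm0{1..8}; do<br>
  ssh $vm "fio --randrepeat=1 --ioengine=libaio --direct=1 ..." &<br>
done<br>
wait<br>
<br>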
So you can see that in the guest OS we use --direct=1, which causes<br>
the test file to be opened with O_DIRECT. Actually, I am not sure<br>
now, but when the file-based backend is used I hope that the virtual<br>
disk file itself is also opened with O_DIRECT, and that libvirt/qemu<br>
does this by default without any explicit configuration.<br>
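<br>
If it is not the default, I believe the cache mode can be set<br>
explicitly in nova.conf (cache=none is the mode that maps to<br>
O_DIRECT):<br>
<br>
[libvirt]<br>
disk_cachemodes = file=none,block=none<br>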
<br>
Anyway, with this setup we get the following results:<br>
<br>
If we use the file-based backend in Nova, an ext4 filesystem and<br>
RAID5, then across 8 parallel VMs we were able to achieve ~3000 IOPS<br>
per machine, which means about 32000 IOPS in total.<br>
<br>
If we use the LVM-based backend, RAID5 and 8 parallel VMs, we achieve<br>
~11000 IOPS per machine, about 90000 IOPS in total.<br>
<br>
This is a significant difference.<br>
<br>
This test was done about half a year ago by one of our engineers who<br>
no longer works for us, but we still have the box and the whole<br>
setup, so if the community is interested I can re-run the tests,<br>
validate the results again, do any reconfiguration, etc.<br>
<br>
<br>
<br>
> And this was precisely the Oracle argument. The reason it foundered is<br>
> that most FS complexity goes to manage the data structures ... the I/O<br>
> path can still be made short and fast, as DirectIO demonstrates. Then<br>
> the management penalty you pay (having to manage all the data<br>
> structures that the filesystem would have managed for you) starts to<br>
> outweigh any minor performance advantages.<br>
<br>
The only thing O_DIRECT does is instruct the kernel to skip the page<br>
cache for a file opened in this mode; the rest of the filesystem<br>
complexity (extent mapping, block allocation, metadata journaling)<br>
remains in the I/O datapath. Note, for example, that we did a test of<br>
the file-based backend with BTRFS, and the results were absolutely<br>
horrible: there is just too much work a filesystem has to do when<br>
processing I/Os, and we believe a lot of it is simply unnecessary<br>
when the storage is only used to store virtual disks.<br>
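<br>
A quick way to see the distinction (the mount point is illustrative):<br>
the direct variant below skips the page cache, yet the filesystem<br>
still allocates blocks, maps extents and journals metadata for it:<br>
<br>
# buffered write: goes through the page cache<br>
dd if=/dev/zero of=/mnt/ext4/testfile bs=4k count=100000<br>
# direct write: O_DIRECT, bypasses only the page cache<br>
dd if=/dev/zero of=/mnt/ext4/testfile bs=4k count=100000 oflag=direct<br>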
<br>
Anyway, I am really glad that you brought up these points; we are<br>
happy to reconsider our decisions, so let's have a discussion. I am<br>
sure we missed many things when we were evaluating both backends.<br>
<br>
One more question: what about Cinder? I think its reference driver<br>
uses LVM for storing volumes, right? Why don't they use files?<br>
<br>
Thanks,<br>
Prema<br>
</blockquote></div>