<div dir="ltr"><div>@Vahric, FYI, if you use direct I/O instead of sync writes (which is what a database is configured for by default), you will just be measuring the RBD cache. Look at the latency in your numbers: it is lower than the time a packet needs to traverse the network. You'll need to use sync=1 if you want to see what the performance is like for sync writes. You can reduce that latency with higher CPU frequencies (change the governor), disabling C-states, a better network, the right NVMe for the journal, and other tuning. In the end, we're happy to see even 500-600 IOPS for sync writes with numjobs=1, iodepth=1 (256 is unreasonable).<br><br></div><div>@Luis, since this is an OpenStack list, I assume he is accessing it via Cinder.<br><br></div>Warren<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Feb 17, 2017 at 7:11 AM, Luis Periquito <span dir="ltr"><<a href="mailto:periquito@gmail.com" target="_blank">periquito@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">There is quite some information missing: how much RAM do the nodes<br>

have? What SSDs? What kernel (there have been complaints of a<br>

performance regression on 4.4+).<br>

<br>

You also never state how you have configured the OSDs, their journals,<br>

filestore or bluestore, etc...<br>

<br>

You never specify how you're accessing the RBD device...<br>

<br>

For you to achieve high IOPS you need higher frequency CPUs. Also you<br>

have to remember that the scale-out architecture of Ceph means the<br>

more nodes you add the better performance you'll have.<br>

<div><div class="h5"><br>

On Thu, Feb 16, 2017 at 4:26 PM, Vahric Muhtaryan <<a href="mailto:vahric@doruk.net.tr">vahric@doruk.net.tr</a>> wrote:<br>

> Hello all,<br>

><br>

> We have been testing Ceph for a long time, from Firefly to Kraken, and have tried<br>

> the common optimisations: testing tcmalloc<br>

> versions 2.1 and 2.4, jemalloc, setting debug levels to 0/0, op_tracker, and<br>

> others. I believe that with our hardware we have almost reached the end of the road.<br>

><br>

> Some vendor tests confused us a lot, such as Samsung's<br>

> <a href="http://www.samsung.com/semiconductor/support/tools-utilities/All-Flash-Array-Reference-Design/downloads/Samsung_NVMe_SSDs_and_Red_Hat_Ceph_Storage_CS_20160712.pdf" rel="noreferrer" target="_blank">http://www.samsung.com/<wbr>semiconductor/support/tools-<wbr>utilities/All-Flash-Array-<wbr>Reference-Design/downloads/<wbr>Samsung_NVMe_SSDs_and_Red_Hat_<wbr>Ceph_Storage_CS_20160712.pdf</a><br>

> , Dell's PowerEdge R730xd Performance and Sizing Guide for Red Hat … and<br>

> from Intel<br>

> <a href="http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2015/20150813_S303E_Zhang.pdf" rel="noreferrer" target="_blank">http://www.flashmemorysummit.<wbr>com/English/Collaterals/<wbr>Proceedings/2015/20150813_<wbr>S303E_Zhang.pdf</a><br>

><br>

> In the end we tested with 3 replicas (most vendors actually test with 2, but<br>

> I believe that is the wrong way to do it, because when a failure<br>

> happens you have to wait 300 seconds, which is configurable, but from blogs we<br>

> understood that OSDs can sometimes go down and come up again, so I<br>

> believe it is important to set that value carefully, and we do not want instances<br>

> to freeze), using the config below, with 4K, random, write-only I/O.<br>

><br>

> I have read a lot about the OSD processes consuming huge amounts of CPU; yes, they do, and we<br>

> know very well that we cannot get the full IOPS capacity of each raw SSD<br>

> drive.<br>

><br>

> My question is: can you please share test or production results from the same or a similar config, or from any<br>

> config? The key is writes, not 70% read / 30% write<br>

> or full-read tests …<br>

><br>

> Hardware :<br>

><br>

> 6 x Node<br>

> Each node has:<br>

> 2-socket CPU, 1.8 GHz each, 16 cores in total<br>

> 3 SSD + 12 HDD (SSDs are used as journals, 4 HDDs per SSD)<br>

> RAID cards configured as RAID 0<br>

> We did not see any performance difference with the RAID card's JBOD mode, so<br>

> we continued with RAID 0<br>

> Also, the RAID card's write-back cache is used because it adds extra IOPS too!<br>

><br>

> Achieved IOPS: 35K (single client)<br>

> We tested with up to 10 clients; Ceph shares the load fairly among them, at almost 4K IOPS<br>

> each<br>

><br>

> Test Command : fio --randrepeat=1 --ioengine=libaio --direct=1<br>

> --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=256 --size=1G<br>

> --numjobs=8 --readwrite=randwrite --group_reporting<br>

><br>

><br>

> Regards<br>

> Vahric Muhtaryan<br>

><br>

</div></div>> ______________________________<wbr>_________________<br>

> OpenStack-operators mailing list<br>

> <a href="mailto:OpenStack-operators@lists.openstack.org">OpenStack-operators@lists.<wbr>openstack.org</a><br>

> <a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators" rel="noreferrer" target="_blank">http://lists.openstack.org/<wbr>cgi-bin/mailman/listinfo/<wbr>openstack-operators</a><br>

><br>

<br>


</blockquote></div><br></div>
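A minimal sketch of the single-job, queue-depth-1 sync-write test Warren describes, together with the arithmetic behind his 500-600 IOPS figure. The fio flags are the ones from Vahric's original command with sync=1, numjobs=1 and iodepth=1 swapped in. The helper names `sync_write_fio_args` and `mean_latency_ms`, and the choice of Python to assemble the command line, are illustrative assumptions and not something anyone in the thread used.

```python
import shlex


def sync_write_fio_args(filename):
    """Build a fio command line for 4k random sync writes with one I/O in flight.

    Hypothetical helper: it only assembles the arguments, it does not run fio.
    """
    return [
        "fio",
        "--randrepeat=1",
        "--ioengine=libaio",
        "--direct=1",
        "--sync=1",          # open with O_SYNC so each write must be acknowledged as stable
        "--gtod_reduce=1",
        "--name=test",
        f"--filename={filename}",
        "--bs=4k",
        "--iodepth=1",       # one outstanding I/O, as Warren suggests
        "--size=1G",
        "--numjobs=1",       # one job, so exactly one write is in flight
        "--readwrite=randwrite",
        "--group_reporting",
    ]


def mean_latency_ms(iops, iodepth=1, numjobs=1):
    """Little's law: mean latency = I/Os in flight / throughput, in milliseconds."""
    in_flight = iodepth * numjobs
    return 1000.0 * in_flight / iops


if __name__ == "__main__":
    # Print the command so it can be pasted into a shell on the client.
    print(shlex.join(sync_write_fio_args("test")))
    for iops in (500, 600):
        print(f"{iops} IOPS at depth 1 -> {mean_latency_ms(iops):.2f} ms per write")
    # 500 IOPS at depth 1 -> 2.00 ms per write
    # 600 IOPS at depth 1 -> 1.67 ms per write
```

With a single write in flight, IOPS is simply the inverse of the per-write round trip, so 500-600 IOPS corresponds to roughly 1.7-2 ms per acknowledged write.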