Hi,

We've actually hit latency problems with local disks (image-based storage) this week, and I've run multiple benchmarks with various options. Our goal is the best possible latency/IOPS for random synchronous 8K writes on local NVMe at a queue depth of 1 (this is what our DB does). Since our writes are synchronous, my numbers will be much lower than your 4K random writes (an approximate fio invocation for our workload is sketched at the end of this message).

Our hardware:

- 2x Intel Xeon Silver 4214R
- 16x 16GB DDR4
- 2x NVMe (WD SN630 3.2TB) in RAID0 (using LVM)

The VM is a Debian 9 image with hw_disk_bus=scsi set in the image metadata.

With this setup we started at 4800 IOPS and ~0.3ms latency using the standard settings, and reached 17.8K IOPS at ~0.054ms latency after some optimizations. These are the settings that changed the performance numbers (rough command/config sketches for each follow at the end of this message):

- Change the I/O scheduler to noop inside the VM (echo 'noop' > /sys/block/sda/queue/scheduler).

- Set scaling_governor=performance on the compute host, replacing the default "schedutil". I've noticed this is the most significant change at queue depth 1 when there is no other load on the host. Alternatively, putting artificial CPU load on the VM while running the benchmark also improves I/O latency. I suspect that keeping the CPU clocks high, whether via the governor setting or via artificial CPU usage, has a significant impact. It may also keep the CPU out of deeper C-states, but I did not investigate that further.

- Set io='native' in the libvirt configuration. OpenStack sets this automatically when you use preallocated images (https://docs.openstack.org/nova/xena/configuration/config.html#DEFAULT.preal...).

- Use LVM-backed images instead of thin-provisioned qcow2, as you've already tried.

- Change the "bus" parameter to "virtio" instead of scsi.

I did not benchmark all of those changes combined, because we had already achieved the required performance. In the end we only set the I/O scheduler to noop, and in production we will probably rely on the workload's own CPU usage to keep the CPU busy, preventing it from dropping into deeper C-states and lower clocks.
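For reference, our benchmark can be approximated with a fio invocation along these lines (a sketch; the device name and runtime are placeholders, and psync with --sync=1 is how I'd model the DB's synchronous writes):

    # random synchronous 8K writes at queue depth 1, direct I/O
    # /dev/vdb and the 60s runtime are placeholders - adjust to your setup
    fio --name=db-sim --filename=/dev/vdb --rw=randwrite --bs=8k \
        --ioengine=psync --sync=1 --direct=1 --iodepth=1 \
        --runtime=60 --time_based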
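To be precise about the scheduler change (sda is our VM's disk, so substitute your own device; on newer multi-queue kernels the equivalent scheduler is called 'none'):

    # inside the VM: list available schedulers ([...] marks the active one)
    cat /sys/block/sda/queue/scheduler
    # switch to noop (use 'none' on blk-mq kernels)
    echo 'noop' > /sys/block/sda/queue/scheduler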
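The governor can be set per CPU through sysfs on the compute host, roughly as follows (the cpupower alternative assumes the kernel tools package is installed):

    # on the compute host: set the performance governor on every core
    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance > "$g"
    done
    # or equivalently, with the cpupower utility
    cpupower frequency-set -g performance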
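For the io='native' change, you can check what a running instance actually got (the domain name instance-00000001 is just an example; cache='none' is what I'd expect next to io='native', but verify on your own hosts):

    # on the compute host: inspect the disk driver settings of an instance
    virsh dumpxml instance-00000001 | grep "driver name"
    # you want to end up with something like:
    #   <driver name='qemu' type='raw' cache='none' io='native'/>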
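Switching to LVM-backed instance storage is done in nova.conf; a sketch (the volume group name "nova-vg" is hypothetical, point it at the VG on your local NVMe):

    [libvirt]
    images_type = lvm
    # hypothetical VG name
    images_volume_group = nova-vg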
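And the bus change is plain image metadata, applied to new instances booted from the image (IMAGE_ID is a placeholder):

    openstack image set --property hw_disk_bus=virtio IMAGE_ID

On 07.01.2022 04:54, Eric K. Miller wrote: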
Hi Laurent,
I thought I might have already done some benchmarks, and it turns out I did, long ago, for the discussion I started a couple of years ago (on August 6, 2020, to be exact).
I copied the results from that email below. You can see that the latency difference between bare metal and a VM is pretty significant (13.75x with random 4KiB reads), which is about the same as the difference in IOPS. The difference for writes is not quite as bad, at 8.4x.
Eric
Some numbers from fio, just to get an idea for how good/bad the IOPS will be:
Configuration:

- 32-core EPYC 7502P with 512GiB of RAM
- CentOS 7, latest updates
- Kolla Ansible (Stein) deployment
- 32 vCPU VM with 64GiB of RAM
- 32 x 10GiB test files (I'm using file tests, not raw device tests, so not optimal, but easiest when the VM root disk is the test disk)
- iodepth=10, numjobs=32, runtime=30 seconds
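In fio terms, that setup corresponds roughly to the following (a sketch: the test directory is a placeholder, and ioengine=libaio with direct=1 are assumptions, not stated parameters):

    # 32 jobs, one 10GiB file each, queue depth 10, random 4KiB reads
    fio --name=rand4k-read --directory=/mnt/test --size=10g \
        --numjobs=32 --iodepth=10 --ioengine=libaio --direct=1 \
        --rw=randread --bs=4k --runtime=30 --time_based --group_reporting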
The VM was deployed using a qcow2 image, then deployed as a raw image, to see the difference in performance. There was none, which makes sense, since I'm pretty sure the qcow2 image was decompressed and stored in the LVM logical volume - so both tests were measuring the same thing.
Bare metal (random 4KiB reads): 8066MiB/sec, 154.34 microsecond avg latency, 2.065 million IOPS
VM qcow2 (random 4KiB reads): 589MiB/sec, 2122.10 microsecond avg latency, 151k IOPS
Bare metal (random 4KiB writes): 4940MiB/sec, 252.44 microsecond avg latency, 1.265 million IOPS
VM qcow2 (random 4KiB writes): 589MiB/sec, 2119.16 microsecond avg latency, 151k IOPS
Since the read and write VM results are nearly identical, my assumption is that the emulation layer is the bottleneck. CPUs in the VM were all at 55% utilization (all kernel usage). The qemu process on the bare metal machine indicated 1600% (or so) CPU utilization.
Below are runs with sequential 1MiB block tests
Bare metal (sequential 1MiB reads): 13.3GiB/sec, 23446.43 microsecond avg latency, 13.7k IOPS
VM qcow2 (sequential 1MiB reads): 8378MiB/sec, 38164.52 microsecond avg latency, 8377 IOPS
Bare metal (sequential 1MiB writes): 8098MiB/sec, 39488.00 microsecond avg latency, 8097 IOPS
VM qcow2 (sequential 1MiB writes): 8087MiB/sec, 39534.96 microsecond avg latency, 8087 IOPS
--
Damian Pietras
HardIT