Hi,

We've actually hit latency problems with local disks (image-based storage) this week, and I've run multiple benchmarks with various options. Our goal is the best possible latency/IOPS for random synchronous 8K writes on local NVMe at a queue depth of 1 (this is what our DB does). Since our writes are synchronous, my numbers will be much lower than your 4K random writes (an approximate fio invocation for our workload is sketched at the end of this message).

Our hardware:

- 2x Intel Xeon Silver 4214R
- 16x 16GB DDR4
- 2x NVMe (WD SN630 3.2TB) in RAID0 (using LVM)

The VM is a Debian 9 image with hw_disk_bus=scsi set in the image metadata.

With this setup we started at 4800 IOPS and ~0.3ms latency using the standard settings, and reached 17.8K IOPS at ~0.054ms latency after some optimizations. These are the settings that changed the performance numbers (rough command/config sketches for each follow at the end of this message):

- Change the I/O scheduler to noop inside the VM (echo 'noop' > /sys/block/sda/queue/scheduler).

- Set scaling_governor=performance on the compute host, replacing the default "schedutil". I've noticed this is the most significant change at queue depth 1 when there is no other load on the host. Alternatively, putting artificial CPU load on the VM while running the benchmark also improves I/O latency. I suspect that keeping the CPU clocks high, whether via the governor setting or via artificial CPU usage, has a significant impact. It may also keep the CPU out of deeper C-states, but I did not investigate that further.

- Set io='native' in the libvirt configuration. OpenStack sets this automatically when you use preallocated images (https://docs.openstack.org/nova/xena/configuration/config.html#DEFAULT.preal...).

- Use LVM-backed images instead of thin-provisioned qcow2, as you've already tried.

- Change the "bus" parameter to "virtio" instead of scsi.

I did not benchmark all of those changes combined, because we had already achieved the required performance. In the end we only set the I/O scheduler to noop, and in production we will probably rely on the workload's own CPU usage to keep the CPU busy, preventing it from dropping into deeper C-states and lower clocks.
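For reference, our benchmark can be approximated with a fio invocation along these lines (a sketch; the device name and runtime are placeholders, and psync with --sync=1 is how I'd model the DB's synchronous writes):

    # random synchronous 8K writes at queue depth 1, direct I/O
    # /dev/vdb and the 60s runtime are placeholders - adjust to your setup
    fio --name=db-sim --filename=/dev/vdb --rw=randwrite --bs=8k \
        --ioengine=psync --sync=1 --direct=1 --iodepth=1 \
        --runtime=60 --time_based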
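To be precise about the scheduler change (sda is our VM's disk, so substitute your own device; on newer multi-queue kernels the equivalent scheduler is called 'none'):

    # inside the VM: list available schedulers ([...] marks the active one)
    cat /sys/block/sda/queue/scheduler
    # switch to noop (use 'none' on blk-mq kernels)
    echo 'noop' > /sys/block/sda/queue/scheduler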
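The governor can be set per CPU through sysfs on the compute host, roughly as follows (the cpupower alternative assumes the kernel tools package is installed):

    # on the compute host: set the performance governor on every core
    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance > "$g"
    done
    # or equivalently, with the cpupower utility
    cpupower frequency-set -g performance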
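For the io='native' change, you can check what a running instance actually got (the domain name instance-00000001 is just an example; cache='none' is what I'd expect next to io='native', but verify on your own hosts):

    # on the compute host: inspect the disk driver settings of an instance
    virsh dumpxml instance-00000001 | grep "driver name"
    # you want to end up with something like:
    #   <driver name='qemu' type='raw' cache='none' io='native'/>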
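Switching to LVM-backed instance storage is done in nova.conf; a sketch (the volume group name "nova-vg" is hypothetical, point it at the VG on your local NVMe):

    [libvirt]
    images_type = lvm
    # hypothetical VG name
    images_volume_group = nova-vg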
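And the bus change is plain image metadata, applied to new instances booted from the image (IMAGE_ID is a placeholder):

    openstack image set --property hw_disk_bus=virtio IMAGE_ID

On 07.01.2022 04:54, Eric K. Miller wrote: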
Hi Laurent,
I thought I might have already done some benchmarks, and it turns out I did, long ago, for the discussion I started a couple of years ago (on August 6, 2020, to be exact).
I copied the results from that email below. You can see that the latency difference between bare metal and a VM is pretty significant (13.75x with random 4KiB reads), which is about the same as the difference in IOPS. The difference for writes is not quite as bad, at 8.4x.
Eric
Some numbers from fio, just to get an idea for how good/bad the IOPS will be:
Configuration:

- 32-core EPYC 7502P with 512GiB of RAM
- CentOS 7, latest updates
- Kolla Ansible (Stein) deployment
- 32 vCPU VM with 64GiB of RAM
- 32 x 10GiB test files (I'm using file tests, not raw device tests, so not optimal, but easiest when the VM root disk is the test disk)
- iodepth=10, numjobs=32, runtime=30 seconds
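In fio terms, that setup corresponds roughly to the following (a sketch: the test directory is a placeholder, and ioengine=libaio with direct=1 are assumptions, not stated parameters):

    # 32 jobs, one 10GiB file each, queue depth 10, random 4KiB reads
    fio --name=rand4k-read --directory=/mnt/test --size=10g \
        --numjobs=32 --iodepth=10 --ioengine=libaio --direct=1 \
        --rw=randread --bs=4k --runtime=30 --time_based --group_reporting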
The VM was deployed using a qcow2 image, then deployed as a raw image, to see the difference in performance. There was none, which makes sense, since I'm pretty sure the qcow2 image was decompressed and stored in the LVM logical volume - so both tests were measuring the same thing.
Bare metal (random 4KiB reads): 8066MiB/sec, 154.34 microsecond avg latency, 2.065 million IOPS
VM qcow2 (random 4KiB reads): 589MiB/sec, 2122.10 microsecond avg latency, 151k IOPS
Bare metal (random 4KiB writes): 4940MiB/sec, 252.44 microsecond avg latency, 1.265 million IOPS
VM qcow2 (random 4KiB writes): 589MiB/sec, 2119.16 microsecond avg latency, 151k IOPS
Since the read and write VM results are nearly identical, my assumption is that the emulation layer is the bottleneck. CPUs in the VM were all at 55% utilization (all kernel usage). The qemu process on the bare metal machine indicated 1600% (or so) CPU utilization.
Below are runs with sequential 1MiB block tests
Bare metal (sequential 1MiB reads): 13.3GiB/sec, 23446.43 microsecond avg latency, 13.7k IOPS
VM qcow2 (sequential 1MiB reads): 8378MiB/sec, 38164.52 microsecond avg latency, 8377 IOPS
Bare metal (sequential 1MiB writes): 8098MiB/sec, 39488.00 microsecond avg latency, 8097 IOPS
VM qcow2 (sequential 1MiB writes): 8087MiB/sec, 39534.96 microsecond avg latency, 8087 IOPS
--
Damian Pietras
HardIT