[nova] iothread support with Libvirt
Damian Pietras
damian.pietras at hardit.pl
Wed Jan 12 17:14:31 UTC 2022
Hi,
We've actually hit latency issues with local disks (image-based storage)
this week and I've performed multiple benchmarks with various options.
Our goal is the best latency / IOPS for random synchronous 8K writes on
local NVMe with a queue depth of 1 (this is what our DB is doing).
Because our writes are synchronous, my numbers will be much lower than
your 4K random write numbers.
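For reference, a fio job roughly like the following reproduces that
access pattern; the file path, size and runtime here are just
placeholders, not our exact job:

  fio --name=db-sim --filename=/mnt/test/fio.dat --size=10G \
      --rw=randwrite --bs=8k --ioengine=psync --iodepth=1 --numjobs=1 \
      --direct=1 --sync=1 --time_based --runtime=60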
Our hardware:
2x INTEL Xeon Silver 4214R
16x 16GB DDR4
2x NVMe - WD SN630 3.2TB in RAID0 (using LVM)
VM is Debian 9 image with hw_disk_bus=scsi set in metadata
With our setup we started at 4,800 IOPS and ~0.3 ms latency with the
default settings and got to 17.8K IOPS with ~0.054 ms latency after
some tuning. These are the settings that made a measurable difference:
- Change the I/O scheduler to noop inside the VM (echo 'noop' >
/sys/block/sda/queue/scheduler).
- Set scaling_governor=performance on the compute host instead of the
default "schedutil" (example commands after this list). I've noticed
this is the most significant change at queue depth 1 when there is no
other load on the host. Alternatively, putting artificial CPU load on
the VM while running the benchmark also improves I/O latency. I assume
that keeping the CPU clocks high, either with the governor setting or
with artificial CPU usage, has a significant impact. It may also keep
the CPU out of deeper C-states, but I did not investigate that further.
- Set io='native' in the libvirt configuration. Nova sets this
automatically when you use preallocated images
(https://docs.openstack.org/nova/xena/configuration/config.html#DEFAULT.preallocate_images).
See the configuration sketch after this list.
- Use LVM-backed images instead of thin-provisioned qcow2, as you've
already tried.
- Change the "bus" parameter to "virtio" instead of "scsi".
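For the governor change, something along these lines on the compute
host should work (package names and sysfs paths may differ per
distribution):

  # check the current governor
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  # switch all cores to "performance" (cpupower is in linux-tools / kernel-tools)
  cpupower frequency-set -g performance
  # or directly via sysfs
  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > "$g"
  done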
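And a rough sketch of the Nova / image settings for the last three
items; the option names are taken from the Nova docs and the volume
group name is just an example, so double check against your release:

  # nova.conf on the compute node
  [DEFAULT]
  preallocate_images = space      # preallocated raw image, libvirt then gets io='native'

  [libvirt]
  images_type = lvm               # LVM-backed disks instead of qcow2 files
  images_volume_group = nova-vg   # existing VG on the host, adjust the name

  # image metadata for the virtio bus instead of scsi
  openstack image set --property hw_disk_bus=virtio <image>

  # the resulting disk driver line in the domain XML should then look like
  # <driver name='qemu' type='raw' cache='none' io='native'/>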
I did not benchmark all of those changes combined because we reached
the required performance. For now we only set the I/O scheduler to
noop, and in production we will probably rely on the regular CPU load
to keep the CPU busy, out of deeper C-states, and at higher clocks.
On 07.01.2022 04:54, Eric K. Miller wrote:
> Hi Laurent,
>
> I thought I may have already done some benchmarks, and it looks like I did, long ago, for the discussion that I created a couple years ago (on August 6, 2020 to be exact).
>
> I copied the results from that email below. You can see that the latency difference is pretty significant (13.75x with random 4KiB reads) between bare metal and a VM, which is about the same as the difference in IOPS. The difference for writes is not quite as bad, at 8.4x.
>
> Eric
>
>
> Some numbers from fio, just to get an idea for how good/bad the IOPS will be:
>
> Configuration:
> 32 core EPYC 7502P with 512GiB of RAM - CentOS 7 latest updates - Kolla Ansible (Stein) deployment
> 32 vCPU VM with 64GiB of RAM
> 32 x 10GiB test files (I'm using file tests, not raw device tests, so not optimal, but easiest when the VM root disk is the test disk)
> iodepth=10
> numjobs=32
> time=30 (seconds)
>
> The VM was deployed using a qcow2 image, then deployed as a raw image, to see the difference in performance. There was none, which makes sense, since I'm pretty sure the qcow2 image was decompressed and stored in the LVM logical volume - so both tests were measuring the same thing.
>
> Bare metal (random 4KiB reads):
> 8066MiB/sec
> 154.34 microsecond avg latency
> 2.065 million IOPS
>
> VM qcow2 (random 4KiB reads):
> 589MiB/sec
> 2122.10 microsecond avg latency
> 151k IOPS
>
> Bare metal (random 4KiB writes):
> 4940MiB/sec
> 252.44 microsecond avg latency
> 1.265 million IOPS
>
> VM qcow2 (random 4KiB writes):
> 589MiB/sec
> 2119.16 microsecond avg latency
> 151k IOPS
>
> Since the read and write VM results are nearly identical, my assumption is that the emulation layer is the bottleneck. CPUs in the VM were all at 55% utilization (all kernel usage). The qemu process on the bare metal machine indicated 1600% (or so) CPU utilization.
>
> Below are runs with sequential 1MiB block tests
>
> Bare metal (sequential 1MiB reads):
> 13.3GiB/sec
> 23446.43 microsecond avg latency
> 13.7k IOPS
>
> VM qcow2 (sequential 1MiB reads):
> 8378MiB/sec
> 38164.52 microsecond avg latency
> 8377 IOPS
>
> Bare metal (sequential 1MiB writes):
> 8098MiB/sec
> 39488.00 microsecond avg latency
> 8097 IOPS
>
> VM qcow2 (sequential 1MiB writes):
> 8087MiB/sec
> 39534.96 microsecond avg latency
> 8087 IOPS
--
Damian Pietras
HardIT