[nova] iothread support with Libvirt
Damian Pietras
damian.pietras at hardit.pl
Wed Jan 12 17:14:31 UTC 2022
Hi,
We've actually hit latency issues with local disks (image-based storage)
this week and I've performed multiple benchmarks with various options.
Our goal is the best latency / IOPS for random synchronous 8K writes on
local NVMe with a queue depth of 1 (this is what our DB is doing).
Because our writes are synchronous, my numbers will be much lower than
your 4K random write numbers.
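For reference, a fio job roughly like the following reproduces that
access pattern; the file path, size and runtime here are just
placeholders, not our exact job:

  fio --name=db-sim --filename=/mnt/test/fio.dat --size=10G \
      --rw=randwrite --bs=8k --ioengine=psync --iodepth=1 --numjobs=1 \
      --direct=1 --sync=1 --time_based --runtime=60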
Our hardware:
2x INTEL Xeon Silver 4214R
16x 16GB DDR4
2x NVMe - WD SN630 3.2TB in RAID0 (using LVM)
VM is Debian 9 image with hw_disk_bus=scsi set in metadata
With our setup we started at 4,800 IOPS and ~0.3 ms latency with the
default settings and got to 17.8K IOPS with ~0.054 ms latency after
some tuning. These are the settings that made a measurable difference:
- Change the I/O scheduler to noop inside the VM (echo 'noop' >
/sys/block/sda/queue/scheduler).
- Set scaling_governor=performance on the compute host instead of the
default "schedutil" (example commands after this list). I've noticed
this is the most significant change at queue depth 1 when there is no
other load on the host. Alternatively, putting artificial CPU load on
the VM while running the benchmark also improves I/O latency. I assume
that keeping the CPU clocks high, either with the governor setting or
with artificial CPU usage, has a significant impact. It may also keep
the CPU out of deeper C-states, but I did not investigate that further.
- Set io='native' in the libvirt configuration. Nova sets this
automatically when you use preallocated images
(https://docs.openstack.org/nova/xena/configuration/config.html#DEFAULT.preallocate_images).
See the configuration sketch after this list.
- Use LVM-backed images instead of thin-provisioned qcow2, as you've
already tried.
- Change the "bus" parameter to "virtio" instead of "scsi".
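For the governor change, something along these lines on the compute
host should work (package names and sysfs paths may differ per
distribution):

  # check the current governor
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  # switch all cores to "performance" (cpupower is in linux-tools / kernel-tools)
  cpupower frequency-set -g performance
  # or directly via sysfs
  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > "$g"
  done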
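And a rough sketch of the Nova / image settings for the last three
items; the option names are taken from the Nova docs and the volume
group name is just an example, so double check against your release:

  # nova.conf on the compute node
  [DEFAULT]
  preallocate_images = space      # preallocated raw image, libvirt then gets io='native'

  [libvirt]
  images_type = lvm               # LVM-backed disks instead of qcow2 files
  images_volume_group = nova-vg   # existing VG on the host, adjust the name

  # image metadata for the virtio bus instead of scsi
  openstack image set --property hw_disk_bus=virtio <image>

  # the resulting disk driver line in the domain XML should then look like
  # <driver name='qemu' type='raw' cache='none' io='native'/>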
I did not benchmark all of those changes combined because we reached
the required performance. For now we only set the I/O scheduler to
noop, and in production we will probably rely on the regular CPU load
to keep the CPU busy, out of deeper C-states, and at higher clocks.
On 07.01.2022 04:54, Eric K. Miller wrote:
> Hi Laurent,
>
> I thought I may have already done some benchmarks, and it looks like I did, long ago, for the discussion that I created a couple years ago (on August 6, 2020 to be exact).
>
> I copied the results from that email below. You can see that the latency difference is pretty significant (13.75x with random 4KiB reads) between bare metal and a VM, which is about the same as the difference in IOPS. The difference for writes is not quite as bad, at 8.4x.
>
> Eric
>
>
> Some numbers from fio, just to get an idea for how good/bad the IOPS will be:
>
> Configuration:
> 32 core EPYC 7502P with 512GiB of RAM - CentOS 7 latest updates - Kolla Ansible (Stein) deployment
> 32 vCPU VM with 64GiB of RAM
> 32 x 10GiB test files (I'm using file tests, not raw device tests, so not optimal, but easiest when the VM root disk is the test disk)
> iodepth=10
> numjobs=32
> time=30 (seconds)
>
> The VM was deployed using a qcow2 image, then deployed as a raw image, to see the difference in performance. There was none, which makes sense, since I'm pretty sure the qcow2 image was decompressed and stored in the LVM logical volume - so both tests were measuring the same thing.
>
> Bare metal (random 4KiB reads):
> 8066MiB/sec
> 154.34 microsecond avg latency
> 2.065 million IOPS
>
> VM qcow2 (random 4KiB reads):
> 589MiB/sec
> 2122.10 microsecond avg latency
> 151k IOPS
>
> Bare metal (random 4KiB writes):
> 4940MiB/sec
> 252.44 microsecond avg latency
> 1.265 million IOPS
>
> VM qcow2 (random 4KiB writes):
> 589MiB/sec
> 2119.16 microsecond avg latency
> 151k IOPS
>
> Since the read and write VM results are nearly identical, my assumption is that the emulation layer is the bottleneck. CPUs in the VM were all at 55% utilization (all kernel usage). The qemu process on the bare metal machine indicated 1600% (or so) CPU utilization.
>
> Below are runs with sequential 1MiB block tests
>
> Bare metal (sequential 1MiB reads):
> 13.3GiB/sec
> 23446.43 microsecond avg latency
> 13.7k IOPS
>
> VM qcow2 (sequential 1MiB reads):
> 8378MiB/sec
> 38164.52 microsecond avg latency
> 8377 IOPS
>
> Bare metal (sequential 1MiB writes):
> 8098MiB/sec
> 39488.00 microsecond avg latency
> 8097 IOPS
>
> VM qcow2 (sequential 1MiB writes):
> 8087MiB/sec
> 39534.96 microsecond avg latency
> 8087 IOPS
--
Damian Pietras
HardIT