[nova] iothread support with Libvirt
Hi, I haven't found anything that indicates Nova supports adding iothread parameters to the Libvirt XML file. I asked various performance-related questions a couple of years back, including whether iothreads were available, but I didn't get any response (so I assumed the answer was no). So I'm just checking again to see whether this has been considered as a way to improve a VM's storage performance - specifically with extremely high-speed storage in the host. Or is there a way to add iothread-related parameters without Nova being involved (such as modifying a template)? Thanks! Eric
On Thu, 2022-01-06 at 00:12 -0600, Eric K. Miller wrote:
Hi,
I haven't found anything that indicates Nova supports adding iothread parameters to the Libvirt XML file. I asked various performance-related questions a couple of years back, including whether iothreads were available, but I didn't get any response (so I assumed the answer was no). So I'm just checking again to see whether this has been considered as a way to improve a VM's storage performance - specifically with extremely high-speed storage in the host.

Hi, up until recently the advice from our virt team was that iothreads were not really needed for OpenStack; however, in the last 6 weeks they have actually asked us to consider enabling them.
So work will be happening in QEMU/libvirt to always create at least one iothread going forward and affinitize it to the same set of cores as the emulator threads by default.

We don't have a downstream RFE currently filed for iothreads specifically, but we do have one for virtio-scsi multi-queue support: https://bugzilla.redhat.com/show_bug.cgi?id=1880273 I was proposing that we also enable iothread support as part of that work, but we have not currently prioritized it internally for any upstream release. Enabling support for iothreads and virtio multi-queue I think makes a lot of sense to do together. My understanding is that without iothreads, multi-queue virtio-scsi does not provide as much of a performance boost as with iothreads.

If you or others have capacity to work on this, I would be happy to work on a spec with you to enable it.

Effectively, what I was planning to propose, if we got around to it, is adding a new config option cpu_iothread_set which would default to the same value as cpu_shared_set. This effectively will ensure that, without any config updates, all existing deployments will start benefiting from iothreads, and it allows you to still dedicate a set of cores to running the iothreads separate from cpu_shared_set if you want this to also benefit floating VMs, not just pinned VMs.

In addition to that, a new flavor extra spec/image property would be added, similar to hw:cpu_emulator_threads. I'm not quite sure how that extra spec should work, but hw:cpu_iothread_policy would support the same values as hw:cpu_emulator_threads, where hw:cpu_iothread_policy=shared would allocate an iothread that floats over the cpu_iothread_set (which is the same as cpu_shared_set by default) and hw:cpu_iothread_policy=isolate would allocate an additional iothread from the cpu_dedicated_set. hw:cpu_iothread_policy=shared would be the default behavior if cpu_shared_set or cpu_iothread_set was defined in the config and no flavor extra spec or image property was defined. Basically, all VMs would have at least 1 iothread that floated over the shared pool if a shared pool was configured on the host.

That is option A. Option B would be to also support hw:cpu_iothread_count, so you could ask for N iothreads, either from the shared/iothread set or the dedicated set depending on the value of hw:cpu_iothread_policy. I'm not really sure if there is a need for more than 1 iothread. My understanding is that once you have at least 1 there are diminishing returns. More iothreads will improve your performance, provided you have multiple disks/volumes attached, but not as much as having the initial iothread.

Is this something you would be willing to work on and implement? I would be happy to review any spec in this area, and I can bring it up downstream again, but I can't commit to working on this in the Z release. This would require some minor RPC changes to ensure live migration works properly, as the iothread set or cpu_shared_set could be different on different hosts, but beyond that the feature is actually pretty simple to enable.
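To make the idea concrete, here is a rough sketch of how a deployment might use this if the proposal above were implemented as described. Note that cpu_iothread_set, hw:cpu_iothread_policy and hw:cpu_iothread_count are only proposed names from this thread and do not exist in Nova today; cpu_shared_set and cpu_dedicated_set are the existing options, and the flavor name is a placeholder.

# nova.conf on the compute host (cpu_iothread_set is hypothetical/proposed)
# [compute]
# cpu_shared_set = 0-7
# cpu_dedicated_set = 8-31
# cpu_iothread_set = 4-7      # proposed: cores the iothread(s) would float over

# Flavor using the proposed extra specs ("option A", plus the "option B" count)
# - neither property is recognized by current Nova releases
openstack flavor set io.heavy \
    --property hw:cpu_policy=dedicated \
    --property hw:cpu_iothread_policy=isolate \
    --property hw:cpu_iothread_count=1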
Or is there a way to add iothread-related parameters without Nova being involved (such as modifying a template)?
No, there is no way to enable them out of band of Nova today. You technically could wrap the qemu binary with a script that injects the parameters, but that obviously would not be supported upstream. It would be a workaround if you really needed it, though: https://review.opendev.org/c/openstack/devstack/+/817075 is an example of such a script. It breaks AppArmor and SELinux, but you could probably make it work with enough effort - although I would suggest just implementing the feature upstream and doing a downstream backport instead.
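Purely as an illustration of the wrapper idea (not the exact devstack script linked above), something along these lines could inject the option; the install path is an assumption and, as noted, this is unsupported and will likely need AppArmor/SELinux exceptions:

#!/bin/sh
# The real binary is renamed to qemu-kvm.orig and this wrapper takes its place.
# Note: creating the iothread object alone is not enough for it to be used;
# the disk's -device arguments would also need an iothread= property, which is
# much harder to inject generically.
exec /usr/libexec/qemu-kvm.orig "$@" -object iothread,id=iothread0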
Thanks!
Eric
Hi Sean, Thanks, as always, for your reply!
Hi, up until recently the advice from our virt team was that iothreads were not really needed for OpenStack; however, in the last 6 weeks they have actually asked us to consider enabling them.
I don't have the data to know whether iothreads improve performance or not. Rather, I made the assumption that a dedicated core for I/O would likely perform much better than without. If someone has any data on this, it would be extremely useful. The issue we are trying to resolve is related to high-speed local storage performance that is literally 10x, and sometimes 15x, slower in a VM than on the host. The local storage can reach upwards of 8GiB/sec and 1 million IOPS. It's not necessarily throughput we're after, though - it is latency, and the latency in QEMU/KVM is simply too high to get adequate storage performance inside a VM. If iothread(s) do not help, then the point of implementing the parameter in Nova is probably moot.
So work will be happening in QEMU/libvirt to always create at least one iothread going forward and affinitize it to the same set of cores as the emulator threads by default.
That sounds like a good idea, although I did read somewhere in the QEMU docs that not all drivers support iothreads, and trying to use them with unsupported drivers will likely crash QEMU - but I don't know how old those docs were. It seems reasonable since the "big QEMU lock" is not being used for the io thread(s).
We don't have a downstream RFE currently filed for iothreads specifically, but we do have one for virtio-scsi multi-queue support: https://bugzilla.redhat.com/show_bug.cgi?id=1880273
I found this old blueprint and implementation (that apparently was never accepted due to tests failing in various environments): https://blueprints.launchpad.net/nova/+spec/libvirt-iothreads https://review.opendev.org/c/openstack/nova/+/384871/
... to do together. My understanding is that without iothreads, multi-queue virtio-scsi does not provide as much of a performance boost as with iothreads.
I can imagine that being the case - since a spinning loop has super-low latency compared to an interrupt.
If you or others have capacity to work on this, I would be happy to work on a spec with you to enable it.
I wish I had the bandwidth to learn how, but since I'm not a Python developer, nor have a development environment ready to go (I'm mostly performing cloud operator and business support functions), I probably couldn't help much other than provide feedback.
Effectively, what I was planning to propose, if we got around to it, is adding a new config option cpu_iothread_set which would default to the same value as cpu_shared_set. This effectively will ensure that, without any config updates, all existing deployments will start benefiting from iothreads, and it allows you to still dedicate a set of cores to running the iothreads separate from cpu_shared_set if you want this to also benefit floating VMs, not just pinned VMs.
I would first suggest asking the QEMU folks whether there are incompatibilities between iothreads and certain storage drivers that could cause issues if iothreads were enabled by default. I suggest a more cautious approach: leave the default as-is and allow users to enable iothreads themselves. The default could always be changed later if there isn't any negative feedback from those who tried using iothreads.
In addition to that, a new flavor extra spec/image property would be added, similar to hw:cpu_emulator_threads.
I'm not quite sure how that extra spec should work, but hw:cpu_iothread_policy would support the same values as hw:cpu_emulator_threads, where hw:cpu_iothread_policy=shared would allocate an iothread that floats over the cpu_iothread_set (which is the same as cpu_shared_set by default) and hw:cpu_iothread_policy=isolate would allocate an additional iothread from the cpu_dedicated_set. hw:cpu_iothread_policy=shared would be the default behavior if cpu_shared_set or cpu_iothread_set was defined in the config and no flavor extra spec or image property was defined. Basically, all VMs would have at least 1 iothread that floated over the shared pool if a shared pool was configured on the host.
I will have to review this more carefully, when I have a bit more time.
That is option A. Option B would be to also support hw:cpu_iothread_count, so you could ask for N iothreads, either from the shared/iothread set or the dedicated set depending on the value of hw:cpu_iothread_policy.
I'm not really sure if there is a need for more than 1 iothread. My understanding is that once you have at least 1 there are diminishing returns. More iothreads will improve your performance, provided you have multiple disks/volumes attached, but not as much as having the initial iothread.
I would guess that multiple iothreads would benefit multiple VMs, where each VM would use its own I/O thread/dedicated core. So, I think providing the possibility for multiple iothreads should be considered, with assignment of these threads to individual VMs. However, this brings up a significantly more complex resource allocation requirement, not to mention resource allocation during live migration.
Is this something you would be willing to work on and implement? I would be happy to review any spec in this area, and I can bring it up downstream again, but I can't commit to working on this in the Z release. This would require some minor RPC changes to ensure live migration works properly, as the iothread set or cpu_shared_set could be different on different hosts, but beyond that the feature is actually pretty simple to enable.
I think we need to do some testing to prove the performance benefits first - before spending the time to implement.
No, there is no way to enable them out of band of Nova today. You technically could wrap the qemu binary with a script that injects the parameters, but that obviously would not be supported upstream. It would be a workaround if you really needed it, though:
https://review.opendev.org/c/openstack/devstack/+/817075 is an example of such a script. It breaks AppArmor and SELinux, but you could probably make it work with enough effort - although I would suggest just implementing the feature upstream and doing a downstream backport instead.
Interesting - maybe I can hack this for testing and proof-of-concept purposes. Thanks for the suggestion! I'll see if we can figure out how to test iothreads in our environment where the high-speed local storage exists. Eric
No, there is no way to enable them out of band of Nova today. You technically could wrap the qemu binary with a script that injects the parameters, but that obviously would not be supported upstream. It would be a workaround if you really needed it, though:
https://review.opendev.org/c/openstack/devstack/+/817075 is an example of such a script
I created a modified version of your script to wrap the qemu-kvm executable, but when OpenStack starts the VM, Nova returns:

2022-01-06 16:15:24.758 6 ERROR nova.compute.manager libvirtError: internal error: Failed to probe QEMU binary with QMP: qemu-kvm.orig: -object iothread,id=iothread0: invalid option

"-object iothread,id=iothread0" is the first argument.

Our Libvirt/QEMU versions are:
Compiled against library: libvirt 4.5.0
Using library: libvirt 4.5.0
Using API: QEMU 4.5.0
Running hypervisor: QEMU 2.12.0

I'm pretty sure these versions include support for iothreads (for both QEMU and Libvirt). Is Libvirt doing some form of cross-check on the XML parameters with the running QEMU parameters that is incompatible with the wrapper, perhaps?

Eric
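One possible explanation for the failure above: libvirt launches the binary itself to probe its capabilities (that is the "Failed to probe QEMU binary with QMP" part), and injecting extra arguments unconditionally can confuse that probe. A hedged refinement of the wrapper is to only inject when the command line looks like a real guest launch; the "-name guest=" test below is just a heuristic, not something taken from the linked script:

#!/bin/sh
# Pass capability probes, -version/-help calls, etc. through untouched and
# only append the iothread when libvirt is starting an actual domain.
case " $* " in
    *"-name guest="*)
        exec /usr/libexec/qemu-kvm.orig "$@" -object iothread,id=iothread0
        ;;
    *)
        exec /usr/libexec/qemu-kvm.orig "$@"
        ;;
esac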
Out of curiosity, how are you passing the local storage to the VM? I would also assume a performance hit when using a VM, but local storage (instead of Ceph, iSCSI, NFS) should still perform well?
On Thu, Jan 6, 2022 at 5:44 PM Eric K. Miller <emiller@genesishosting.com> wrote:
No, there is no way to enable them out of band of Nova today. You technically could wrap the qemu binary with a script that injects the parameters, but that obviously would not be supported upstream. It would be a workaround if you really needed it, though:
https://review.opendev.org/c/openstack/devstack/+/817075 is an example of such a script
I created a modified version of your script to wrap the qemu-kvm executable, but when OpenStack starts the VM, Nova returns:
2022-01-06 16:15:24.758 6 ERROR nova.compute.manager libvirtError: internal error: Failed to probe QEMU binary with QMP: qemu-kvm.orig: -object iothread,id=iothread0: invalid option
"-object iothread,id=iothread0" is the first argument.
Our Libvirt/QEMU versions are:
Compiled against library: libvirt 4.5.0
Using library: libvirt 4.5.0
Using API: QEMU 4.5.0
Running hypervisor: QEMU 2.12.0
I'm pretty sure these versions include support for iothreads (for both QEMU and Libvirt).
Is Libvirt doing some form of cross-check on the XML parameters with the running QEMU parameters that is incompatible with the wrapper perhaps?
Eric
Out of curiosity, how are you passing the local storage to the VM? I would also assume a performance hit when using a VM, but local storage (instead of Ceph, iSCSI, NFS) should still perform well?
We're using an LVM logical volume on 4 x Micron 9300's in a RAID 0 configuration using md. We're using the standard Nova "image_type=lvm" option. The performance is good, relative to something as slow as Ceph, but the performance hit is pretty significant compared to the host's performance. It is essentially IOPS limited. I can run some tests and provide results if you're interested. Eric
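For anyone wanting to reproduce a similar setup, the relevant pieces look roughly like this (a sketch based on the description above; device names and the volume group name are placeholders, and the Nova option is spelled images_type in current releases):

# Build the md RAID 0 set and put an LVM volume group on it
# (nvme0n1..nvme3n1 stand in for the four Micron 9300s)
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
pvcreate /dev/md0
vgcreate nova-vg /dev/md0

# nova.conf on the compute host - Nova then carves a logical volume
# out of this volume group for each instance's ephemeral/root disk
# [libvirt]
# images_type = lvm
# images_volume_group = nova-vg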
For sure! I would be curious to see the benchmarks. I haven't deployed anything with LVM but I'm surprised that the cost is so high. There is a pretty thin line between the VM --> Libvirt --> Qemu --> LVM. I would expect the performance to be close to baremetal, but lower of course.
On Thu, Jan 6, 2022 at 6:51 PM Eric K. Miller <emiller@genesishosting.com> wrote:
Out of curiosity, how are you passing the local storage to the VM? I would also assume a performance hit when using a VM, but local storage (instead of Ceph, iSCSI, NFS) should still perform well?
We're using an LVM logical volume on 4 x Micron 9300's in a RAID 0 configuration using md. We're using the standard Nova "image_type=lvm" option.
The performance is good, relative to something as slow as Ceph, but the performance hit is pretty significant compared to the host's performance. It is essentially IOPS limited. I can run some tests and provide results if you were interested.
Eric
For sure! I would be curious to see the benchmarks. I haven't deployed anything with LVM but I'm surprised that the cost is so high.
No problem. I'll work on this tonight.
There is a pretty thin line between the VM --> Libvirt --> Qemu --> LVM. I would expect the performance to be close to baremetal, but lower of course.
The bare metal test is using a logical volume on the same LVM volume group as OpenStack, so it is an easy comparison of VM versus bare metal. The fact that the bare metal test is ridiculously fast shows that there is significant latency in QEMU/KVM somewhere. I have seen others that mention they get a maximum of about 190k IOPS, which is still quite a bit, but that is slowly becoming easy to achieve with the latest SSDs, even when writing small random blocks. Eric
Hi Laurent,

I thought I may have already done some benchmarks, and it looks like I did, long ago, for the discussion that I created a couple of years ago (on August 6, 2020 to be exact). I copied the results from that email below. You can see that the latency difference is pretty significant (13.75x with random 4KiB reads) between bare metal and a VM, which is about the same as the difference in IOPS. Writes are not quite as bad a difference, at 8.4x.

Eric

Some numbers from fio, just to get an idea for how good/bad the IOPS will be:

Configuration:
32 core EPYC 7502P with 512GiB of RAM - CentOS 7 latest updates - Kolla Ansible (Stein) deployment
32 vCPU VM with 64GiB of RAM
32 x 10GiB test files (I'm using file tests, not raw device tests, so not optimal, but easiest when the VM root disk is the test disk)
iodepth=10
numofjobs=32
time=30 (seconds)

The VM was deployed using a qcow2 image, then deployed as a raw image, to see the difference in performance. There was none, which makes sense, since I'm pretty sure the qcow2 image was decompressed and stored in the LVM logical volume - so both tests were measuring the same thing.

Bare metal (random 4KiB reads): 8066MiB/sec, 154.34 microsecond avg latency, 2.065 million IOPS
VM qcow2 (random 4KiB reads): 589MiB/sec, 2122.10 microsecond avg latency, 151k IOPS

Bare metal (random 4KiB writes): 4940MiB/sec, 252.44 microsecond avg latency, 1.265 million IOPS
VM qcow2 (random 4KiB writes): 589MiB/sec, 2119.16 microsecond avg latency, 151k IOPS

Since the read and write VM results are nearly identical, my assumption is that the emulation layer is the bottleneck. CPUs in the VM were all at 55% utilization (all kernel usage). The qemu process on the bare metal machine indicated 1600% (or so) CPU utilization.

Below are runs with sequential 1MiB block tests:

Bare metal (sequential 1MiB reads): 13.3GiB/sec, 23446.43 microsecond avg latency, 13.7k IOPS
VM qcow2 (sequential 1MiB reads): 8378MiB/sec, 38164.52 microsecond avg latency, 8377 IOPS

Bare metal (sequential 1MiB writes): 8098MiB/sec, 39488.00 microsecond avg latency, 8097 IOPS
VM qcow2 (sequential 1MiB writes): 8087MiB/sec, 39534.96 microsecond avg latency, 8087 IOPS
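For anyone who wants to run a comparable test, an fio invocation approximating the settings above might look like this (a sketch, not the exact command used for the numbers in this thread; the target directory, file size and job name are placeholders that would need adjusting):

# Random 4KiB read test, roughly matching the parameters listed above
fio --name=randread-test \
    --rw=randread --bs=4k \
    --ioengine=libaio --direct=1 \
    --iodepth=10 --numjobs=32 \
    --size=10G --runtime=30 --time_based \
    --directory=/mnt/test --group_reporting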
Super interesting, thank you. The random I/O and throughput performance degradation is pretty obvious :( Are these NVMe/SSDs in hardware RAID?
On Thu, Jan 6, 2022 at 10:54 PM Eric K. Miller <emiller@genesishosting.com> wrote:
Hi Laurent,
I thought I may have already done some benchmarks, and it looks like I did, long ago, for the discussion that I created a couple years ago (on August 6, 2020 to be exact).
I copied the results from that email below. You can see that the latency difference is pretty significant (13.75x with random 4KiB reads) between bare metal and a VM, which is about the same as the difference in IOPS. Writes are not quite as bad a difference, at 8.4x.
Eric
Some numbers from fio, just to get an idea for how good/bad the IOPS will be:
Configuration:
32 core EPYC 7502P with 512GiB of RAM - CentOS 7 latest updates - Kolla Ansible (Stein) deployment
32 vCPU VM with 64GiB of RAM
32 x 10GiB test files (I'm using file tests, not raw device tests, so not optimal, but easiest when the VM root disk is the test disk)
iodepth=10
numofjobs=32
time=30 (seconds)
The VM was deployed using a qcow2 image, then deployed as a raw image, to see the difference in performance. There was none, which makes sense, since I'm pretty sure the qcow2 image was decompressed and stored in the LVM logical volume - so both tests were measuring the same thing.
Bare metal (random 4KiB reads): 8066MiB/sec 154.34 microsecond avg latency 2.065 million IOPS
VM qcow2 (random 4KiB reads): 589MiB/sec 2122.10 microsecond avg latency 151k IOPS
Bare metal (random 4KiB writes): 4940MiB/sec 252.44 microsecond avg latency 1.265 million IOPS
VM qcow2 (random 4KiB writes): 589MiB/sec 2119.16 microsecond avg latency 151k IOPS
Since the read and write VM results are nearly identical, my assumption is that the emulation layer is the bottleneck. CPUs in the VM were all at 55% utilization (all kernel usage). The qemu process on the bare metal machine indicated 1600% (or so) CPU utilization.
Below are runs with sequential 1MiB block tests
Bare metal (sequential 1MiB reads): 13.3GiB/sec 23446.43 microsecond avg latency 13.7k IOPS
VM qcow2 (sequential 1MiB reads): 8378MiB/sec 38164.52 microsecond avg latency 8377 IOPS
Bare metal (sequential 1MiB writes): 8098MiB/sec 39488.00 microsecond avg latency 8097 IOPS
VM qcow2 (sequential 1MiB writes): 8087MiB/sec 39534.96 microsecond avg latency 8087 IOPS
Hi,

We've actually hit latency issues with local disks (image-based storage) this week and I've performed multiple benchmarks with various options. Our goal is to have the best latency / IOPS with random synchronous 8K writes on local NVMe with a queue depth of 1 (this is what our DB is doing). Our writes are synchronous, so my numbers will be much lower than your 4K random writes.

Our hardware:
2x INTEL Xeon Silver 4214R
16x 16GB DDR4
2x NVMe - WD SN630 3.2TB in RAID0 (using LVM)

The VM is a Debian 9 image with hw_disk_bus=scsi set in metadata.

With our setup we started with 4800 IOPS and ~0.3ms latency with standard settings and went to 17.8K IOPS with ~0.054ms latency after some optimizations. Here are the settings that resulted in different performance data (a consolidated example of these settings is sketched after the quoted results below):

- Change the I/O scheduler to noop in the VM (echo 'noop' > /sys/block/sda/queue/scheduler).
- Set scaling_governor=performance on the compute host, from the default "schedutil". I've noticed this is the most significant change with a queue depth of 1 when there is no other load on the host. Alternatively, putting artificial CPU load on the VM while running the benchmark also improves I/O latency. I guess keeping CPU clocks higher, either with the governor setting or artificial CPU usage, has a significant impact. This may also prevent the CPU from going into deeper C-states, but I did not investigate that further.
- Set io='native' in the libvirt configuration. This is set automatically in OpenStack when you use preallocated images (https://docs.openstack.org/nova/xena/configuration/config.html#DEFAULT.preal...).
- Use LVM-backed images instead of thin-provisioned qcow2, as you've already tried.
- Change the "bus" parameter to "virtio" instead of scsi.

I did not perform a benchmark with all those changes combined because we achieved the required performance. After that we only set the I/O scheduler to noop, and will probably rely on CPU load in production to keep the CPU busy and prevent it from going into deeper C-states and lowering the CPU clock.

On 07.01.2022 04:54, Eric K. Miller wrote:
Hi Laurent,
I thought I may have already done some benchmarks, and it looks like I did, long ago, for the discussion that I created a couple years ago (on August 6, 2020 to be exact).
I copied the results from that email below. You can see that the latency difference is pretty significant (13.75x with random 4KiB reads) between bare metal and a VM, which is about the same as the difference in IOPS. Writes are not quite as bad a difference, at 8.4x.
Eric
Some numbers from fio, just to get an idea for how good/bad the IOPS will be:
Configuration:
32 core EPYC 7502P with 512GiB of RAM - CentOS 7 latest updates - Kolla Ansible (Stein) deployment
32 vCPU VM with 64GiB of RAM
32 x 10GiB test files (I'm using file tests, not raw device tests, so not optimal, but easiest when the VM root disk is the test disk)
iodepth=10
numofjobs=32
time=30 (seconds)
The VM was deployed using a qcow2 image, then deployed as a raw image, to see the difference in performance. There was none, which makes sense, since I'm pretty sure the qcow2 image was decompressed and stored in the LVM logical volume - so both tests were measuring the same thing.
Bare metal (random 4KiB reads): 8066MiB/sec 154.34 microsecond avg latency 2.065 million IOPS
VM qcow2 (random 4KiB reads): 589MiB/sec 2122.10 microsecond avg latency 151k IOPS
Bare metal (random 4KiB writes): 4940MiB/sec 252.44 microsecond avg latency 1.265 million IOPS
VM qcow2 (random 4KiB writes): 589MiB/sec 2119.16 microsecond avg latency 151k IOPS
Since the read and write VM results are nearly identical, my assumption is that the emulation layer is the bottleneck. CPUs in the VM were all at 55% utilization (all kernel usage). The qemu process on the bare metal machine indicated 1600% (or so) CPU utilization.
Below are runs with sequential 1MiB block tests
Bare metal (sequential 1MiB reads): 13.3GiB/sec 23446.43 microsecond avg latency 13.7k IOPS
VM qcow2 (sequential 1MiB reads): 8378MiB/sec 38164.52 microsecond avg latency 8377 IOPS
Bare metal (sequential 1MiB writes): 8098MiB/sec 39488.00 microsecond avg latency 8097 IOPS
VM qcow2 (sequential 1MiB writes): 8087MiB/sec 39534.96 microsecond avg latency 8087 IOPS
--
Damian Pietras
HardIT
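As referenced above, here is a consolidated sketch of the host/guest tuning Damian describes. The device name and sysfs paths are the usual ones but should be verified for your distro; this is illustrative, not a recommendation to apply blindly:

# Inside the guest: switch the root disk's I/O scheduler to noop
echo 'noop' > /sys/block/sda/queue/scheduler

# On the compute host: run all cores with the performance governor
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# nova.conf: preallocating images makes Nova use io='native' in the
# generated libvirt XML (and space-preallocates the image)
# [DEFAULT]
# preallocate_images = space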
Hi Damian,
With our setup we started with 4800 IOPS and ~0.3ms latency with standard settings and went to 17.8K IOPS with ~0.054ms latency after some optimizations. Here are the settings that resulted in different performance data:
- change I/O scheduler to noop in VM (echo 'noop' > /sys/block/sda/queue/scheduler)
Thank you for the info!

It appears that the "noop" scheduler merges requests, so you are likely getting between 3 and 4 I/O command merges per command to go from 4800 to 17800 IOPS. I'll have to check on this to see if that changes anything on this end, since I thought that the default scheduler also performed command merging.

Regarding sleep states, you may want to look at the power management functions in the BIOS. If you have "energy efficient" settings, this will definitely have an impact on latency, but as you noticed, the governor can also override some of these sleep states if you set it to performance.

We did a little more testing with iothreads on our Proxmox systems, since it is easy to enable/disable this on a virtual disk. The performance difference on both a relatively idle compute node and VM is extremely small (barely noticeable). With a busy VM, it may make a difference, but we haven't had time to test. So, all the work involved in enabling iothreads in OpenStack may not be worth it.

One of our storage vendors had done some testing as well long ago, and they indicated that, to benefit from iothreads, dedicated cores should be used for the iothreads, which creates a bit more resource allocation complexity in OpenStack, especially if live migration is required.

Eric
It appears that the "noop" scheduler merges requests, so you are likely getting between 3 and 4 I/O command merges per command to go from 4800 to 17800 IOPS. I'll have to check on this to see if that changes anything on this end, since I thought that the default scheduler also performed command merging.
It appears that "deadline" is the only scheduler available in later kernel versions (at least in Ubuntu). Damian - do you recall what the scheduler was set to prior to changing it to the noop scheduler? Eric
On 12.01.2022 22:10, Eric K. Miller wrote:
It appears that "deadline" is the only scheduler available in later kernel versions (at least in Ubuntu).
Damian - do you recall what the scheduler was set to prior to changing it to the noop scheduler?
The default in my case (Debian 9) is:

root@diskbench:~# cat /sys/block/sda/queue/scheduler
noop deadline [cfq]

And the disk is detected by the kernel as "rotational":

root@diskbench:~# cat /sys/block/sda/queue/rotational
1

--
Damian Pietras
HardIT