Re: [ops][nova] RBD IOPS bottleneck on client-side
Thanks for the fast reply and for sharing your experience. We have considered removing isolcpus as well, but the idea of introducing noise into the guest workload is somewhat concerning. Also, constraining the dockerized deployment without isolcpus will not be as easy. We will definitely keep this option as a last resort.
On Wed, 28 Dec 2022 at 15:59, Ümit Seren <uemit.seren@gmail.com> wrote:
We had a similar performance issue with networking (via Open vSwitch) instead of I/O.
Our hypervisor and VM configuration were like yours (vCPU pinning + isolcpus). We saw a 50% drop in virtualized networking throughput (measured via iperf). This was because the vhost_net kthreads, which are responsible for the virtualized networking, were pinned to 2 cores per socket, and this quickly became the bottleneck. This was with OpenStack Queens and RHEL 7.6.
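(For anyone who wants to check this on their own hosts, the vhost kthreads and the core each one last ran on can be listed with something like:)

ps -eLo pid,psr,comm | grep vhost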
We ended up keeping the VCPU pinning but removing the isolcpus kernel setting. This fixed the performance regression.
Unfortunately, we didn't investigate this further, so I don't know why a newer kernel and/or a newer OpenStack release improves it.
Hope this still helps
Best
Ümit
On 27.12.22, 13:33, "Can Özyurt" <acozyurt@gmail.com> wrote:
Hi everyone,
I hope you are all doing well. We are trying to pinpoint an IOPS
problem with RBD and decided to ask you for your take on it.
1 control plane
1 compute node
5 storage nodes with 8 SSD disks each
OpenStack Stein / Ceph Mimic deployed with kolla-ansible on ubuntu-1804 (kernel 5.4)
isolcpus 4-127 on compute
vcpu_pin_set 4-127 in nova.conf
image_metadatas:
hw_scsi_model: virtio-scsi
hw_disk_bus: scsi
flavor_metadatas:
hw:cpu_policy: dedicated
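For reference, this is roughly how those settings are applied on our side (commands paraphrased; image and flavor names are placeholders):

# kernel command line on the compute node (other arguments omitted)
GRUB_CMDLINE_LINUX="... isolcpus=4-127"

# nova.conf on the compute node
[DEFAULT]
vcpu_pin_set = 4-127

# image and flavor properties
openstack image set --property hw_scsi_model=virtio-scsi --property hw_disk_bus=scsi <image>
openstack flavor set --property hw:cpu_policy=dedicated <flavor>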
What we have tested:
fio --directory=. --ioengine=libaio --direct=1 \
    --name=benchmark_random_read_write --filename=test_rand --bs=4k \
    --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50 \
    --time_based --runtime=300s --numjobs=16
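(For completeness, the guest and QEMU can also be taken out of the picture by pointing the same workload at RBD directly from the hypervisor with fio's rbd engine; the pool and image names below are placeholders:)

fio --ioengine=rbd --clientname=admin --pool=<pool> --rbdname=<test-image> \
    --direct=1 --bs=4k --iodepth=32 --rw=randrw --rwmixread=50 \
    --time_based --runtime=300s --name=rbd_baseline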
1. First, we run the fio test above on a guest VM and consistently see
an average of 5K/5K read/write IOPS. What we notice is that during the
test a single core on the compute host is maxed out, and it is the
first of the guest's pinned cpus. 'top -Hp $qemupid' shows that some
threads (notably tp_librbd) share that very same core throughout the
test. (Also, the emulatorpin set equals the vcpupin set, as expected;
see the commands after this list.)
2. We remove isolcpus and every other configuration stays the same.
The fio tests now show 11K/11K read/write IOPS. There is no longer a
single bottlenecked cpu on the host, and the observed threads seem to
move across all emulatorpin cpus.
3. We bring isolcpus back and redeploy the cluster with Train/Nautilus
on ubuntu-1804. Observations are identical to #1.
4. We tried replacing vcpu_pin_set with cpu_shared_set and
cpu_dedicated_set to be able to pin the emulator cpuset to 0-4, to no
avail. Multiple guests on a host can easily deplete those resources
and IOPS drops.
5. Isolcpus is still in place and we deploy Ussuri with kolla-ansible
and Train (to limit the moving parts) with ceph-ansible, both on
ubuntu-1804. Now we see 7K/7K read/write IOPS.
6. We destroy only the compute node and boot it with ubuntu-2004 with
isolcpus set. We add it back to the existing cluster, and fio shows
slightly above 10K/10K read/write IOPS.
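(The pinning and thread placement mentioned in observation 1 can be
checked along these lines; the instance name is a placeholder:)

# which host cores the vCPUs and the emulator thread are pinned to
virsh vcpupin <instance-name>
virsh emulatorpin <instance-name>

# which host core each QEMU thread (e.g. tp_librbd) last ran on
ps -T -o tid,psr,comm -p $qemupid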
What we think happens:
1. Since isolcpus excludes the listed cpus from kernel scheduler load
balancing, the qemu process and its threads get stuck on the same cpu,
which creates the bottleneck. They should be runnable on any of the
emulatorpin cpus (see the check after this list).
2. Ussuri is more performant despite isolcpus, thanks to the
improvements made in OpenStack over time.
3. Ubuntu-2004 is more performant despite isolcpus, thanks to the
improvements made in the kernel over time.
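(A quick way to test assumption 1 is to compare each QEMU thread's
allowed cpu list with the core it actually runs on, for example:)

# affinity mask of every QEMU thread; with isolcpus the scheduler will
# not balance them across these cpus, so they can pile up on one core
grep Cpus_allowed_list /proc/$qemupid/task/*/status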
Now the questions are:
1. What else are we missing here?
2. Are any of those assumptions false?
3. If all of the above is true, what can we do to solve this issue,
given that we cannot upgrade OpenStack or Ceph in production overnight?
4. Has anyone dealt with this issue before?
We welcome any opinions and suggestions at this point, as we need to
make sure that we are on the right path regarding the problem and that
an upgrade is not the only solution. Thanks in advance.
On Thu, 2022-12-29 at 12:30 +0300, Can Özyurt wrote:
Thanks for the fast reply and for sharing your experience.
We have considered removing isolcpus as well but the idea of introducing noise into guest workload is somewhat concerning. Also restraining dockerized deployment without isolcpus will not be as easy. We definitely keep this option as a last resort.
In our downstream product, and also as a general upstream recommendation, we discourage using isolcpus unless it is a realtime host. When isolcpus is used, you need to ensure that the qemu emulator thread runs on a core that does not overlap with the VM cpus. If you have a new enough nova that supports cpu_shared_set, then you can define that in your nova.conf and use the shared emulator threads policy; otherwise you will need to use the isolate policy for the emulator threads.

The emulator threads policy feature was introduced in Pike: https://specs.openstack.org/openstack/nova-specs/specs/pike/implemented/libv...

Note that if you are using Train or later, we changed the meaning of cpu_shared_set: https://specs.openstack.org/openstack/nova-specs/specs/train/implemented/cpu...

From Train on, its primary use, when combined with cpu_dedicated_set, is to define the cores that will be used for emulator threads and floating VMs, whereas before Train it was only for emulator threads. The cpu resources spec explains this change in more detail, but I would recommend using the new behaviour if your cloud supports it, as vcpu_pin_set will be removed in a future release (hopefully B or C if we get time).

The performance hit is caused by the emulator thread and the VM CPU competing for execution on the same core, and with isolcpus the emulator thread will not automatically float to one of the VM's cores that is idle, since the kernel scheduler is prevented from doing that by isolcpus. When using KVM to accelerate qemu, the emulation of the CPU is offloaded to the kvm kernel module, but the device emulation for storage/network devices like the virtio-blk or virtio-scsi controller is done on the emulator thread in the absence of iothreads (a qemu feature which nova does not support). As such, when using isolcpus, the deployer must ensure that the emulator thread and the VM cpus do not overlap, using the hw:cpu_emulator_threads extra spec and the config options.

If you are not running a realtime kernel, you should remove isolcpus. If you are, then you should correctly configure nova-compute with a pool of cpus to use for the emulator thread and update the flavor accordingly to use hw:cpu_emulator_threads=isolate|share.
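A minimal sketch of what that looks like with the Train-or-later semantics, assuming host cores 0-3 are the ones left out of isolcpus (the flavor name is a placeholder):

# nova.conf on the compute node (Train or later)
[compute]
# emulator threads and any unpinned guests run here
cpu_shared_set = 0-3
# pinned guest vCPUs come from here
cpu_dedicated_set = 4-127

# flavor: dedicated vCPUs, emulator threads moved to the shared set
openstack flavor set --property hw:cpu_policy=dedicated \
    --property hw:cpu_emulator_threads=share <flavor-name>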
participants (3):
- Can Özyurt
- Sean Mooney
- Ümit Seren