We had a similar performance issue, but with networking (via Open vSwitch) instead of I/O.

Our hypervisor and VM configuration were like yours (vCPU pinning + isolcpus). We saw a 50% drop in virtualized networking throughput (measured via iperf).
This was because the vhost_net kthreads, which are responsible for the virtualized networking, were pinned to 2 cores per socket, and this quickly became the bottleneck. This was with OpenStack Queens and RHEL 7.6.
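
If you want to check for the same pattern, the vhost kthreads are named after the owning qemu PID, so something like the following shows where they run (the PID/TID values are placeholders, not from our setup):

# list the vhost kthreads of a given qemu process and the CPU each one last ran on
ps -eLo pid,tid,psr,comm | grep "vhost-<qemu_pid>"

# show the allowed-CPU list of one of those threads
taskset -cp <vhost_tid>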

We ended up keeping the VCPU pinning but removing the isolcpus kernel setting. This fixed the performance regression.
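
In case it helps, dropping it is just a kernel command line change (the commands below assume a typical Ubuntu or RHEL 7 grub setup; adjust for your deployment):

# confirm whether isolcpus is currently on the kernel command line
cat /proc/cmdline

# remove "isolcpus=..." from GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate grub.cfg and reboot
sudo update-grub && sudo reboot                              # Ubuntu
sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot   # RHEL 7 (BIOS boot)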

Unfortunately, we didn't investigate this further, so I don't know why a newer kernel and/or a newer OpenStack release improves it.

Hope this still helps

Best

Ümit

On 27.12.22, 13:33, "Can Özyurt" <acozyurt@gmail.com> wrote:

Hi everyone,

I hope you are all doing well. We are trying to pinpoint an IOPS problem with RBD and decided to ask you for your take on it.

1 control plane
1 compute node
5 storage nodes with 8 SSD disks each
OpenStack Stein/Ceph Mimic deployed with kolla-ansible on ubuntu-1804 (kernel 5.4)
isolcpus 4-127 on compute
vcpu_pin_set 4-127 in nova.conf

image_metadatas:
  hw_scsi_model: virtio-scsi
  hw_disk_bus: scsi
flavor_metadatas:
  hw:cpu_policy: dedicated
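
For reference, these settings look roughly like the following on the compute node (paths, flavor and image names are illustrative, not the exact ones we use):

# kernel command line
isolcpus=4-127

# nova.conf (Stein-style pinning)
[DEFAULT]
vcpu_pin_set = 4-127

# flavor and image properties
openstack flavor set <flavor> --property hw:cpu_policy=dedicated
openstack image set <image> --property hw_scsi_model=virtio-scsi --property hw_disk_bus=scsi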

What we have tested:

fio --directory=. --ioengine=libaio --direct=1 \
    --name=benchmark_random_read_write --filename=test_rand --bs=4k \
    --iodepth=32 --size=1G --readwrite=randrw --rwmixread=50 \
    --time_based --runtime=300s --numjobs=16

1. First we run the fio test above on a guest VM and see an average of 5K/5K read/write IOPS consistently. What we notice is that during the test a single core on the compute host is maxed out, and it is the first of the guest's pinned CPUs. 'top -Hp $qemupid' shows that some threads (notably tp_librbd) share that very same core throughout the test. (Also, the emulatorpin set equals the vcpupin set, as expected.)
2. We remove isolcpus and every other configuration stays the same. The fio tests now show 11K/11K read/write IOPS. There is no single bottlenecked CPU on the host, and the observed threads seem to visit all emulatorpin CPUs.
3. We bring isolcpus back and redeploy the cluster with Train/Nautilus on ubuntu-1804. Observations are identical to #1.
4. We tried replacing vcpu_pin_set with cpu_shared_set and cpu_dedicated_set to be able to pin the emulator cpuset to 0-4, to no avail (a config sketch follows this list). Multiple guests on a host can easily deplete the resources and IOPS drops.
5. Isolcpus is still in place and we deploy Ussuri with kolla-ansible and Train (to limit the moving parts) with ceph-ansible, both on ubuntu-1804. Now we see 7K/7K read/write IOPS.
6. We destroy only the compute node and boot it with ubuntu-2004 with isolcpus set. We add it back to the existing cluster, and fio shows slightly above 10K/10K read/write IOPS.
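
The Train-style layout we tried in #4 looks roughly like this (values are illustrative; the two sets must not overlap, and the exact ranges depend on the host topology):

# nova.conf on the compute node (Train or newer)
[compute]
cpu_shared_set = 0-3        # unpinned work, and emulator threads with the policy below
cpu_dedicated_set = 4-127   # pool used for pinned guest vCPUs

# let emulator threads float over cpu_shared_set instead of the guest's pinned CPUs
openstack flavor set <flavor> --property hw:emulator_threads_policy=share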

What we think happens:

1. Since isolcpus removes the given CPUs from the scheduler's load balancing, the qemu process and its threads get stuck on the same CPU, which creates the bottleneck. They should be runnable on any of the given emulatorpin CPUs (a quick check is sketched after this list).
2. Ussuri is more performant despite isolcpus, thanks to improvements made over time.
3. Ubuntu-2004 is more performant despite isolcpus, thanks to improvements made over time in the kernel.
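
A quick way to sanity-check assumption #1 during a fio run (commands are illustrative; substitute the real qemu PID and libvirt domain name):

# CPUs the kernel keeps out of scheduler load balancing
cat /sys/devices/system/cpu/isolated

# the CPU each qemu thread is actually running on
ps -Lo tid,psr,pcpu,comm -p <qemu_pid>

# the pinning libvirt thinks is in effect
virsh vcpupin <domain>
virsh emulatorpin <domain>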

Now the questions are:

1. What else are we missing here?
2. Are any of those assumptions false?
3. If they are all true, what can we do to solve this issue, given that we cannot upgrade OpenStack or Ceph in production overnight?
4. Has anyone dealt with this issue before?

We welcome any opinions and suggestions at this point, as we need to make sure that we are on the right path regarding the problem and that an upgrade is not the only solution. Thanks in advance.