Hi,

On Monday, 14.08.2023 at 14:37 +0200, Jan Wasilewski wrote:
> [2] fio results of OpenStack managed instance with "vdb" attached: https://paste.openstack.org/show/bViUpJTf7UYpsRyGCAt9/
> [3] dumpxml of Libvirt managed instance with "vdb" attached: https://paste.openstack.org/show/bGv8dT1l2QaTiAybYrJi/
> [4] fio results of Libvirt managed instance with "vdb" attached: https://paste.openstack.org/show/bOzYXkbco0oDfgaD0co8/
> [5] xml configuration of vdb drive: https://paste.openstack.org/show/bAJ9MyEWEGOteeJnH5D8/
One difference I can see in the fio results is that the OpenStack-provided VM does a lot more context switches and has a different CPU usage profile in general:

OpenStack instance:
  cpu : usr=27.16%, sys=62.24%, ctx=3246653, majf=0, minf=14

plain libvirt instance:
  cpu : usr=15.75%, sys=56.31%, ctx=2860657, majf=0, minf=15

This indicates that some other workload is running there, or that work is at least scheduled differently than on the plain libvirt machine. One thing to check might be the IRQ balancing across cores, but I can't remember at the moment whether this is already handled by this kernel release (IIRC, in the past you used to run the irqbalance daemon, which became obsolete after kernel 4.19 according to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=926967).

How many other VMs are running on that OpenStack hypervisor? I hope the hypervisor is not oversubscribed. You can easily see this in a modern variant of "top", which reports stolen CPU cycles; if you see CPU steal, your CPUs are oversubscribed.

Depending on the deployment, you will of course also incur additional overhead from other OpenStack services - beginning with nova - which might account for the additional context switches on the hypervisor. In general, 3 million context switches is not that much and should not impact performance much, but it's still a noticeable difference between the two systems.

Are the CPU models on the hypervisors exactly the same? I can't tell from the libvirt dumps, but I notice that certain CPU flags are explicitly set for the libvirt-managed instance, which might affect the end result.

What bothers me more is that the libvirt-provided VM has a total CPU usage of roughly 70%, whereas the OpenStack-provided one is closer to 90%. This leads me to believe that one of the following is true:

- the hypervisor CPUs differ in a meaningful way, performance-wise.
- the hypervisor is somehow oversubscribed / has more work to do for the OpenStack-deployed server, which results in worse benchmarks / more CPU being burnt by constantly evicting the task from the lower-level L1/L2 CPU caches.
- the context switches eat up significant CPU performance on the OpenStack instance (least likely, IMHO).

It would also be interesting to know whether mq-deadline and multi-queue are enabled in the plain libvirt machine (are the libvirt and qemu versions the same as in the OpenStack deployment?). You can check this as described here: https://bugzilla.redhat.com/show_bug.cgi?id=1827722 But I don't see "num_queues" or "queues" mentioned anywhere, so I assume it's turned off. Enabling it could boost your performance by a lot.

Another thing to check - especially since I noticed the CPU differences - would be the NUMA layout of the hypervisor and how the VM is affected by it.

I've put a few rough example commands for checking these points below my signature.

-- 
Sven Kieske
Senior Cloud Engineer

Mail: kieske@osism.tech
Web: https://osism.tech

OSISM GmbH
Teckstraße 62 / 70190 Stuttgart / Germany

Managing Director: Christian Berendt
Registered office: Stuttgart
Register court: Stuttgart, HRB 756139
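
P.S.: A few rough example commands for the checks above; output and package names may differ on your distro, so treat these as a sketch. To spot CPU steal and uneven IRQ distribution inside the guest:

  # "st" column/field shows CPU time stolen by the hypervisor
  vmstat 1 5
  top -b -n 1 | head -n 5

  # how the virtio interrupts are spread across the vCPUs
  grep -i virtio /proc/interrupts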
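
For the multi-queue question, inside the guest you can look at the block layer directly, and on the hypervisor at the generated XML (replace <domain> with the instance's libvirt domain name). If I remember the libvirt syntax correctly, multi-queue virtio-blk shows up as a "queues" attribute on the disk's <driver> element:

  # inside the guest: active I/O scheduler and number of blk-mq queues for vdb
  cat /sys/block/vdb/queue/scheduler
  ls /sys/block/vdb/mq/

  # on the hypervisor: does the disk's <driver> element carry a queues attribute,
  # e.g. <driver name='qemu' type='raw' cache='none' io='native' queues='4'/> ?
  virsh dumpxml <domain> | grep -i -B2 -A2 queues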
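
To compare the hypervisor CPUs and the flags the guests actually see:

  # on both hypervisors: model and flags
  lscpu | grep -E 'Model name|Flags'

  # what the guest was given (cpu mode, model, explicit features)
  virsh dumpxml <domain> | grep -A10 '<cpu'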
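
And for the NUMA layout and how the instance is placed on it:

  # on the hypervisor: NUMA topology
  numactl --hardware
  lscpu | grep -i numa

  # where the instance's vCPUs and memory are pinned/placed
  virsh vcpupin <domain>
  virsh numatune <domain>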