[nova] Slow nvme performance for local storage instances

Sven Kieske kieske at osism.tech
Mon Aug 14 15:29:41 UTC 2023


Hi,

Am Montag, dem 14.08.2023 um 14:37 +0200 schrieb Jan Wasilewski:
> [2] fio results of OpenStack managed instance with "vdb" attached:
> https://paste.openstack.org/show/bViUpJTf7UYpsRyGCAt9/
> [3] dumpxml of Libvirt managed instance with "vdb" attached:
> https://paste.openstack.org/show/bGv8dT1l2QaTiAybYrJi/
> [4] fio results of Libvirt managed instance with "vdb" attached:
> https://paste.openstack.org/show/bOzYXkbco0oDfgaD0co8/
> [5] xml configuration of vdb drive:
> https://paste.openstack.org/show/bAJ9MyEWEGOteeJnH5D8/

One difference I can see in the fio results is that the OpenStack
provided VM does a lot more context switches and has a different CPU
usage profile in general:

Openstack Instance:

  cpu          : usr=27.16%, sys=62.24%, ctx=3246653, majf=0, minf=14

plain libvirt instance:

  cpu          : usr=15.75%, sys=56.31%, ctx=2860657, majf=0, minf=15

This indicates that some other workload is running there, or at least
that work is scheduled differently than on the plain libvirt machine.
One example to check might be the IRQ balancing across different
cores, but I can't remember atm if this is fixed already on this
kernel release (iirc in the past you used to run the irqbalance daemon,
which became obsolete after kernel 4.19 according to
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=926967 )
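
If you want to check that quickly on the hypervisor, something along
these lines should do (plain systemd and procfs, nothing OpenStack
specific; the "nvme" match is just an assumption about how the queue
interrupts are named on your box):

  # is the irqbalance daemon (still) running?
  systemctl status irqbalance

  # per-CPU interrupt counters for the nvme queues; heavily skewed
  # columns would mean most interrupts land on a few cores
  grep -i nvme /proc/interrupts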

How many other VMs are running on that OpenStack hypervisor?

I hope the hypervisor is not oversubscribed? You can easily see this
in any modern variant of "top", which reports stolen CPU cycles; if
you see CPU steal, your CPUs are oversubscribed.
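
From inside the guest a quick non-interactive check could look like
this (standard procps tools):

  # the "st" value in the Cpu(s) line is the steal time
  top -bn1 | grep -i "cpu(s)"

  # vmstat reports steal in the last ("st") column
  vmstat 1 5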

Depending on the deployment, you will of course also incur additional
overhead from other OpenStack services - beginning with nova - which
might account for the additional context switches on the hypervisor.

In general, 3 million context switches are not that many and should not
impact performance much, but it's still a noticeable difference
between the two systems.
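
If you want to see where those extra context switches come from,
pidstat from the sysstat package can break them down per process on
the hypervisor (just a suggestion, the fio output alone can't answer
that):

  # voluntary/involuntary context switches per process,
  # three 5-second samples
  pidstat -w 5 3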

Are the CPU models on the hypervisors exactly the same? I can't tell
from the libvirt dumps, but I notice that certain CPU flags are
explicitly set for the libvirt managed instance, which might affect the
end result.
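
To compare them I would look at both the host and the guest definition
(the domain name below is a placeholder):

  # host side, run on both hypervisors
  lscpu | grep -E 'Model name|Flags'

  # guest CPU definition as libvirt sees it
  virsh dumpxml <domain> | grep -A 20 '<cpu'

On the nova side the guest CPU model is driven by the [libvirt]
cpu_mode / cpu_models options, so it's worth checking whether that
matches what the plain libvirt guest gets.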

What bothers me more is that the libvirt provided VM
has a total CPU usage of roughly 70% whereas the OpenStack provided
one is closer to 90%.

This leads me to believe that one of the following is true:

- the hypervisor CPUs differ in a meaningful way, performance-wise.
- the hypervisor is somehow oversubscribed / has more work to do for
the OpenStack deployed server, which results in worse benchmarks and
more CPU being burnt by constantly evicting the task from the lower
level L1/L2 CPU caches.
- the context switches eat up significant CPU performance on the
OpenStack instance (least likely imho).

What would also be interesting to know is whether mq-deadline and
multiqueue are enabled in the plain libvirt machine (and whether the
libvirt and qemu versions are the same as in the OpenStack deployment).

You can check this as described here:

https://bugzilla.redhat.com/show_bug.cgi?id=1827722

But I don't see "num_queues" or "queues" mentioned anywhere, so I
assume it's turned off. Enabling it could also boost your performance
by a lot.
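
A quick way to verify, assuming the disk shows up as vdb inside the
guest (the domain name is again a placeholder):

  # inside the guest: the scheduler in [brackets] is the active one
  cat /sys/block/vdb/queue/scheduler

  # on the hypervisor: virtio-blk multiqueue shows up as a queues=
  # attribute on the disk's <driver> element (num_queues is the
  # corresponding qemu property for virtio-scsi)
  virsh dumpxml <domain> | grep -B 2 -A 2 '<driver'

  # and while you are at it, compare the libvirt/qemu versions
  virsh version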

Another thing to check - especially since I noticed the CPU differences
- would be the NUMA layout of the hypervisor and how the VM is affected
by it.
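
numactl and virsh should tell you most of what you need here (domain
name is a placeholder again):

  # host NUMA topology and per-node free memory
  numactl --hardware

  # how (and whether) the guest's vCPUs and memory are pinned
  virsh vcpupin <domain>
  virsh numatune <domain>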

-- 
Sven Kieske
Senior Cloud Engineer

Mail: kieske at osism.tech
Web: https://osism.tech

OSISM GmbH
Teckstraße 62 / 70190 Stuttgart / Deutschland

Geschäftsführer: Christian Berendt
Unternehmenssitz: Stuttgart
Amtsgericht: Stuttgart, HRB 756139



