On 12/3/2019 11:24 AM, Erik Olof Gunnar Andersson wrote:
For us nova used to be the biggest concern, but a lot of work has been done and nova now performs great. Instead we are having trouble getting Neutron to perform at scale. Obvious calls like security groups are performing really poorly, and the nova-compute defaults for refreshing the network cache on computes cause massive issues with Neutron.
I wonder how much of the performance hit is due to rootwrap usage in neutron (nova's conversion to privsep was completed in Train).

Nova might be the bee's knees, but I know there are things in nova we could do to be smarter about not hammering the neutron API as much, e.g.:

https://review.opendev.org/#/c/465792/ - make bulk queries to neutron when refreshing the instance network info cache

https://review.opendev.org/#/q/I7de14456d04370c842b4c35597dca3a628a826a2 - be smarter about filtering to avoid expensive joins

https://bugs.launchpad.net/nova/+bug/1567655 - nova's internal network info cache only stores information about ports and their related networks/subnets/ips but the security group information related to the ports attached to a server is fetched directly anytime it's needed, including when listing servers with details. So if you're an admin listing all servers across all tenants, that could get pretty slow. I've long thought we should cache the security group information like we do for ports for read-only operations like GET /servers/detail but it's a non-trivial amount of work to make that happen and we'd definitely want benchmarks and such to justify the change.

Note ttx has started a large ops SIG or whatever so this is probably something to discuss there:

https://wiki.openstack.org/wiki/Large_Scale_SIG

--

Thanks,

Matt
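
As a rough illustration of why bulk queries help when listing many servers, here is a toy sketch. It does not use nova or neutron code: FakeNeutron, list_security_groups and both helper functions are hypothetical names made up for this example, and the only thing it models is the number of API round trips.

```python
class FakeNeutron:
    """Hypothetical stand-in for a Neutron client; it only counts round trips."""

    def __init__(self, port_to_groups):
        self._port_to_groups = port_to_groups
        self.calls = 0

    def list_security_groups(self, port_ids):
        # One round trip, no matter how many port IDs are passed in.
        self.calls += 1
        return {p: self._port_to_groups.get(p, []) for p in port_ids}


def groups_per_server_naive(client, servers):
    """One API call per server: N servers cost N round trips."""
    result = {}
    for name, ports in servers.items():
        by_port = client.list_security_groups(ports)
        result[name] = sorted({g for p in ports for g in by_port[p]})
    return result


def groups_per_server_bulk(client, servers):
    """Collect every port ID first, then ask once for all of them."""
    all_ports = [p for ports in servers.values() for p in ports]
    by_port = client.list_security_groups(all_ports)
    return {
        name: sorted({g for p in ports for g in by_port[p]})
        for name, ports in servers.items()
    }
```

Both helpers return the same mapping of server name to security group names; the difference is that the naive version makes one call per server while the bulk version makes a single call for the whole listing. Whether the real API calls batch this cleanly is exactly the kind of thing the benchmarks would need to show.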