[neutron] api performance at scale
Is there a SIG or similar discussion on neutron performance at scale? For us nova used to be the biggest concern, but a lot of work has been done and nova now performs great. Instead we are having trouble getting Neutron to perform at scale. Obvious calls like security groups perform really poorly, and the nova-compute defaults for refreshing the network cache on computes cause massive issues with Neutron.
Best Regards, Erik Olof Gunnar Andersson
On 12/3/2019 11:24 AM, Erik Olof Gunnar Andersson wrote:
For us nova used to be the biggest concern, but a lot of work has been done and nova now performs great. Instead we are having trouble getting Neutron to perform at scale. Obvious calls like security groups perform really poorly, and the nova-compute defaults for refreshing the network cache on computes cause massive issues with Neutron.
I wonder how much of the performance hit is due to rootwrap usage in neutron (nova's conversion to privsep was completed in Train).
Nova might be the bee's knees, but I know there are things in nova we could do to be smarter about not hammering the neutron API as much, e.g.:
https://review.opendev.org/#/c/465792/ - make bulk queries to neutron when refreshing the instance network info cache
https://review.opendev.org/#/q/I7de14456d04370c842b4c35597dca3a628a826a2 - be smarter about filtering to avoid expensive joins
https://bugs.launchpad.net/nova/+bug/1567655 - nova's internal network info cache only stores information about ports and their related networks/subnets/ips, but the security group information related to the ports attached to a server is fetched directly anytime it's needed, including when listing servers with details. So if you're an admin listing all servers across all tenants, that could get pretty slow. I've long thought we should cache the security group information like we do for ports for read-only operations like GET /servers/detail, but it's a non-trivial amount of work to make that happen and we'd definitely want benchmarks and such to justify the change.
Note ttx has started a large ops SIG or whatever so this is probably something to discuss there: https://wiki.openstack.org/wiki/Large_Scale_SIG
--
Thanks,
Matt
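To illustrate the bulk-query idea from the first item in the list above, here is a minimal sketch in plain Python. It is not nova's or neutron's code: `FakeNetworkAPI`, `list_ports`, `refresh_per_instance`, `refresh_bulk`, and the chunk size are all hypothetical names and values chosen for the example. It only shows why fetching ports for many instances in one request reduces the number of API round trips.

```python
class FakeNetworkAPI:
    """Hypothetical in-memory stand-in for a network API; it counts calls."""

    def __init__(self, ports):
        self._ports = ports
        self.calls = 0

    def list_ports(self, device_ids):
        """Return every port whose device_id is in device_ids (one call)."""
        self.calls += 1
        wanted = set(device_ids)
        return [p for p in self._ports if p["device_id"] in wanted]


def refresh_per_instance(api, instance_ids):
    """Naive refresh: one API call per instance."""
    return {iid: api.list_ports([iid]) for iid in instance_ids}


def refresh_bulk(api, instance_ids, chunk_size=50):
    """Bulk refresh: one API call per chunk of instances."""
    cache = {iid: [] for iid in instance_ids}
    for start in range(0, len(instance_ids), chunk_size):
        chunk = instance_ids[start:start + chunk_size]
        # Group the returned ports back under the instance that owns them.
        for port in api.list_ports(chunk):
            cache[port["device_id"]].append(port)
    return cache


# Made-up data: 100 instances with two ports each.
ports = [{"id": f"port-{i}", "device_id": f"vm-{i % 100}"} for i in range(200)]
instance_ids = [f"vm-{i}" for i in range(100)]

naive_api = FakeNetworkAPI(ports)
naive = refresh_per_instance(naive_api, instance_ids)

bulk_api = FakeNetworkAPI(ports)
bulk = refresh_bulk(bulk_api, instance_ids, chunk_size=50)

print(naive_api.calls, bulk_api.calls)
```

Both functions build the same cache. The difference is only in how many requests reach the API, which is the cost the thread is concerned with. Chunking keeps any single request from growing without bound.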
Yea - I think those patches would help a lot, especially the security group related change. Security groups are for some reason the most expensive call in Neutron for us. In our larger deployments the simplest security group list command takes 60 seconds to complete. We had very similar issues with neutron-lbaas, but those calls have since been fixed.
The large ops SIG is unfortunately at 1 AM (Pacific Time) over here. I can try to attend, but it wouldn't be easy.
Hi,
In the past we had a biweekly meeting dedicated to Neutron performance. We have now folded it into Monday's Neutron team meeting as one of the agenda points. Please sync with Miguel Lavalle about that. He leads the performance subteam in Neutron and is working on profiling and identifying the things that slow Neutron down the most.
Speaking of security groups, is your problem at the API level or the backend level? If it's on the backend, which firewall driver are you using: openvswitch or iptables_hybrid (or maybe some other)?
Also, I know we have a big performance issue if you use a security group that has a rule with remote_security_group set (such a rule is added by default to the default SG). In that case, if you have many ports using the same SG, every time you add a new port to this SG all the other ports are updated by the L2 agent, and that is very slow when there are many ports. So removing remote_security_group from the rules and creating rules based on remote CIDRs instead would help a lot with this. We discussed this at the Denver PTG, but I don't think any bug was reported on Launchpad for it.
--
Slawek Kaplonski
Senior software engineer
Red Hat
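The remote-group workaround described above can be sketched as follows. This is a toy illustration, not Neutron code: `expand_remote_group_rule` and `agent_updates_when_adding_ports` are hypothetical helpers, the rule is a plain dict using the security group rule field names `remote_group_id` and `remote_ip_prefix`, and the CIDRs and port counts are made up. The update count assumes exactly the behaviour described in the message (each new port triggers an update of every port already in the group) and ignores everything else the L2 agent does.

```python
def expand_remote_group_rule(rule, cidrs):
    """Return CIDR-based copies of a rule that references a remote group.

    Rules without a remote group are returned unchanged (as a copy).
    """
    if not rule.get("remote_group_id"):
        return [dict(rule)]
    expanded = []
    for cidr in cidrs:
        new_rule = {k: v for k, v in rule.items() if k != "remote_group_id"}
        new_rule["remote_ip_prefix"] = cidr
        expanded.append(new_rule)
    return expanded


def agent_updates_when_adding_ports(existing_ports, new_ports, uses_remote_group):
    """Toy model: count refreshes of already-present ports.

    Assumption: with a remote-group rule, every added port forces a
    refresh of all ports already in the group; with CIDR rules, none.
    """
    updates = 0
    members = existing_ports
    for _ in range(new_ports):
        if uses_remote_group:
            updates += members
        members += 1
    return updates


# Made-up example rule: allow SSH from members of the same group.
rule = {
    "direction": "ingress",
    "ethertype": "IPv4",
    "protocol": "tcp",
    "port_range_min": 22,
    "port_range_max": 22,
    "remote_group_id": "sg-default",
}
cidrs = ["10.0.0.0/24", "10.0.1.0/24"]  # assumed subnets of the group members
expanded = expand_remote_group_rule(rule, cidrs)

with_group = agent_updates_when_adding_ports(1000, 10, uses_remote_group=True)
with_cidrs = agent_updates_when_adding_ports(1000, 10, uses_remote_group=False)
print(len(expanded), with_group, with_cidrs)
```

One trade-off to keep in mind: a CIDR rule admits any address in the range, not only ports that belong to the group, so the CIDRs need to be chosen to match the intended peers.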
The problem is primarily at the API level. Just to give you an idea, we have Ansible playbooks we use to manage security groups in one region that need to be redesigned, but before we upgraded from Mitaka they took maybe 10 minutes to run; with Rocky they take about 7 hours. I'll reach out to Miguel.
Other issues have been:
- The DHCP agent performing very poorly with many ports, but I think all of those issues have been fixed and backported to Rocky now.
- Extreme memory usage. Each neutron-server process uses up to 8.2 GB of memory during high load.
- Lots of database locking causing issues, especially after an upgrade. One of the side effects has been the agent heartbeat queue growing exponentially. It has recently been discussed heavily in the neutron channel. Very similar to this bug: https://bugs.launchpad.net/neutron/+bug/1853071
Thanks!
Best Regards, Erik Olof Gunnar Andersson
participants (3)
- Erik Olof Gunnar Andersson
- Matt Riedemann
- Slawek Kaplonski