[neutron] api performance at scale

Erik Olof Gunnar Andersson eandersson at blizzard.com
Wed Dec 4 18:46:43 UTC 2019

The problem is primarily at the API level. To give you an idea, we have Ansible playbooks that manage security groups in one region. They need to be redesigned, but before we upgraded from Mitaka a run took maybe 10 minutes; with Rocky it takes about 7 hours.

I'll reach out to Miguel.

Other issues have been:
- The DHCP agent performing very poorly with many ports, but I think all of those issues have been fixed and backported to Rocky now.
- Extreme memory usage. Each neutron-server process uses up to 8.2GB of memory during high load.
- Lots of database locking causing issues, especially after an upgrade. One of the side effects has been that the agent heartbeat queue grows exponentially. This has recently been discussed heavily in the neutron channel. It is very similar to this bug: https://bugs.launchpad.net/neutron/+bug/1853071


Best Regards, Erik Olof Gunnar Andersson

-----Original Message-----
From: Slawek Kaplonski <skaplons at redhat.com> 
Sent: Wednesday, December 4, 2019 2:25 AM
To: Erik Olof Gunnar Andersson <eandersson at blizzard.com>
Cc: openstack-discuss at lists.openstack.org
Subject: Re: [neutron] api performance at scale


In the past we had a biweekly meeting about Neutron performance.
Now we have included it as one of the agenda points of Monday's Neutron team meeting.

Please sync with Miguel Lavalle about that. He is the leader of the performance subteam in Neutron, and he is working on profiling and on identifying the things that slow Neutron down the most.

Speaking of security groups, is your problem at the API level or the backend level?
If it's on the backend, which firewall driver are you using? openvswitch or iptables_hybrid (or maybe some other)?

Also, I know we have a big performance issue if you are using a security group with remote_security_group set in its rules (such a rule is added by default to the default SG).
In that case, if you have many ports using the same SG, every time you add a new port to this SG, all the other ports are updated by the L2 agent, and that is very slow if there are many ports.
So removing remote_security_group from the rules and creating rules based on remote CIDRs instead would help a lot with this.
We discussed this at the Denver PTG, but I don't think any bug was reported on Launchpad for it.
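
To illustrate the CIDR suggestion above, here is a minimal sketch of how a rule that points at a remote security group could be rewritten as one rule per CIDR. It only transforms plain dictionaries and makes no Neutron calls. The field names (remote_group_id, remote_ip_prefix, and so on) follow the Neutron security group rule API, but the helper expand_remote_group_rule, the IDs, and the CIDRs are hypothetical and would have to come from your own deployment.

```python
from typing import Any, Dict, List

Rule = Dict[str, Any]


def expand_remote_group_rule(rule: Rule, cidrs: List[str]) -> List[Rule]:
    """Return CIDR-based copies of a rule that references a remote group.

    Hypothetical helper: a rule without remote_group_id is returned
    unchanged (as a copy); otherwise one new rule per CIDR is produced.
    """
    if not rule.get("remote_group_id"):
        return [dict(rule)]

    expanded = []
    for cidr in cidrs:
        # Drop the old rule ID and the remote group reference; the new
        # rule matches on a remote IP prefix instead.
        new_rule = {
            key: value
            for key, value in rule.items()
            if key not in ("id", "remote_group_id")
        }
        new_rule["remote_ip_prefix"] = cidr
        expanded.append(new_rule)
    return expanded


# Example input; all IDs and CIDRs here are made up for illustration.
rule = {
    "id": "rule-1",
    "security_group_id": "sg-web",
    "direction": "ingress",
    "ethertype": "IPv4",
    "protocol": "tcp",
    "port_range_min": 443,
    "port_range_max": 443,
    "remote_group_id": "sg-web",
}
new_rules = expand_remote_group_rule(rule, ["10.0.0.0/24", "10.0.1.0/24"])
for r in new_rules:
    print(r["remote_ip_prefix"], r["protocol"], r["port_range_min"])
```

The resulting dictionaries would then be submitted with whatever client you already use, and the original remote-group rule deleted. The trade-off is that the CIDR list has to be maintained by hand when subnets change.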

On Tue, Dec 03, 2019 at 05:24:54PM +0000, Erik Olof Gunnar Andersson wrote:
> Is there a SIG or similar discussion on neutron performance at scale?
> For us nova used to be the biggest concern, but a lot of work has been done and nova now performs great. Instead we are having trouble getting Neutron to perform at scale. Obvious calls like security groups are performing really poorly, and the nova-compute defaults for refreshing the network cache on computes cause massive issues with Neutron.
> Best Regards, Erik Olof Gunnar Andersson

Slawek Kaplonski
Senior software engineer
Red Hat
