[Openstack-operators] [neutron] neutron-server high CPU and RAM load when updating DHCP ports
mdorman at godaddy.com
Fri Dec 23 17:18:15 UTC 2016
We noticed an issue in one of our larger clouds (~700 hypervisors and OVS agents) where neutron-server (Liberty) CPU and RAM load would spike sharply whenever a DHCP agent port was updated. The load was high enough that processes were getting OOM-killed on our API servers, and so many queries were hitting the database that it degraded the performance of other APIs sharing the same database cluster.
Kris Lindgren determined what was happening: any time a DHCP port is changed, Neutron forces a complete refresh of all security group filter rules for all ports on the same network as the DHCP port. We run only provider networks which VMs plug directly into, and our largest network has several thousand ports. This was generating an avalanche of RPCs from the OVS agents, thus loading up neutron-server with a lot of work.
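To make the fan-out described above concrete, here is a toy model in Python. It is not Neutron code: the function names, the 3000-port network size (standing in for "several thousand"), and the even spread of ports across 700 hosts are all assumptions chosen for illustration.

```python
# Toy model only: names and numbers are illustrative, not Neutron code.

def refreshed_ports(ports_by_network, changed_network, blanket_refresh=True):
    """Return the ports whose security group filters get refreshed
    when a DHCP port on `changed_network` changes."""
    if not blanket_refresh:
        # Patched behaviour as described in this thread: no refresh is triggered.
        return []
    return list(ports_by_network.get(changed_network, []))


def agents_to_notify(refreshed, host_of_port):
    """Distinct hosts (one OVS agent each) that own at least one refreshed port."""
    return {host_of_port[p] for p in refreshed}


# Assumed layout: 3000 ports on one provider network spread over 700 hosts.
ports = [f"port-{i}" for i in range(3000)]
host_of_port = {p: f"hv-{i % 700}" for i, p in enumerate(ports)}
ports_by_network = {"provider-net": ports}

before = refreshed_ports(ports_by_network, "provider-net")
after = refreshed_ports(ports_by_network, "provider-net", blanket_refresh=False)
print(len(before), len(agents_to_notify(before, host_of_port)), len(after))
```

Under these assumptions a single DHCP port change touches every port on the network and every agent that hosts one of them, which is why the cost grows with network size rather than with the size of the change.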
We only use DHCP as a backup network configuration mechanism in case something goes wrong with config-drive, so we are not heavy users of it. But DHCP agents are scheduled and removed often enough, and our networks contain enough ports, that this has begun to affect us quite a bit.
Kevin Benton suggested a minor patch to disable the blanket refresh of security group filters on DHCP port changes. I tested it out in our staging environment and can confirm that:
- Security group filters are indeed not refreshed on DHCP port changes
- iptables rules generated for regular VM ports still include the generic rules allowing DHCP traffic on ports 67 and 68, whether or not a DHCP agent is present on the network. (This covers the scenario where VM ports are created while there are no DHCP agents, and a DHCP agent is added later.)
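For anyone who wants to repeat the second check, a small helper along these lines can scan an iptables-save style dump for UDP allow rules between ports 67 and 68. This is only a sketch: the helper names are hypothetical, and the chain names and rule text in the sample are made up for illustration rather than copied from what Neutron generates.

```python
import re

# Matches "--sport N" / "--dport N" tokens in an iptables-save style line.
PORT_RE = re.compile(r"--(sport|dport)\s+(\d+)")


def is_dhcp_allow_rule(rule):
    """True if `rule` is a UDP rule between ports 67 and 68 that does not DROP."""
    if "-p udp" not in rule or "-j DROP" in rule:
        return False
    ports = {kind: int(num) for kind, num in PORT_RE.findall(rule)}
    return {ports.get("sport"), ports.get("dport")} == {67, 68}


def dhcp_allow_rules(dump):
    """Return the lines of `dump` that look like DHCP allow rules."""
    return [line for line in dump.splitlines() if is_dhcp_allow_rule(line)]


# Hypothetical sample; chain names and rule text are made up for illustration.
sample = """\
-A example-i-port -p udp -m udp --sport 67 --dport 68 -j RETURN
-A example-o-port -p udp -m udp --sport 68 --dport 67 -j RETURN
-A example-o-port -p udp -m udp --sport 67 --dport 68 -j DROP
-A example-i-port -p tcp -m tcp --dport 22 -j RETURN
"""
found = dhcp_allow_rules(sample)
print(len(found))
```

Running the same check on a hypervisor's rules before and after adding a DHCP agent to the network should show whether the allow rules are present in both cases.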
I think there are some plans to deprecate this behavior. As far as I know, it still exists in master neutron. I’m happy to put the trivial patch up for review if people think this is a useful change to Neutron.
We are a bit of an edge case with such a large number of ports per network. We are also considering disabling DHCP altogether, since it is barely used. Still, I wanted to share the experience in case others are running into the same issue.