Hi,
There are two main patches that I am interested in back-porting to improve the performance of the DB queries issued frequently by L2 agents while they are hosting VMs. These are not just one-time queries during specific operations (e.g. create/delete); they also happen during normal periodic checks from the L2 agent. Due to this constant background behavior, the agents start to trample the Neutron server once the deployment size scales up and will eventually exceed its resources, so it can no longer service API requests even though nothing is changing.
The only work-around for this right now is to abnormally scale (compared to any of the other standard OpenStack services) the Neutron server and the MySQL nodes to handle the query load. This is really discouraging to deployers (lots of extra compute power wasted as service nodes) and makes Neutron appear extremely unstable to deployers who do not know Neutron needs to be special-cased in this manner.
The first patch is to batch up the ports being requested from an RPC agent before querying the database.[1] This is an internal-only change (it doesn't affect the data delivered to RPC callers). Before, the server was calling the DB for each port individually, so a single request from a high-density port node like an L3 agent could result in 1000+ DB queries. Now the server will query the database for all of the port information at once and then group it by port like the agents expect. This is probably the most significant improvement when dealing with high-density nodes, and there is a Rally performance graph demonstrating this in the comments.
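To make the difference concrete, here is a rough sketch of the per-port versus batched pattern. It is not the code from the patch: FakeDB, fetch_for_port, fetch_for_ports and both details_* functions are names made up for this illustration, and the "database" is just an in-memory list that counts how many times it is queried.

```python
from collections import defaultdict


class FakeDB:
    """Hypothetical stand-in for the DB; every fetch_* call counts as one query."""

    def __init__(self, rows):
        self.rows = rows          # list of dicts, each with a "port_id" key
        self.query_count = 0

    def fetch_for_port(self, port_id):
        # One round trip per port (the old behavior).
        self.query_count += 1
        return [r for r in self.rows if r["port_id"] == port_id]

    def fetch_for_ports(self, port_ids):
        # One round trip for the whole set of ports (the batched behavior).
        self.query_count += 1
        wanted = set(port_ids)
        return [r for r in self.rows if r["port_id"] in wanted]


def details_per_port(db, port_ids):
    # Issues len(port_ids) queries.
    return {pid: db.fetch_for_port(pid) for pid in port_ids}


def details_batched(db, port_ids):
    # Issues a single query, then regroups the rows by port so the
    # caller sees the same per-port structure as before.
    grouped = defaultdict(list)
    for row in db.fetch_for_ports(port_ids):
        grouped[row["port_id"]].append(row)
    return {pid: grouped.get(pid, []) for pid in port_ids}
```

With 1000 ports, details_per_port() bumps query_count by 1000 while details_batched() bumps it by 1, and both return the same dict keyed by port ID. That is the property that makes the change invisible to RPC callers.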
The second patch is to eliminate a join across the Neutron port table that was a completely unnecessary calculation for the DB to perform and a waste of data returned (every column from every table in the query).[2] This also doesn't change the data returned to the caller of the function (no missing dict entries, etc.), so we shouldn't have to worry about out-of-tree drivers, tools, etc. being broken by this either. I will run the Rally performance numbers for this one as well after the first patch gets merged, since the first patch has a higher impact than this one.
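As a generic illustration of why dropping an unneeded join is safe for callers, here is a small sqlite3 sketch. The schema is invented for this example (the ports and port_bindings tables and their columns are assumptions, not the tables touched by the patch); the point is only that both queries yield the same result for the caller while the joined one drags along every column of both tables.

```python
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ports (id TEXT PRIMARY KEY, name TEXT,
                    mac_address TEXT, device_id TEXT);
CREATE TABLE port_bindings (port_id TEXT REFERENCES ports(id), host TEXT);
INSERT INTO ports VALUES ('p1', 'vm1-eth0', 'fa:16:3e:00:00:01', 'vm1');
INSERT INTO ports VALUES ('p2', 'vm2-eth0', 'fa:16:3e:00:00:02', 'vm2');
INSERT INTO port_bindings VALUES ('p1', 'compute-1');
INSERT INTO port_bindings VALUES ('p2', 'compute-2');
""")


def hosts_with_join(conn, port_ids):
    # Joins the port table and returns every column from both tables,
    # even though only port_id and host are used.
    marks = ",".join("?" for _ in port_ids)
    cur = conn.execute(
        "SELECT * FROM port_bindings b JOIN ports p ON p.id = b.port_id "
        "WHERE b.port_id IN (%s)" % marks, port_ids)
    width = len(cur.description)  # number of columns the DB sent back
    return {row[0]: row[1] for row in cur}, width


def hosts_without_join(conn, port_ids):
    # Reads only the table and columns that are actually needed.
    marks = ",".join("?" for _ in port_ids)
    cur = conn.execute(
        "SELECT port_id, host FROM port_bindings "
        "WHERE port_id IN (%s)" % marks, port_ids)
    width = len(cur.description)
    return {row[0]: row[1] for row in cur}, width
```

Both functions return the same port-to-host dict, but in this toy schema the joined query returns six columns per row and the plain one returns two.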
Let me know if I need to elaborate on anything.
Thanks,
Kevin Benton