Hi,

There are two main patches that I am interested in back-porting to improve the performance of the DB queries that L2 agents issue frequently while they are hosting VMs. These are not just one-time queries during specific operations (e.g. create/delete); they also happen during the normal periodic checks from the L2 agent. Due to this constant background behavior, the agents start to trample the Neutron server once the deployment scales up, and they will eventually exhaust its resources so that it can no longer service API requests even though nothing is changing.

The only work-around right now is to scale the Neutron server and the MySQL nodes abnormally (compared to any of the other standard OpenStack services) to handle the query load. This is really discouraging to deployers (lots of extra compute power wasted on service nodes) and makes Neutron appear extremely unstable to anyone who does not know that it needs to be special-cased in this manner.

The first patch is to batch up the ports being requested by an RPC agent before querying the database.[1] This is an internal-only change (it doesn't affect the data delivered to RPC callers). Before, the server was calling the DB for each port individually, so a request from a high-density port node like an L3 agent could result in 1000+ DB queries. Now the server queries the database for all of the port information at once and then groups it by port like the agents expect. This is probably the most significant improvement when dealing with high-density nodes, and there is a Rally performance graph demonstrating this in the review comments.
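
To make the batching idea concrete, here is a minimal sketch of the before/after query pattern. It is not the code from the patch: the port_rules table, its columns, and the function names are all hypothetical, and it uses an in-memory SQLite database instead of Neutron's real models.

```python
import sqlite3


def make_db():
    # Hypothetical single-table schema, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE port_rules (port_id TEXT, rule TEXT)")
    conn.executemany(
        "INSERT INTO port_rules VALUES (?, ?)",
        [("p1", "allow tcp/22"), ("p1", "allow icmp"), ("p2", "allow tcp/80")],
    )
    return conn


def rules_per_port_unbatched(conn, port_ids):
    # Old pattern: one DB round trip per port.
    result = {}
    for port_id in port_ids:
        rows = conn.execute(
            "SELECT rule FROM port_rules WHERE port_id = ? ORDER BY rule",
            (port_id,),
        ).fetchall()
        result[port_id] = [row[0] for row in rows]
    return result


def rules_per_port_batched(conn, port_ids):
    # New pattern: one query for all ports, then group the rows by port
    # in Python so the caller sees the same per-port structure as before.
    port_ids = list(port_ids)
    result = {port_id: [] for port_id in port_ids}
    if not port_ids:
        return result
    placeholders = ",".join("?" * len(port_ids))
    rows = conn.execute(
        "SELECT port_id, rule FROM port_rules "
        "WHERE port_id IN (%s) ORDER BY port_id, rule" % placeholders,
        port_ids,
    )
    for port_id, rule in rows:
        result[port_id].append(rule)
    return result
```

Both functions return the same dict for the same input, which is the point: callers can't tell the difference, but the batched version issues one query regardless of how many ports are requested. A real implementation might also need to chunk very large ID lists.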

The second patch is to eliminate a join across the Neutron port table that was a completely unnecessary calculation for the DB to perform and a waste of returned data (every column from every table in the query).[2] This also doesn't change the data returned to the caller of the function (no missing dict entries, etc.), so we shouldn't have to worry about out-of-tree drivers, tools, etc. being broken by it either. I will run the Rally performance numbers for this one as well after the first patch gets merged, since the first patch has the higher impact of the two.
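
As a rough illustration of why dropping the join is safe for callers, here is a small sketch. Again, this is not the code from the patch: the ports and sg_bindings tables, their columns, and the function names are hypothetical, and the equivalence assumes every binding row refers to an existing port.

```python
import sqlite3


def make_db():
    # Hypothetical two-table schema, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ports (id TEXT PRIMARY KEY, name TEXT, mac TEXT);
        CREATE TABLE sg_bindings (port_id TEXT, sg_id TEXT);
        INSERT INTO ports VALUES ('p1', 'vm-a', 'fa:16:3e:00:00:01');
        INSERT INTO ports VALUES ('p2', 'vm-b', 'fa:16:3e:00:00:02');
        INSERT INTO sg_bindings VALUES ('p1', 'sg-web');
        INSERT INTO sg_bindings VALUES ('p2', 'sg-db');
    """)
    return conn


def sg_ids_with_join(conn, port_id):
    # Old pattern: join the port table and pull back every column from
    # both tables, even though only sg_id is used afterwards.
    rows = conn.execute(
        "SELECT * FROM sg_bindings b JOIN ports p ON p.id = b.port_id "
        "WHERE b.port_id = ?",
        (port_id,),
    ).fetchall()
    return [row[1] for row in rows]  # column 1 is sg_bindings.sg_id


def sg_ids_without_join(conn, port_id):
    # New pattern: the binding table already carries port_id, so the
    # join adds nothing; select only the column that is needed.
    rows = conn.execute(
        "SELECT sg_id FROM sg_bindings WHERE port_id = ?",
        (port_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

The two functions return identical results here, while the second one does less work in the DB and sends fewer columns back over the wire.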

Let me know if I need to elaborate on anything.

1. https://review.openstack.org/#/c/132372/
2. https://review.openstack.org/#/c/130101/

Thanks,
Kevin Benton

On Wed, Oct 29, 2014 at 6:09 AM, Ihar Hrachyshka <ihrachys@redhat.com> wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

On 29/10/14 14:00, Dolph Mathews wrote:
>
> On Wed, Oct 29, 2014 at 5:23 AM, Ihar Hrachyshka
> <ihrachys@redhat.com <mailto:ihrachys@redhat.com>> wrote:
>
> Hi all,
>
> there is a series of Neutron backports in the Juno queue that are
> intended to significantly improve service performance when
> handling security groups (one of the issues that are main pain
> points of current users):
>
> - https://review.openstack.org/130101 -
> https://review.openstack.org/130098 -
> https://review.openstack.org/130100 -
> https://review.openstack.org/130097 -
> https://review.openstack.org/130105
>
> The first four patches are optimizing db side (controller), while
> the last one is to avoid fetching security group rules by OVS agent
> when firewall is disabled.
>
> AFAIK we don't generally backport performance improvements unless
> they are very significant (though I don't see anything written in
> stone that says so), but knowing that those patches fix pain
> hotspots in Neutron, and seem rather isolated, should we consider
> their inclusion?
>
> Should we come up with some "official" rule on how we handle
> performance enhancement backports?
>
>
>> I'm very much in favor of backporting known performance
>> improvements, but in my experience, not all "performance
>> improvements" actually improve performance, so I'd expect an
>> appropriate benchmark to demonstrate a real performance benefit
>> to coincide with the proposed patch.

Exactly. That's what I asked to elaborate on at:
https://review.openstack.org/#/c/130101/

Also, adding Kevin into CC to make sure he is aware of the discussion.

>
>> For a hypothetical example, what seems like a clear cut
>> improvement in review 130098 (remove unused columns from a query)
>> *might* have an unforeseen side effect later on, where another
>> component doesn't have the data it needs, so it suddenly starts
>> issuing a new DB query to compensate. OpenStack is certainly
>> complicated enough that it's impossible to make accurate
>> assumptions about performance.
>
>
>
> /Ihar
>
> _______________________________________________
> Openstack-stable-maint mailing list
> Openstack-stable-maint@lists.openstack.org
> <mailto:Openstack-stable-maint@lists.openstack.org>
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-stable-maint
>
>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG/MacGPG2 v2.0.22 (Darwin)

iQEcBAEBCgAGBQJUUObtAAoJEC5aWaUY1u57UYwH/j+wjiydOXjA+lFi3l1Pbl5f
s7r4Ox6FCPPVoAKziKpygKRbHTrCTew4DcgOxZhmC9qoq+Rk8Q1WFMLlBQ+51Kjj
lj/72JiPenKvuZSl/E+9FsmWP7ReCCyUMYWiQS6wp6FAd5KpQMMgdjleUQWEAgjN
Y1M9kYVOmqnYHQy4oWJsV0Od2wFKFAGDKohLEzDocmTQFxcfkEeMSn3qJ4aOwkoz
KmTFKPGAGU8eTyYNAs3sHa0t9VFwvPoBg4EjMXBjkuoRxz+Nf/IPUZmrruXQ7LM6
ioXEUH3GdKQSCKWtYoFFI1QPpiTQSIalO6nURxUg0UldW6i5QwIX1LTz8GMG+TQ=
=JJq0
-----END PGP SIGNATURE-----



--
Kevin Benton