[EXTERNAL] Re: [neutron][ovn] OVN Performance

Apsey, Christopher CAPSEY at augusta.edu
Thu Aug 27 15:32:37 UTC 2020


Assaf,

We can absolutely support engineering poking around in our environment (and possibly an even larger one at my previous employer that was experiencing similar issues during testing).  We can take this offline so we don’t spam the mailing list.

Just let me know how to proceed,

Thanks!

Chris Apsey
GEORGIA CYBER CENTER

From: Assaf Muller <amuller at redhat.com>
Sent: Thursday, August 27, 2020 11:18 AM
To: Apsey, Christopher <CAPSEY at augusta.edu>
Cc: openstack-discuss at lists.openstack.org; Lucas Alvares Gomes Martins <lmartins at redhat.com>; Jakub Libosvar <jlibosva at redhat.com>; Daniel Alvarez Sanchez <dalvarez at redhat.com>
Subject: [EXTERNAL] Re: [neutron][ovn] OVN Performance

The most efficient way to go about this is to give one or more of the
engineers working on OpenStack OVN upstream (I've added a few to this
thread) temporary access to an environment that can reproduce the
issues you're seeing. We could then document the issues and work
towards solutions. If that's not possible, and you could provide
reproducer scripts or otherwise sharpen the reproduction method, we'll
take a look. What you've described is not 'acceptable':
OVN should definitely not scale worse than Neutron with the Linux
Bridge agent. It's possible that the particular issues you ran into
are something that we've already seen internally at Red Hat, or with
our customers, and that we're already working on fixes for future
versions of OVN. I can't tell until you elaborate on the details of the
issues you're seeing. In any case, the upstream community is committed
to improving OVN scale and fixing scale issues as they pop up.
Coincidentally, Red Hat scale engineers just published an article [1]
about work they've done to scale RH-OSP 16.1 (== OpenStack Train on
CentOS 8, with OVN 2.13 and TripleO) to 700 compute nodes.

[1] https://www.redhat.com/en/blog/scaling-red-hat-openstack-platform-161-more-700-nodes?source=bloglisting

On Thu, Aug 27, 2020 at 10:44 AM Apsey, Christopher <CAPSEY at augusta.edu> wrote:
>
> All,
>
>
>
> I know that OVN is going to become the default neutron backend at some point and displace linuxbridge as the default configuration option in the docs, but we have noticed a pretty significant performance disparity between OVN and linuxbridge on identical hardware over the past year or so in a few different environments[1]. I know that example is unscientific, but similar results have been borne out in many different scenarios from what we have observed. There are three main problems from what we see:
>
>
>
> 1. OVN does not handle large numbers of concurrent requests as well as linuxbridge. Additionally, linuxbridge concurrent capacity grows (not linearly, but grows nonetheless) when neutron API endpoints and RPC agents are added. OVN does not really scale horizontally by adding API endpoints, from what we have observed.
>
> 2. OVN gets significantly slower as load on the system grows. We have observed a soft cap of about 2000-2500 instances in a given deployment before ovn-backed neutron stops responding altogether to nova requests (even for booting a single instance). We have observed linuxbridge get to 5000+ instances before it starts to struggle on the same hardware (and we think that linuxbridge can go further with improved provider network design in that particular case).
>
> 3. Once the southbound database process hits 100% CPU usage on the leader in the OVN cluster, it’s game over (this probably causes problems 1 and 2).
>
>
>
> It's entirely possible that we just don’t understand OVN well enough to tune it [2][3][4], but then the question becomes how do we get that tuning knowledge into the docs so people don’t scratch their heads when their cool new OVN deployment scales 40% as well as their ancient linuxbridge-based one?
>
>
>
> If it is ‘known’ that OVN has some scaling challenges, is there a plan to fix it, and what is the best way to contribute to doing so?
>
>
>
> We have observed similar results on Ubuntu 18.04/20.04 and CentOS 7/8 on Stein, Train, and Ussuri.
>
>
>
> [1] https://pastebin.com/kyyURTJm
>
> [2] https://github.com/GeorgiaCyber/kinetic/tree/master/formulas/ovsdb
>
> [3] https://github.com/GeorgiaCyber/kinetic/tree/master/formulas/neutron
>
> [4] https://github.com/GeorgiaCyber/kinetic/tree/master/formulas/compute
>
>
>
> Chris Apsey
>
> GEORGIA CYBER CENTER
>
>