[neutron][ovn] OVN Performance

Daniel Alvarez Sanchez dalvarez at redhat.com
Wed Sep 2 13:42:34 UTC 2020


Hey Chris, thanks for sharing this :)

On Wed, Sep 2, 2020 at 3:30 PM Apsey, Christopher <CAPSEY at augusta.edu>
wrote:

> All,
>
>
>
> Just wanted to loop back here and give an update.
>
>
>
> For reference, [1] (blue means successful action, red means failed action)
> is the result we got when booting 5000 instances in rally [2] before the
> Red Hat OVN devs poked around inside our environment, and [3] is the result
> after.  The differences are obviously pretty significant.  I think the
> biggest change was setting metadata_workers = 2 in
> neutron_ovn_metadata_agent.ini on the compute nodes per
> https://bugs.launchpad.net/neutron/+bug/1893656.  We have 64C/128T on all
> compute nodes, so the default neutron calculation of scaling metadata
> workers based on available cores created 900+ connections to the southbound
> db at idle; after the control plane got loaded up it just quit around 2500
> instances (my guess is it hit the open file limit, although I don’t think
> increasing it would have made it better for much longer since the number of
> connections was increasing exponentially).  Capping the number of metadata
> workers decreased open southbound connections by 90%.  Even more telling
> was that rally was able to successfully clean up after itself after we made
> that change, whereas previously it wasn’t even able to successfully tear
> down any of the instances that were made, indicating that the control plane
> was completely toast.
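For anyone tuning a similar deployment, a minimal sketch of the change described above (the option name comes from the bug report; the file path and section are the usual neutron agent layout and may differ per distro):

```ini
# /etc/neutron/neutron_ovn_metadata_agent.ini (path varies by distro)
[DEFAULT]
# By default the worker count is derived from the CPU count, so a
# 64C/128T compute node spawns dozens of workers, each holding its own
# connection to the OVN southbound DB. Pinning it low keeps the
# per-node connection count roughly constant as the cloud grows.
metadata_workers = 2
```

Restart neutron-ovn-metadata-agent on each compute node after the change; the drop in southbound connections should be immediately visible.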
>
>
>
> Note that the choppiness towards the end of [3] had nothing to do with OVN
> – our compute nodes had a loadavg approaching 1000 at that point, so they
> were just starved for cpu cycles.  This would have scaled even better with
> additional compute nodes.
>
>
>
> The other piece was RAFT.  Currently, RDO is shipping with ovs 2.12, but
> 2.13 has a bunch of RAFT fixes in it that improve stability and knock out
> some bugs.  We were having issues with chassis registration on 2.12, but
> after using the 2.13 package from cbs, all those issues went away.
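A quick way to sanity-check a RAFT cluster after an upgrade like this is to ask the ovsdb-servers for their cluster view (the control socket paths below are common defaults and may differ between packagings):

```
# Show RAFT role, term, leader, and peer list for the southbound DB
ovs-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

# The northbound DB can be checked the same way
ovs-appctl -t /var/run/ovn/ovnnb_db.ctl cluster/status OVN_Northbound
```

A healthy cluster shows one leader, all servers connected, and a stable term number; chassis-registration trouble often shows up here first.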
>
>
>
> Big thanks to the great people at Red Hat on the cc line for volunteering
> their valuable time to take a look.
>

Happy to help, it was fun :) Thanks to you for all the details that made it
easier to debug.

>
>
> I’m now significantly more comfortable with defaulting to OVN as the
> backend of choice as the performance delta is now gone.  That said, should
> the community consider dropping linuxbridge as the backend in the official
> upstream docs and jump straight to OVN rather than ml2/OVS?  I think that
> would increase the test base and help shine light on other issues as time
> goes on.  My org can devote some time to doing this work if the community
> agrees that it’s the right action to take.
>

++!!

>
>
> Hope that’s helpful!
>
>
>
> [1] https://ibb.co/GTjZP2y
>
> [2] https://pastebin.com/5pEDZ7dY
>
> [3] https://ibb.co/pfB9KTV
>

Do you have some baseline to compare against? Also I'm curious to see if
you pulled results with and without raft :)

Thanks once again!

>
>
> *Chris Apsey*
>
> *GEORGIA CYBER CENTER*
>
>
>
> *From:* Apsey, Christopher
> *Sent:* Thursday, August 27, 2020 11:33 AM
> *To:* Assaf Muller <amuller at redhat.com>
> *Cc:* openstack-discuss at lists.openstack.org; Lucas Alvares Gomes Martins <
> lmartins at redhat.com>; Jakub Libosvar <jlibosva at redhat.com>; Daniel
> Alvarez Sanchez <dalvarez at redhat.com>
> *Subject:* RE: [EXTERNAL] Re: [neutron][ovn] OVN Performance
>
>
>
> Assaf,
>
>
>
> We can absolutely support engineering poking around in our environment
> (and possibly an even larger one at my previous employer that was
> experiencing similar issues during testing).  We can take this offline so
> we don’t spam the mailing list.
>
>
>
> Just let me know how to proceed,
>
>
>
> Thanks!
>
>
>
> *Chris Apsey*
>
> *GEORGIA CYBER CENTER*
>
>
>
> *From:* Assaf Muller <amuller at redhat.com>
> *Sent:* Thursday, August 27, 2020 11:18 AM
> *To:* Apsey, Christopher <CAPSEY at augusta.edu>
> *Cc:* openstack-discuss at lists.openstack.org; Lucas Alvares Gomes Martins <
> lmartins at redhat.com>; Jakub Libosvar <jlibosva at redhat.com>; Daniel
> Alvarez Sanchez <dalvarez at redhat.com>
> *Subject:* [EXTERNAL] Re: [neutron][ovn] OVN Performance
>
>
>
>
> The most efficient way about this is to give one or more of the
> Engineers working on OpenStack OVN upstream (I've added a few to this
> thread) temporary access to an environment that can reproduce issues
> you're seeing, we could then document the issues and work towards
> solutions. If that's not possible, if you could provide reproducer
> scripts, or alternatively sharpen the reproduction method, we'll take
> a look. What you've described is not something that's 'acceptable',
> OVN should definitely not scale worse than Neutron with the Linux
> Bridge agent. It's possible that the particular issues you ran into
> are something that we've already seen internally at Red Hat, or with
> our customers, and we're already working on fixes in future versions
> of OVN - I can't tell you until you elaborate on the details of the
> issues you're seeing. In any case, the upstream community is committed
> to improving OVN scale and fixing scale issues as they pop up.
> Coincidentally, Red Hat scale engineers just published an article [1]
> about work they've done to scale RH-OSP 16.1 (== OpenStack Train on
> CentOS 8, with OVN 2.13 and TripleO) to 700 compute nodes.
>
> [1]
> https://www.redhat.com/en/blog/scaling-red-hat-openstack-platform-161-more-700-nodes?source=bloglisting
>
> On Thu, Aug 27, 2020 at 10:44 AM Apsey, Christopher <CAPSEY at augusta.edu>
> wrote:
> >
> > All,
> >
> >
> >
> > I know that OVN is going to become the default neutron backend at some
> point and displace linuxbridge as the default configuration option in the
> docs, but we have noticed a pretty significant performance disparity
> between OVN and linuxbridge on identical hardware over the past year or so
> in a few different environments[1]. I know that example is unscientific,
> but similar results have been borne out in many different scenarios from
> what we have observed. There are three main problems from what we see:
> >
> >
> >
> > 1. OVN does not handle large concurrent requests as well as linuxbridge.
> Additionally, linuxbridge concurrent capacity grows (not linearly, but
> grows nonetheless) by adding additional neutron API endpoints and RPC
> agents. OVN does not really horizontally scale by adding additional API
> endpoints, from what we have observed.
> >
> > 2. OVN gets significantly slower as load on the system grows. We have
> observed a soft cap of about 2000-2500 instances in a given deployment
> before ovn-backed neutron stops responding altogether to nova requests
> (even for booting a single instance). We have observed linuxbridge get to
> 5000+ instances before it starts to struggle on the same hardware (and we
> think that linuxbridge can go further with improved provider network design
> in that particular case).
> >
> > 3. Once the southbound database process hits 100% CPU usage on the
> leader in the ovn cluster, it’s game over (probably causes 1+2)
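One cheap signal worth watching for problem 3: the number of established TCP sessions to the southbound DB port (6642 by default). A rough one-liner on a DB node, assuming iproute2's ss is available:

```
# Count established client connections to the southbound ovsdb-server
ss -Htn state established '( sport = :6642 )' | wc -l
```

If that count climbs with every agent restart or scales superlinearly with instance count, the SB leader's CPU is usually next to saturate.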
> >
> >
> >
> > It's entirely possible that we just don’t understand OVN well enough to
> tune it [2][3][4], but then the question becomes how do we get that tuning
> knowledge into the docs so people don’t scratch their heads when their cool
> new OVN deployment scales 40% as well as their ancient linuxbridge-based
> one?
> >
> >
> >
> > If it is ‘known’ that OVN has some scaling challenges, is there a plan
> to fix it, and what is the best way to contribute to doing so?
> >
> >
> >
> > We have observed similar results on Ubuntu 18.04/20.04 and CentOS 7/8 on
> Stein, Train, and Ussuri.
> >
> >
> >
> > [1] https://pastebin.com/kyyURTJm
> >
> > [2] https://github.com/GeorgiaCyber/kinetic/tree/master/formulas/ovsdb
> >
> > [3] https://github.com/GeorgiaCyber/kinetic/tree/master/formulas/neutron
> >
> > [4] https://github.com/GeorgiaCyber/kinetic/tree/master/formulas/compute
> >
> >
> >
> > Chris Apsey
> >
> > GEORGIA CYBER CENTER
> >
> >
>