Re: [neutron][ovn] Logical flow scaling (flow explosion in lr_in_arp_resolve)

Krzysztof Klimonda kklimonda at syntaxhighlighted.com
Fri Sep 18 08:31:50 UTC 2020


So just for testing I've applied this patch to our neutron-server:

--8<--8<--8<--
diff --git a/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py b/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py
index 23a841d7a1..41200786f1 100644
--- a/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py
+++ b/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py
@@ -1141,11 +1141,15 @@ class OVNClient(object):
         enabled = router.get('admin_state_up')
         lrouter_name = utils.ovn_name(router['id'])
         added_gw_port = None
+        options = {
+          "always_learn_from_arp_request": "false",
+          "dynamic_neigh_routers": "true"
+        }
         with self._nb_idl.transaction(check_error=True) as txn:
             txn.add(self._nb_idl.create_lrouter(lrouter_name,
                                                 external_ids=external_ids,
                                                 enabled=enabled,
-                                                options={}))
+                                                options=options))
             # TODO(lucasagomes): add_external_gateway is being only used
             # by the ovn_db_sync.py script, remove it after the database
             # synchronization work
--8<--8<--8<--

and also ran the following for each logical router in OVN:

# ovn-nbctl set Logical_Router $router options=dynamic_neigh_routers=true,always_learn_from_arp_request=false
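
With several hundred routers, the per-router command is easier to script. This is a minimal Python sketch, assuming ovn-nbctl is on the PATH and can reach the northbound DB. The helper names are hypothetical. Unlike the command above, it uses the options:key=value form, which sets individual keys instead of replacing the whole options map:

```python
import subprocess

# Option values taken from the patch above.
ROUTER_OPTIONS = {
    "always_learn_from_arp_request": "false",
    "dynamic_neigh_routers": "true",
}


def build_set_command(router, options=ROUTER_OPTIONS):
    """Return the ovn-nbctl argv that sets each option on one logical router."""
    args = ["ovn-nbctl", "set", "Logical_Router", router]
    # "options:key=value" updates one map key and leaves other keys untouched.
    args += [f"options:{key}={value}" for key, value in sorted(options.items())]
    return args


def list_router_names():
    """Ask the northbound DB for the names of all logical routers."""
    result = subprocess.run(
        ["ovn-nbctl", "--bare", "--columns=name", "list", "Logical_Router"],
        check=True, capture_output=True, text=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def apply_to_all_routers():
    """Set the options on every logical router, stopping on the first error."""
    for name in list_router_names():
        subprocess.run(build_set_command(name), check=True)
```

Calling apply_to_all_routers() on a node with NB access would update the existing routers; routers created later still need the neutron-server patch.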

Setting these options had a huge impact on both the number of logical flows and the number of OVS flows on chassis nodes:

--8<--8<--8<--
# ovn-sbctl --timeout=120 lflow-list > lflows-new.txt
# cat lflows-new.txt |grep -v Datapath |cut -d'(' -f 2 | cut -d ')' -f1 |sort | uniq -c |sort -n | tail -10
   2170 ls_out_port_sec_l2
   2172 lr_in_learn_neighbor
   2666 lr_in_admission
   2690 ls_in_port_sec_l2
   3190 lr_in_ip_routing
   4276 lr_in_lookup_neighbor
   4873 lr_in_arp_resolve
   5864 ls_in_arp_rsp
   5873 ls_in_l2_lkup
  14343 lr_in_ip_input
--8<--8<--8<--
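
The same per-table count can be done in Python, which is handy for comparing several dumps. This is a rough sketch that mirrors the grep/cut pipeline; the function names are hypothetical. It assumes each flow line carries the table name in the first pair of parentheses, as the pipeline does, and it skips lines without parentheses, which the shell version would count as-is:

```python
from collections import Counter


def count_flows_per_table(lines):
    """Count logical flows per table name from lflow-list output lines."""
    counts = Counter()
    for line in lines:
        if "Datapath" in line:  # same as: grep -v Datapath
            continue
        if "(" not in line or ")" not in line:
            continue  # blank or unexpected line, nothing to count
        # same as: cut -d'(' -f 2 | cut -d ')' -f1, with padding stripped
        table = line.split("(", 1)[1].split(")", 1)[0].strip()
        counts[table] += 1
    return counts


def count_flows_in_file(path):
    """Read a saved dump (for example lflows-new.txt) and count its flows."""
    with open(path) as dump:
        return count_flows_per_table(dump)
```

count_flows_in_file("lflows-new.txt").most_common(10) gives the ten largest tables, largest first.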

(and this is with even more routers than before: 500 vs. 400). I'll have to read up on what impact those options have on ARP activity, though.
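
To put the two dumps side by side, here is a small sketch using only the lr_in_arp_resolve and lr_in_ip_input counts quoted in this thread. The router counts (400 before, 500 after) are the ones mentioned above; the helper name is made up:

```python
# Flow counts copied from the two dumps in this thread.
BEFORE = {"routers": 400, "lr_in_arp_resolve": 671656, "lr_in_ip_input": 22136}
AFTER = {"routers": 500, "lr_in_arp_resolve": 4873, "lr_in_ip_input": 14343}


def flows_per_router(dump, table):
    """Average number of flows in one table per logical router."""
    return dump[table] / dump["routers"]


for table in ("lr_in_arp_resolve", "lr_in_ip_input"):
    before = flows_per_router(BEFORE, table)
    after = flows_per_router(AFTER, table)
    print(f"{table}: {before:.2f} -> {after:.2f} flows per router")
```

For lr_in_arp_resolve that works out to roughly 1,679 flows per router before the change and fewer than 10 after it.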

-- 
  Krzysztof Klimonda
  kklimonda at syntaxhighlighted.com

On Thu, Sep 17, 2020, at 21:14, Krzysztof Klimonda wrote:
> Hi Tony,
> 
> Indeed I forgot to mention that all routers are using the same external 
> network (and subnet) for the external gateway.
> 
> Creating separate external networks per router wouldn't really work for 
> us, and I'm not even quite sure what the setup would look like in that 
> case. 
> 
> -- 
>   Krzysztof Klimonda
>   kklimonda at syntaxhighlighted.com
> 
> On Thu, Sep 17, 2020, at 20:31, Tony Liu wrote:
> > I am trying to reach 5000. The problem I hit is that northd gets
> > stuck translating from NB to SB when connecting routers to the
> > external network.
> > 
> > I assume all your 400 routers connect to the same subnet in that
> > external network. I am trying another approach where one subnet
> > is created for each router in the external network. Might that help
> > to reduce the ARP flows?
> > 
> > Thanks!
> > Tony
> > > -----Original Message-----
> > > From: Krzysztof Klimonda <kklimonda at syntaxhighlighted.com>
> > > Sent: Thursday, September 17, 2020 8:57 AM
> > > To: openstack-discuss at lists.openstack.org
> > > Subject: [neutron][ovn] Logical flow scaling (flow explosion in
> > > lr_in_arp_resolve)
> > > 
> > > Hi,
> > > 
> > > We're running some tests of a Ussuri deployment with the OVN ML2
> > > driver and seeing some worrying numbers of logical flows generated
> > > for our test deployment.
> > > 
> > > As a test, we create 400 routers and 400 private networks, and connect
> > > each network to its own router. We also connect each router to an
> > > external network. After doing that, a dump of logical flows shows almost
> > > 800k logical flows, most of them in the lr_in_arp_resolve table:
> > > 
> > > --8<--8<--8<--
> > > # cat lflows.txt |grep -v Datapath |cut -d'(' -f 2 | cut -d ')' -f1 |sort | uniq -c |sort -n | tail -10
> > >    3264 lr_in_learn_neighbor
> > >    3386 ls_out_port_sec_l2
> > >    4112 lr_in_admission
> > >    4202 ls_in_port_sec_l2
> > >    4898 lr_in_lookup_neighbor
> > >    4900 lr_in_ip_routing
> > >    9144 ls_in_l2_lkup
> > >    9160 ls_in_arp_rsp
> > >   22136 lr_in_ip_input
> > >  671656 lr_in_arp_resolve
> > > #
> > > --8<--8<--8<--
> > > 
> > > ovn: 20.06.2 + patch for SNAT IP ARP reply issue
> > > openvswitch: 2.13.0
> > > neutron: 16.1.0
> > > 
> > > I've seen some discussion of a similar issue on the OVS mailing list:
> > > https://www.mail-archive.com/ovs-discuss@openvswitch.org/msg07014.html -
> > > is this relevant to neutron, and not just kubernetes?
> > > 
> > > --
> > >   Krzysztof Klimonda
> > >   kklimonda at syntaxhighlighted.com
> > 
> >
> 
>
