So just for testing I've applied this patch to our neutron-server:

--8<--8<--8<--
diff --git a/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py b/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py
index 23a841d7a1..41200786f1 100644
--- a/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py
+++ b/neutron/plugins/ml2/drivers/ovn/mech_driver/ovsdb/ovn_client.py
@@ -1141,11 +1141,15 @@ class OVNClient(object):
         enabled = router.get('admin_state_up')
         lrouter_name = utils.ovn_name(router['id'])
         added_gw_port = None
+        options = {
+            "always_learn_from_arp_request": "false",
+            "dynamic_neigh_routers": "true"
+        }
         with self._nb_idl.transaction(check_error=True) as txn:
             txn.add(self._nb_idl.create_lrouter(lrouter_name,
                     external_ids=external_ids,
                     enabled=enabled,
-                    options={}))
+                    options=options))
             # TODO(lucasagomes): add_external_gateway is being only used
             # by the ovn_db_sync.py script, remove it after the database
             # synchronization work
--8<--8<--8<--

and also executed this for each logical router in OVN:

# ovn-nbctl set Logical_Router $router options=dynamic_neigh_routers=true,always_learn_from_arp_request=false

This had a huge impact on both the number of logical flows and the number of OVS flows on chassis nodes:

--8<--8<--8<--
# cat lflows-new.txt |grep -v Datapath |cut -d'(' -f 2 | cut -d ')' -f1 |sort | uniq -c |sort -n | tail -10
   2170 ls_out_port_sec_l2
   2172 lr_in_learn_neighbor
   2666 lr_in_admission
   2690 ls_in_port_sec_l2
   3190 lr_in_ip_routing
   4276 lr_in_lookup_neighbor
   4873 lr_in_arp_resolve
   5864 ls_in_arp_rsp
   5873 ls_in_l2_lkup
  14343 lr_in_ip_input
# ovn-sbctl --timeout=120 lflow-list > lflows-new.txt
--8<--8<--8<--

(and this is with even more routers than before: 500 vs 400).

I'll have to read up on what impact those options have on ARP activity though.

--
Krzysztof Klimonda
kklimonda@syntaxhighlighted.com

On Thu, Sep 17, 2020, at 21:14, Krzysztof Klimonda wrote:
Hi Tony,
Indeed I forgot to mention that all routers are using the same external network (and subnet) for the external gateway.
Creating separate external networks per router wouldn't really work for us, and I'm not even quite sure what the setup would look like in that case.
--
Krzysztof Klimonda
kklimonda@syntaxhighlighted.com
On Thu, Sep 17, 2020, at 20:31, Tony Liu wrote:
I am trying to reach 5000. The problem I hit is that northd gets stuck translating from NB to SB when connecting routers to the external network.
I assume all your 400 routers connect to the same subnet in that external network. I am trying another approach where one subnet is created for each router in the external network. That may help to reduce the ARP flows?
Thanks! Tony
-----Original Message-----
From: Krzysztof Klimonda <kklimonda@syntaxhighlighted.com>
Sent: Thursday, September 17, 2020 8:57 AM
To: openstack-discuss@lists.openstack.org
Subject: [neutron][ovn] Logical flow scaling (flow explosion in lr_in_arp_resolve)
Hi,
We're running some tests of a Ussuri deployment with the OVN ML2 driver and are seeing a worrying number of logical flows generated for our test deployment.
As a test, we create 400 routers and 400 private networks, and connect each network to its own router. We also connect each router to an external network. After doing that, a dump of logical flows shows almost 800k logical flows, most of them in the lr_in_arp_resolve table:
--8<--8<--8<--
# cat lflows.txt |grep -v Datapath |cut -d'(' -f 2 | cut -d ')' -f1 |sort | uniq -c |sort -n | tail -10
   3264 lr_in_learn_neighbor
   3386 ls_out_port_sec_l2
   4112 lr_in_admission
   4202 ls_in_port_sec_l2
   4898 lr_in_lookup_neighbor
   4900 lr_in_ip_routing
   9144 ls_in_l2_lkup
   9160 ls_in_arp_rsp
  22136 lr_in_ip_input
 671656 lr_in_arp_resolve
#
--8<--8<--8<--
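For anyone who would rather not use the shell pipeline, the per-table counts can probably be reproduced with a few lines of Python. This is only a rough sketch: it assumes each flow line in the `ovn-sbctl lflow-list` dump contains `table=N (stage_name)` and that header lines contain the word `Datapath`, which is what the grep/cut pipeline relies on. The function names and the sample lines are made up for illustration and are not taken from a real dump.

```python
import re
from collections import Counter

# Assumed line shape: "  table=12(lr_in_arp_resolve  ), priority=..., match=(...), action=(...)"
STAGE_RE = re.compile(r"table=\d+\s*\(\s*(\w+)\s*\)")


def count_flows_per_stage(lines):
    """Count logical flows per pipeline stage, skipping Datapath header lines."""
    counts = Counter()
    for line in lines:
        if "Datapath" in line:  # same filter as `grep -v Datapath`
            continue
        match = STAGE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def top_stages(counts, n=10):
    """Return the n largest stages in ascending order, like `sort -n | tail -n`."""
    return sorted(counts.items(), key=lambda kv: kv[1])[-n:]


# Hypothetical sample lines, only to show the expected input shape.
SAMPLE = [
    'Datapath: "example-router" (00000000-0000-0000-0000-000000000001)  Pipeline: ingress',
    '  table=0 (lr_in_admission    ), priority=100  , match=(vlan.present), action=(drop;)',
    '  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lrp-a"), action=(next;)',
    '  table=12(lr_in_arp_resolve  ), priority=100  , match=(outport == "lrp-b"), action=(next;)',
]

for stage, count in top_stages(count_flows_per_stage(SAMPLE)):
    print(f"{count:7d} {stage}")
```

To run it against a real dump, pass an open file instead of `SAMPLE`, e.g. `count_flows_per_stage(open("lflows.txt"))`.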
ovn: 20.06.2 + patch for SNAT IP ARP reply issue
openvswitch: 2.13.0
neutron: 16.1.0
I've seen some discussion about a similar issue on the OVS mailing list: https://www.mail-archive.com/ovs-discuss@openvswitch.org/msg07014.html - is this relevant to neutron, and not just kubernetes?
--
Krzysztof Klimonda
kklimonda@syntaxhighlighted.com