> Hi all,
> I'm trying to understand a strange behaviour related to ovn-controller/OVS.
> In my setup I have OVN 21.09 / OVS 2.16 and Ubuntu Xena, and sometimes when a new VM is created, it can reach other VMs via east-west traffic (even on different chassis) but it cannot reach an external network (e.g. the Internet) through the gateway chassis.
> I ran the following trace:
> # ovs-appctl ofproto/trace br-int in_port="93",icmp,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=,nw_dst=,nw_ttl=64
> And I got this output:
> Final flow: recirc_id=0xc157b1,eth,icmp,reg0=0x300,reg11=0xd,reg12=0x10,reg13=0xf,reg14=0x3,reg15=0x2,metadata=0x29,in_port=93,vlan_tci=0x0000,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=,nw_dst=,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0
> Megaflow: recirc_id=0xc157b1,ct_state=+new-est-rel-rpl-inv+trk,ct_label=0/0x1,eth,icmp,in_port=93,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=,nw_dst=,nw_ttl=64,nw_frag=no
> Datapath actions: ct(commit,zone=15,label=0/0x1,nat(src)),set(eth(src=fa:16:3e:ec:7f:dd,dst=00:00:00:00:00:00)),set(ipv4(ttl=63)),userspace(pid=3451843211,controller(reason=1,dont_send=1,continuation=0,recirc_id=12670898,rule_cookie=0x3e26215e,controller_id=0,max_len=65535))
> It seems the datapath is sending the packet to the controller, and I do not understand why.
> So I ran an ovn-controller recompute (ovn-appctl -t ovn-controller recompute) on the chassis hosting the VM to see whether it would change the behaviour. Afterwards the trace succeeded and the VM started to communicate with the Internet normally:
> Final flow: recirc_id=0x2,eth,icmp,reg0=0x300,reg11=0xd,reg12=0x10,reg13=0xf,reg14=0x3,reg15=0x2,metadata=0x29,in_port=93,vlan_tci=0x0000,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=,nw_dst=,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0
> Megaflow: recirc_id=0x2,ct_state=+new-est-rel-rpl-inv+trk,ct_label=0/0x1,eth,icmp,tun_id=0/0xffffff,tun_metadata0=NP,in_port=93,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=,nw_dst=,nw_ecn=0,nw_ttl=64,nw_frag=no
> Datapath actions: ct(commit,zone=15,label=0/0x1,nat(src)),set(tunnel(tun_id=0x2a,dst=10.X6.X3.133,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x30002}),flags(df|csum|key))),set(eth(src=fa:16:3e:ec:7f:dd,dst=00:00:5e:00:04:00)),set(ipv4(ttl=63)),2
> The datapath action now uses the tunnel to the gateway chassis.

This sounds like a bug in ovn-controller to me. The fact that it
worked after a recompute, which forces ovn-controller to recalculate
all flows from scratch, tells me that there may be a bug in the
"incremental processing" mechanism (which updates flows based on
deltas instead of recalculating everything).
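
If you want to pin down which flows the incremental path missed, one
option is to dump the br-int flows before and after the recompute and
compare them. Below is a minimal sketch of such a comparison. The
function names (normalize, diff_flows) are my own, and it assumes the
usual text format printed by "ovs-ofctl dump-flows", so adjust as
needed.

```python
import re

# Per-flow counters change on every dump and say nothing about whether
# a flow is present or missing, so they are stripped before comparing.
_VOLATILE = re.compile(
    r"\b(?:duration|n_packets|n_bytes|idle_age|hard_age)=[^,\s]+,?\s*"
)

# Reply header lines printed at the top of a dump (assumed prefixes).
_HEADERS = ("NXST_FLOW", "OFPST_FLOW")


def normalize(dump):
    """Return the set of flows in a dump with volatile counters removed."""
    flows = set()
    for line in dump.splitlines():
        line = line.strip()
        if not line or line.startswith(_HEADERS):
            continue
        flows.add(_VOLATILE.sub("", line).strip())
    return flows


def diff_flows(before, after):
    """Return (added, removed) flows between two dump-flows outputs."""
    old, new = normalize(before), normalize(after)
    return sorted(new - old), sorted(old - new)
```

To use it, save "ovs-ofctl dump-flows br-int" to a file on the
affected chassis, run the recompute, dump again, and pass both file
contents to diff_flows(). The flows reported as added are the ones
that were missing before the recompute, which would be useful
information for a bug report.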

> It only happens with new VMs, and only sometimes. After running the recompute on the chassis, I created additional VMs and the issue did not occur.
> On my chassis I have also enabled these parameters:
> ovn-monitor-all="true"
> ovn-openflow-probe-interval="0"
> ovn-remote-probe-interval="180000"
> I did some troubleshooting and I always see this error (ovs-vswitchd) when a VM is created on a chassis:
> 2022-06-23T11:47:08.385Z|07907|bridge|WARN|could not open network device tap8a43df0c-fd (No such device)
> 2022-06-23T11:47:09.282Z|07908|bridge|INFO|bridge br-int: added interface tap8a43df0c-fd on port 51
> 2022-06-23T11:47:09.645Z|07909|bridge|INFO|bridge br-int: added interface tap3200bf1c-20 on port 52
> 2022-06-23T11:47:19.329Z|07911|connmgr|INFO|br-int<->unix#1468: 430 flow_mods in the 7 s starting 10 s ago (410 adds, 20 deletes)

Hmm... At first glance this does not look related to the issue you
are experiencing, but core OVN or OVS experts may know better.

> This commit http://patchwork.ozlabs.org/project/ovn/patch/1608197000-637-1-git-send-email-dceara@redhat.com/ solved something similar to my issue. It seems ovs-vswitchd is missing some flows, and running the recompute fixes that.

Right yeah, we've seen a few bugs related to the incremental
processing mechanism in the past. Things are much more stable nowadays
but you may be hitting a new one.

> So, to work around this issue, I am currently testing running the recompute from a libvirt hook when a VM reaches the "started" status.
> Do you know whether this behaviour could be caused by a bug?
> Regards,
> Tiago Pires
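
If you do go the hook route as a stopgap, here is a rough sketch of
what the qemu hook could look like. It assumes libvirt's documented
hook calling convention (guest name as the first argument, operation
as the second) and that ovn-appctl is on the PATH of the hook's
environment. The 30 second timeout is an arbitrary value I picked.
Whether "started" fires late enough, i.e. after ovn-controller has
seen the port binding, is something you would need to verify.

```python
#!/usr/bin/env python3
import subprocess
import sys

# Same command as the manual workaround from this thread.
RECOMPUTE_CMD = ["ovn-appctl", "-t", "ovn-controller", "recompute"]


def should_recompute(argv):
    """True when libvirt invoked the hook for the 'started' operation."""
    # argv layout: [script, guest_name, operation, sub_operation, extra]
    return len(argv) >= 3 and argv[2] == "started"


def main(argv, run=subprocess.run):
    if not should_recompute(argv):
        return 0
    try:
        # Bounded wait (assumed value) so a hung ovn-controller cannot
        # stall the hook indefinitely.
        run(RECOMPUTE_CMD, check=True, timeout=30)
    except (OSError, subprocess.SubprocessError) as exc:
        print(f"recompute failed for {argv[1]}: {exc}", file=sys.stderr)
    # Always report success: a failed workaround should not affect the VM.
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Keep in mind that a full recompute on every VM start can be expensive
on a busy chassis, so this is only a workaround until the underlying
bug is found.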
