ovn-controller/OVS strange behaviour
Hi all,

I'm trying to understand some strange behaviour related to ovn-controller/OVS. My setup is OVN 21.09 / OVS 2.16 on Ubuntu with OpenStack Xena, and sometimes when a new VM is created, the VM can reach other VMs with east-west traffic (even on different chassis) but it cannot reach an external network (e.g. the Internet) through the chassis gateway. I ran the following trace:

# ovs-appctl ofproto/trace br-int in_port="93",icmp,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=192.168.40.140,nw_dst=8.8.8.8,nw_ttl=64

And I got this output:

Final flow: recirc_id=0xc157b1,eth,icmp,reg0=0x300,reg11=0xd,reg12=0x10,reg13=0xf,reg14=0x3,reg15=0x2,metadata=0x29,in_port=93,vlan_tci=0x0000,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=192.168.40.140,nw_dst=8.8.8.8,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0
Megaflow: recirc_id=0xc157b1,ct_state=+new-est-rel-rpl-inv+trk,ct_label=0/0x1,eth,icmp,in_port=93,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=192.168.40.128/26,nw_dst=8.0.0.0/7,nw_ttl=64,nw_frag=no
Datapath actions: ct(commit,zone=15,label=0/0x1,nat(src)),set(eth(src=fa:16:3e:ec:7f:dd,dst=00:00:00:00:00:00)),set(ipv4(ttl=63)),userspace(pid=3451843211,controller(reason=1,dont_send=1,continuation=0,recirc_id=12670898,rule_cookie=0x3e26215e,controller_id=0,max_len=65535))

It seems the datapath is sending the packet to the controller and I did not understand why. So I ran an ovn-controller recompute (ovn-appctl -t ovn-controller recompute) on the chassis where the VM is placed to check whether it would change the behaviour. After that I could trace the packet successfully and the VM started to communicate with the Internet normally:

Final flow: recirc_id=0x2,eth,icmp,reg0=0x300,reg11=0xd,reg12=0x10,reg13=0xf,reg14=0x3,reg15=0x2,metadata=0x29,in_port=93,vlan_tci=0x0000,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=192.168.40.140,nw_dst=8.8.8.8,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0
Megaflow: recirc_id=0x2,ct_state=+new-est-rel-rpl-inv+trk,ct_label=0/0x1,eth,icmp,tun_id=0/0xffffff,tun_metadata0=NP,in_port=93,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=192.168.40.128/26,nw_dst=8.0.0.0/7,nw_ecn=0,nw_ttl=64,nw_frag=no
Datapath actions: ct(commit,zone=15,label=0/0x1,nat(src)),set(tunnel(tun_id=0x2a,dst=10.X6.X3.133,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x30002}),flags(df|csum|key))),set(eth(src=fa:16:3e:ec:7f:dd,dst=00:00:5e:00:04:00)),set(ipv4(ttl=63)),2

This time the datapath action uses the tunnel towards the chassis gateway.

The issue happens only with new VMs, and only sometimes. After running the recompute on the chassis, I created additional VMs and the issue did not happen again.

On my chassis I have also enabled these parameters:

ovn-monitor-all="true"
ovn-openflow-probe-interval="0"
ovn-remote-probe-interval="180000"
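These are the standard ovn-controller options stored as external-ids on the local Open_vSwitch record, set with something along the lines of:

# ovs-vsctl set open . external-ids:ovn-monitor-all=true
# ovs-vsctl set open . external-ids:ovn-openflow-probe-interval=0
# ovs-vsctl set open . external-ids:ovn-remote-probe-interval=180000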
I did some troubleshooting and I always see this error in the ovs-vswitchd log when a VM is created on a chassis:

2022-06-23T11:47:08.385Z|07907|bridge|WARN|could not open network device tap8a43df0c-fd (No such device)
2022-06-23T11:47:09.282Z|07908|bridge|INFO|bridge br-int: added interface tap8a43df0c-fd on port 51
2022-06-23T11:47:09.645Z|07909|bridge|INFO|bridge br-int: added interface tap3200bf1c-20 on port 52
2022-06-23T11:47:19.329Z|07911|connmgr|INFO|br-int<->unix#1468: 430 flow_mods in the 7 s starting 10 s ago (410 adds, 20 deletes)

This commit http://patchwork.ozlabs.org/project/ovn/patch/1608197000-637-1-git-send-emai... solved something similar to my issue. It seems ovs-vswitchd is missing some flows, and running the recompute fixes it. So, to avoid this issue, I'm currently testing running the recompute through a libvirt hook when a VM reaches the "started" status (a rough sketch of the hook is at the end of this mail).

Do you know if this behaviour could be bug related?

Regards,
Tiago Pires
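The hook I'm experimenting with is roughly along these lines (this assumes the standard /etc/libvirt/hooks/qemu hook interface; libvirtd only picks up a newly created hook script after a restart):

#!/bin/sh
# /etc/libvirt/hooks/qemu -- called by libvirtd as: qemu <guest_name> <operation> <sub_operation> <extra>
guest="$1"
operation="$2"

# When a VM reaches the "started" state, force ovn-controller to recompute all flows.
if [ "$operation" = "started" ]; then
    ovn-appctl -t ovn-controller recompute
fi

exit 0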
Hi,

On Thu, Jun 23, 2022 at 9:30 PM Tiago Pires <tiagohp@gmail.com> wrote:
Hi all,
I'm trying to understand some strange behaviour related to ovn-controller/OVS. My setup is OVN 21.09 / OVS 2.16 on Ubuntu with OpenStack Xena, and sometimes when a new VM is created, the VM can reach other VMs with east-west traffic (even on different chassis) but it cannot reach an external network (e.g. the Internet) through the chassis gateway. I ran the following trace:

# ovs-appctl ofproto/trace br-int in_port="93",icmp,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=192.168.40.140,nw_dst=8.8.8.8,nw_ttl=64
And I got this output:
Final flow: recirc_id=0xc157b1,eth,icmp,reg0=0x300,reg11=0xd,reg12=0x10,reg13=0xf,reg14=0x3,reg15=0x2,metadata=0x29,in_port=93,vlan_tci=0x0000,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=192.168.40.140,nw_dst=8.8.8.8,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0
Megaflow: recirc_id=0xc157b1,ct_state=+new-est-rel-rpl-inv+trk,ct_label=0/0x1,eth,icmp,in_port=93,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=192.168.40.128/26,nw_dst=8.0.0.0/7,nw_ttl=64,nw_frag=no
Datapath actions: ct(commit,zone=15,label=0/0x1,nat(src)),set(eth(src=fa:16:3e:ec:7f:dd,dst=00:00:00:00:00:00)),set(ipv4(ttl=63)),userspace(pid=3451843211,controller(reason=1,dont_send=1,continuation=0,recirc_id=12670898,rule_cookie=0x3e26215e,controller_id=0,max_len=65535))

It seems the datapath is sending the packet to the controller and I did not understand why.
So I ran an ovn-controller recompute (ovn-appctl -t ovn-controller recompute) on the chassis where the VM is placed to check whether it would change the behaviour. After that I could trace the packet successfully and the VM started to communicate with the Internet normally:
Final flow: recirc_id=0x2,eth,icmp,reg0=0x300,reg11=0xd,reg12=0x10,reg13=0xf,reg14=0x3,reg15=0x2,metadata=0x29,in_port=93,vlan_tci=0x0000,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=192.168.40.140,nw_dst=8.8.8.8,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0
Megaflow: recirc_id=0x2,ct_state=+new-est-rel-rpl-inv+trk,ct_label=0/0x1,eth,icmp,tun_id=0/0xffffff,tun_metadata0=NP,in_port=93,dl_src=fa:16:3e:26:34:ef,dl_dst=fa:16:3e:65:68:6e,nw_src=192.168.40.128/26,nw_dst=8.0.0.0/7,nw_ecn=0,nw_ttl=64,nw_frag=no
Datapath actions: ct(commit,zone=15,label=0/0x1,nat(src)),set(tunnel(tun_id=0x2a,dst=10.X6.X3.133,ttl=64,tp_dst=6081,geneve({class=0x102,type=0x80,len=4,0x30002}),flags(df|csum|key))),set(eth(src=fa:16:3e:ec:7f:dd,dst=00:00:5e:00:04:00)),set(ipv4(ttl=63)),2

This time the datapath action uses the tunnel towards the chassis gateway.
This sounds like a bug in ovn-controller to me. The fact that it worked after a recompute, which forces ovn-controller to recalculate all flows, tells me that there may be a bug in the "incremental processing" mechanism (the mechanism that calculates flow changes based on deltas).
The issue happens only with new VMs, and only sometimes. After running the recompute on the chassis, I created additional VMs and the issue did not happen again.
On my chassis I have also enabled these parameters: ovn-monitor-all="true", ovn-openflow-probe-interval="0", ovn-remote-probe-interval="180000"
I did some troubleshooting and I always see this error in the ovs-vswitchd log when a VM is created on a chassis:

2022-06-23T11:47:08.385Z|07907|bridge|WARN|could not open network device tap8a43df0c-fd (No such device)
2022-06-23T11:47:09.282Z|07908|bridge|INFO|bridge br-int: added interface tap8a43df0c-fd on port 51
2022-06-23T11:47:09.645Z|07909|bridge|INFO|bridge br-int: added interface tap3200bf1c-20 on port 52
2022-06-23T11:47:19.329Z|07911|connmgr|INFO|br-int<->unix#1468: 430 flow_mods in the 7 s starting 10 s ago (410 adds, 20 deletes)
Hmm... At first glance it does not look related to the issue you are experiencing, but core OVN or OVS experts may know better.
This commit http://patchwork.ozlabs.org/project/ovn/patch/1608197000-637-1-git-send-emai... solved something similar to my issue. It seems ovs-vswitchd is missing some flows, and running the recompute fixes it.
Right, yeah, we've seen a few bugs related to the incremental processing mechanism in the past. Things are much more stable nowadays, but you may be hitting a new one.
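If you want to collect more data the next time it happens, one option (just a suggestion, assuming the default log destinations) is to bump the ovn-controller log level on the affected chassis before creating the VM and grab the log around that time:

# ovn-appctl -t ovn-controller vlog/set file:dbg

and set it back with "vlog/set file:info" afterwards, since debug logging is quite verbose.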
So, in order to avoid this issue, I'm currently testing running the recompute through a libvirt hook when a VM reaches the "started" status.
Do you know if this behaviour could be bug related?
Regards,
Tiago Pires
Hi,

I was going to forward this thread to the ovs-discuss mailing list (the list that the core OVN folks watch), but I see that you already posted it there and already got some suggestions. I think we should continue the discussion over there. From an OpenStack perspective I don't think there's much we can do here, because clearly ML2/OVN is updating the OVN NB DB accordingly, but ovn-controller is not picking up the changes and installing the appropriate flows.
participants (2)
- Lucas Alvares Gomes
- Tiago Pires