SNAT failure with OVN under Antelope
Good morning. I'm having a problem with SNAT routing under OVN, but I'm not sure whether something is misconfigured or my understanding of how OVN is architected is wrong.

I've built a Zed cloud, since upgraded to Antelope, using the Neutron manual install method here: https://docs.openstack.org/neutron/latest/install/ovn/manual_install.html I'm using a multi-tenant configuration with Geneve, and the flat provider network is present on each hypervisor. Each hypervisor is connected to the physical provider network along with the tenant network and is tagged as an external chassis under OVN. br-int exists, as does br-provider:

ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw

In most cases, distributed FIP-based connectivity is working without issue, but VMs without a FIP are not always able to use the SNAT services of the tenant network's router.

Scenario: an internal network named cs3319 with subnet 172.31.100.0/23 has a router named cs3319_router with an external gateway set (SNAT enabled). This network has 3 VMs:
- #1 has a FIP and can be accessed externally
- #2 has no FIP, can be accessed via VM1 and can access external resources via SNAT (i.e. OS repos, DNS, etc.)
- #3 has no FIP, can be accessed via VM1 but has no external SNAT connectivity

From what I can tell, the chassis config is correct; compute05 is the hypervisor and the faulty VM has a port binding on this hypervisor:

ovn-sbctl show
...
Chassis "8e0fa17c-e480-4b60-9015-bd8833412561"
    hostname: compute05.cloud.sci.uwo.ca
    Encap geneve
        ip: "192.168.0.105"
        options: {csum="true"}
    Port_Binding "7a5257eb-caea-45bf-b48c-620c5dff4b39"
    Port_Binding "50e16602-78e6-429b-8c2f-e7e838ece1b4"
    Port_Binding "f121c9f4-c3fe-4ea9-b754-a809be95a3fd"

The router has the candidate gateways and the SNAT set:

ovn-nbctl show 92df19a7-4ebe-43ea-b233-f4e9f5a46e7c
router 92df19a7-4ebe-43ea-b233-f4e9f5a46e7c (neutron-389439b5-07f8-44b6-a35b-c76651b48be5) (aka cs3319_public_router)
    port lrp-44ae1753-845e-4822-9e3d-a41e0469e257
        mac: "fa:16:3e:9a:db:d8"
        networks: ["129.100.21.94/22"]
        gateway chassis: [5c039d38-70b2-4ee6-9df1-596f82c68106 99facd23-ad17-4b68-a8c2-1ff6da15ac5f 1694116c-6d30-4c31-b5ea-0f411878316e 2a4bbaf9-228a-462e-8970-0cdbf59086e6 9332c61b-93e1-4a70-9547-701a014bfd98]
    port lrp-509bba37-fa06-42d6-9210-2342045490db
        mac: "fa:16:3e:ff:0f:3b"
        networks: ["172.31.100.1/23"]
    nat 11e0565a-4695-4f67-b4ee-101f1b1b9a4f
        external ip: "129.100.21.94"
        logical ip: "172.31.100.0/23"
        type: "snat"
    nat 21e4be02-d81c-46e8-8fa8-3f94edb4aed1
        external ip: "129.100.21.87"
        logical ip: "172.31.100.49"
        type: "dnat_and_snat"

Each network agent on the hypervisors shows the OVN controller up:

OVN Controller Gateway agent | compute05.cloud.sci.uwo.ca | | :-) | UP | ovn-controller

The OVS vswitch on the hypervisor looks correct AFAICT, and the OVN ports' BFD status shows forwarding to the other hypervisors, i.e.:

Port ovn-2a4bba-0
    Interface ovn-2a4bba-0
        type: geneve
        options: {csum="true", key=flow, remote_ip="192.168.0.106"}
        bfd_status: {diagnostic="No Diagnostic", flap_count="1", forwarding="true", remote_diagnostic="No Diagnostic", remote_state=up, state=up}

Any advice on where to look would be appreciated.

PS. Version info:
Neutron 22.0.0-1
OVN 22.12

neutron options:
enable_distributed_floating_ip = true
ovn_l3_scheduler = leastloaded

Thanks
Gary

--
Gary Molenkamp
Science Technology Services
Systems/Cloud Administrator
University of Western Ontario
molenkam@uwo.ca  http://sts.sci.uwo.ca
(519) 661-2111 x86882  (519) 661-3566
Hello Gary:

If you have 2 VMs in the same network, both without FIPs, and one is working but not the other, I would simply compare the Neutron and OVN resources of both ports 1:1 (I assume each VM has a single port). I would start with the OVN NAT entries. You should also check the VM's internal routing table.

Apart from that, you should also trace the VM traffic to find out where it is dropped. Maybe the traffic is sent out correctly through the GW port but never gets back (in that case, check your underlying network configuration), or, as you commented, SNAT is not working for this specific port. A rough sketch of the commands I would start with is included below the quoted mail.

Regards.

On Tue, Jun 27, 2023 at 1:38 PM Gary Molenkamp <molenkam@uwo.ca> wrote:
Good morning, I'm having a problem with snat routing under OVN but I'm not sure if something is mis-configured or just my understanding of how OVN is architected is wrong.
I've built a Zed cloud, since upgraded to Antelope, using the Neutron Manual install method here: https://docs.openstack.org/neutron/latest/install/ovn/manual_install.html I'm using a multi-tenant configuration using geneve and the flat provider network is present on each hypervisor. Each hypervisor is connected to the physical provider network, along with the tenant network and is tagged as an external chassis under OVN. br-int exists, as does br-provider ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw
For most cases, distributed FIP based connectivity is working without issue, but I'm having an issue where VMs without a FIP are not always able to use the SNAT services of the tenant network router. Scenario: Internal network named cs3319: with subnet 172.31.100.0/23 Has a router named cs3319_router with external gateway set (snat enabled)
This network has 3 vms: - #1 has a FIP and can be accessed externally - #2 has no FIP, can be accessed via VM1 and can access external resources via SNAT (ie OS repos, DNS, etc) - #3 has no FIP, can be accessed via VM1 but has no external SNAT connectivity
From what I can tell, the chassis config is correct, compute05 is the hypervisor and the faulty VM has a port binding on this hypervisor:
ovn-sbctl show ... Chassis "8e0fa17c-e480-4b60-9015-bd8833412561" hostname: compute05.cloud.sci.uwo.ca Encap geneve ip: "192.168.0.105" options: {csum="true"} Port_Binding "7a5257eb-caea-45bf-b48c-620c5dff4b39" Port_Binding "50e16602-78e6-429b-8c2f-e7e838ece1b4" Port_Binding "f121c9f4-c3fe-4ea9-b754-a809be95a3fd"
The router has the candidate gateways, and the snat set:
ovn-nbctl show 92df19a7-4ebe-43ea-b233-f4e9f5a46e7c router 92df19a7-4ebe-43ea-b233-f4e9f5a46e7c (neutron-389439b5-07f8-44b6-a35b-c76651b48be5) (aka cs3319_public_router) port lrp-44ae1753-845e-4822-9e3d-a41e0469e257 mac: "fa:16:3e:9a:db:d8" networks: ["129.100.21.94/22"] gateway chassis: [5c039d38-70b2-4ee6-9df1-596f82c68106 99facd23-ad17-4b68-a8c2-1ff6da15ac5f 1694116c-6d30-4c31-b5ea-0f411878316e 2a4bbaf9-228a-462e-8970-0cdbf59086e6 9332c61b-93e1-4a70-9547-701a014bfd98] port lrp-509bba37-fa06-42d6-9210-2342045490db mac: "fa:16:3e:ff:0f:3b" networks: ["172.31.100.1/23"] nat 11e0565a-4695-4f67-b4ee-101f1b1b9a4f external ip: "129.100.21.94" logical ip: "172.31.100.0/23" type: "snat" nat 21e4be02-d81c-46e8-8fa8-3f94edb4aed1 external ip: "129.100.21.87" logical ip: "172.31.100.49" type: "dnat_and_snat"
Each network agent on the hypervisors shows the ovn controller up : OVN Controller Gateway agent | compute05.cloud.sci.uwo.ca | | :-) | UP | ovn-controller
The ovs vswitch on the hypervisor looks correct afaict and ovn ports bfd status are all forwarding to other hypervisors. ie: Port ovn-2a4bba-0 Interface ovn-2a4bba-0 type: geneve options: {csum="true", key=flow, remote_ip="192.168.0.106"} bfd_status: {diagnostic="No Diagnostic", flap_count="1", forwarding="true", remote_diagnostic="No Diagnostic", remote_state=up, state=up}
Any advice on where to look would be appreciated.
PS. Version info: Neutron 22.0.0-1 OVN 22.12
neutron options: enable_distributed_floating_ip = true ovn_l3_scheduler = leastloaded
Thanks Gary
-- Gary Molenkamp Science Technology Services Systems/Cloud Administrator University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
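As a rough sketch of what I mean (the port IDs, MAC, source IP, and destination IP below are placeholders you would substitute with the values of your two VMs):

# Compare the Neutron view of the working and non-working ports
openstack port show <port-id-of-vm2>
openstack port show <port-id-of-vm3>

# Compare the NAT entries on the OVN router; both VMs should only be covered by the 172.31.100.0/23 snat rule
ovn-nbctl lr-nat-list neutron-389439b5-07f8-44b6-a35b-c76651b48be5

# Simulate the outbound traffic of each VM and compare the two traces
ovn-trace <tenant network> 'inport == "<port-id>" && eth.src == <vm mac> && ip4.src == <vm ip> && ip4.dst == <external ip> && ip.ttl == 64'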
Hi Gary,

On top of what Rodolfo said:

On Tue, Jun 27, 2023 at 5:15 PM Gary Molenkamp <molenkam@uwo.ca> wrote:
Good morning, I'm having a problem with snat routing under OVN but I'm not sure if something is mis-configured or just my understanding of how OVN is architected is wrong.
I've built a Zed cloud, since upgraded to Antelope, using the Neutron Manual install method here: https://docs.openstack.org/neutron/latest/install/ovn/manual_install.html I'm using a multi-tenant configuration using geneve and the flat provider network is present on each hypervisor. Each hypervisor is connected to the physical provider network, along with the tenant network and is tagged as an external chassis under OVN. br-int exists, as does br-provider ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw
For most cases, distributed FIP based connectivity is working without issue, but I'm having an issue where VMs without a FIP are not always able to use the SNAT services of the tenant network router. Scenario: Internal network named cs3319: with subnet 172.31.100.0/23 Has a router named cs3319_router with external gateway set (snat enabled)
This network has 3 vms: - #1 has a FIP and can be accessed externally - #2 has no FIP, can be accessed via VM1 and can access external resources via SNAT (ie OS repos, DNS, etc) - #3 has no FIP, can be accessed via VM1 but has no external SNAT connectivity
Any specific reason to enable the gateway on compute nodes? Generally it's recommended to use controller/network nodes as gateways. What's your env (number of controller, network, and compute nodes)?

Considering it works for some VMs but not for others, the above point about enable-chassis-as-gw could be related. Is the working VM hosted on compute05 or some other compute node? Where is the gateway router port scheduled? (You can check ovn-sbctl show for cr-lrp-<router gateway port id>.)
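For example (the gateway port ID here is taken from your ovn-nbctl output; the grep is just a quick way to see which chassis the cr-lrp landed on):

ovn-sbctl find Port_Binding logical_port=cr-lrp-44ae1753-845e-4822-9e3d-a41e0469e257
# or
ovn-sbctl show | grep -E "Chassis|hostname|cr-lrp"

The chassis that the cr-lrp port shows up under is the node currently doing the SNAT for that router.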
From what I can tell, the chassis config is correct, compute05 is the hypervisor and the faulty VM has a port binding on this hypervisor:
ovn-sbctl show ... Chassis "8e0fa17c-e480-4b60-9015-bd8833412561" hostname: compute05.cloud.sci.uwo.ca Encap geneve ip: "192.168.0.105" options: {csum="true"} Port_Binding "7a5257eb-caea-45bf-b48c-620c5dff4b39" Port_Binding "50e16602-78e6-429b-8c2f-e7e838ece1b4" Port_Binding "f121c9f4-c3fe-4ea9-b754-a809be95a3fd"
The router has the candidate gateways, and the snat set:
ovn-nbctl show 92df19a7-4ebe-43ea-b233-f4e9f5a46e7c router 92df19a7-4ebe-43ea-b233-f4e9f5a46e7c (neutron-389439b5-07f8-44b6-a35b-c76651b48be5) (aka cs3319_public_router) port lrp-44ae1753-845e-4822-9e3d-a41e0469e257 mac: "fa:16:3e:9a:db:d8" networks: ["129.100.21.94/22"] gateway chassis: [5c039d38-70b2-4ee6-9df1-596f82c68106 99facd23-ad17-4b68-a8c2-1ff6da15ac5f 1694116c-6d30-4c31-b5ea-0f411878316e 2a4bbaf9-228a-462e-8970-0cdbf59086e6 9332c61b-93e1-4a70-9547-701a014bfd98] port lrp-509bba37-fa06-42d6-9210-2342045490db mac: "fa:16:3e:ff:0f:3b" networks: ["172.31.100.1/23"] nat 11e0565a-4695-4f67-b4ee-101f1b1b9a4f external ip: "129.100.21.94" logical ip: "172.31.100.0/23" type: "snat" nat 21e4be02-d81c-46e8-8fa8-3f94edb4aed1 external ip: "129.100.21.87" logical ip: "172.31.100.49" type: "dnat_and_snat"
Each network agent on the hypervisors shows the ovn controller up : OVN Controller Gateway agent | compute05.cloud.sci.uwo.ca | | :-) | UP | ovn-controller
The ovs vswitch on the hypervisor looks correct afaict and ovn ports bfd status are all forwarding to other hypervisors. ie: Port ovn-2a4bba-0 Interface ovn-2a4bba-0 type: geneve options: {csum="true", key=flow, remote_ip="192.168.0.106"} bfd_status: {diagnostic="No Diagnostic", flap_count="1", forwarding="true", remote_diagnostic="No Diagnostic", remote_state=up, state=up}
Any advice on where to look would be appreciated.
I have seen MTU-specific issues in the past; it would be good to rule out any MTU issue between the working and non-working cases.
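A quick way to check that (a sketch; I am assuming a tenant MTU of 1442, which is typical for a Geneve overlay on a 1500-byte underlay, adjust the size to whatever your network actually uses):

# from inside both the working and the non-working VM
ping -M do -s 1414 <external ip>    # 1414 bytes of payload + 28 bytes of headers = 1442
ping -s 56 <external ip>            # small packet for comparison

If the small ping gets out but the large one does not, it is most likely an MTU problem somewhere on the path.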
PS. Version info:
Neutron 22.0.0-1 OVN 22.12
neutron options: enable_distributed_floating_ip = true ovn_l3_scheduler = leastloaded
Thanks Gary
-- Gary Molenkamp Science Technology Services Systems/Cloud Administrator University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
Thanks and Regards
Yatin Karel
Hi Gary, Em ter., 27 de jun. de 2023 às 11:47, Yatin Karel <ykarel@redhat.com> escreveu:
Hi Gary,
On top what Rodolfo said On Tue, Jun 27, 2023 at 5:15 PM Gary Molenkamp <molenkam@uwo.ca> wrote:
Good morning, I'm having a problem with snat routing under OVN but I'm not sure if something is mis-configured or just my understanding of how OVN is architected is wrong.
I've built a Zed cloud, since upgraded to Antelope, using the Neutron Manual install method here: https://docs.openstack.org/neutron/latest/install/ovn/manual_install.html I'm using a multi-tenant configuration using geneve and the flat provider network is present on each hypervisor. Each hypervisor is connected to the physical provider network, along with the tenant network and is tagged as an external chassis under OVN. br-int exists, as does br-provider ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw
Any specific reason to enable gateway on compute nodes? Generally it's recommended to use controller/network nodes as gateway. What's your env(number of controllers, network, compute nodes)?
It can make sense to set enable-chassis-as-gw on the compute nodes if you want to use DVR. In that case, you need to map the external bridge (ovs-vsctl set open . external-ids:ovn-bridge-mappings=...); with ansible this is created automatically, but I didn't see any mention of it in the manual installation guide.

The problem is basically that the OVN LRP port may not be on the same chassis as the VM that failed (since the CR-LRP will be scheduled where the first VM of that network is created). The suggestion is to remove enable-chassis-as-gw from the compute nodes to allow the VM to forward traffic via tunneling/Geneve to the chassis where the LRP resides:

ovs-vsctl remove open . external-ids ovn-cms-options="enable-chassis-as-gw"
ovs-vsctl remove open . external-ids ovn-bridge-mappings
ip link set br-provider-name down
ovs-vsctl del-br br-provider-name
systemctl restart ovn-controller
systemctl restart openvswitch-switch
For most cases, distributed FIP based connectivity is working without issue, but I'm having an issue where VMs without a FIP are not always able to use the SNAT services of the tenant network router. Scenario: Internal network named cs3319: with subnet 172.31.100.0/23 Has a router named cs3319_router with external gateway set (snat enabled)
This network has 3 vms: - #1 has a FIP and can be accessed externally - #2 has no FIP, can be accessed via VM1 and can access external resources via SNAT (ie OS repos, DNS, etc) - #3 has no FIP, can be accessed via VM1 but has no external SNAT connectivity
Considering it works for some vm but for some not, the above point for enable-chassis-as-gw could be related. The working vm is hosted on compute05 or some other compute node? Where is the gateway router port scheduled(can check ovn-sbctl show for cr-lrp-<router gateway port id>)?
From what I can tell, the chassis config is correct, compute05 is the hypervisor and the faulty VM has a port binding on this hypervisor:
ovn-sbctl show ... Chassis "8e0fa17c-e480-4b60-9015-bd8833412561" hostname: compute05.cloud.sci.uwo.ca Encap geneve ip: "192.168.0.105" options: {csum="true"} Port_Binding "7a5257eb-caea-45bf-b48c-620c5dff4b39" Port_Binding "50e16602-78e6-429b-8c2f-e7e838ece1b4" Port_Binding "f121c9f4-c3fe-4ea9-b754-a809be95a3fd"
The router has the candidate gateways, and the snat set:
ovn-nbctl show 92df19a7-4ebe-43ea-b233-f4e9f5a46e7c router 92df19a7-4ebe-43ea-b233-f4e9f5a46e7c (neutron-389439b5-07f8-44b6-a35b-c76651b48be5) (aka cs3319_public_router) port lrp-44ae1753-845e-4822-9e3d-a41e0469e257 mac: "fa:16:3e:9a:db:d8" networks: ["129.100.21.94/22"] gateway chassis: [5c039d38-70b2-4ee6-9df1-596f82c68106 99facd23-ad17-4b68-a8c2-1ff6da15ac5f 1694116c-6d30-4c31-b5ea-0f411878316e 2a4bbaf9-228a-462e-8970-0cdbf59086e6 9332c61b-93e1-4a70-9547-701a014bfd98] port lrp-509bba37-fa06-42d6-9210-2342045490db mac: "fa:16:3e:ff:0f:3b" networks: ["172.31.100.1/23"] nat 11e0565a-4695-4f67-b4ee-101f1b1b9a4f external ip: "129.100.21.94" logical ip: "172.31.100.0/23" type: "snat" nat 21e4be02-d81c-46e8-8fa8-3f94edb4aed1 external ip: "129.100.21.87" logical ip: "172.31.100.49" type: "dnat_and_snat"
Each network agent on the hypervisors shows the ovn controller up : OVN Controller Gateway agent | compute05.cloud.sci.uwo.ca | | :-) | UP | ovn-controller
The ovs vswitch on the hypervisor looks correct afaict and ovn ports bfd status are all forwarding to other hypervisors. ie: Port ovn-2a4bba-0 Interface ovn-2a4bba-0 type: geneve options: {csum="true", key=flow, remote_ip="192.168.0.106"} bfd_status: {diagnostic="No Diagnostic", flap_count="1", forwarding="true", remote_diagnostic="No Diagnostic", remote_state=up, state=up}
Any advice on where to look would be appreciated.
I have seen mtu specific issues in the past, would be good to rule out any mtu issue with working and non working cases.
PS. Version info:
Neutron 22.0.0-1 OVN 22.12
neutron options: enable_distributed_floating_ip = true ovn_l3_scheduler = leastloaded
Thanks Gary
-- Gary Molenkamp Science Technology Services Systems/Cloud Administrator University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
Thanks and Regards
Yatin Karel
On 2023-06-27 11:18, Roberto Bartzen Acosta wrote:
Hi Gary,
On Tue, Jun 27, 2023 at 11:47, Yatin Karel <ykarel@redhat.com> wrote:
Hi Gary,
On top what Rodolfo said On Tue, Jun 27, 2023 at 5:15 PM Gary Molenkamp <molenkam@uwo.ca> wrote:
Good morning, I'm having a problem with snat routing under OVN but I'm not sure if something is mis-configured or just my understanding of how OVN is architected is wrong.
I've built a Zed cloud, since upgraded to Antelope, using the Neutron Manual install method here: https://docs.openstack.org/neutron/latest/install/ovn/manual_install.html I'm using a multi-tenant configuration using geneve and the flat provider network is present on each hypervisor. Each hypervisor is connected to the physical provider network, along with the tenant network and is tagged as an external chassis under OVN. br-int exists, as does br-provider ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw
Any specific reason to enable gateway on compute nodes? Generally it's recommended to use controller/network nodes as gateway. What's your env(number of controllers, network, compute nodes)?
Wouldn't it be interesting to enable-chassis-as-gw on the compute nodes, just in case you want to use DVR: If that's the case, you need to map the external bridge (ovs-vsctl set open . external-ids:ovn-bridge-mappings=...) via ansible this is created automatically, but in the manual installation I didn't see any mention of it.
Our intention was to distribute the routing in our OVN cloud to take advantage of DVR, as our provider network is just a tagged VLAN in our physical infrastructure. This avoids requiring dedicated network node(s) and reduces bottlenecks. I had not set up any ovn-bridge-mappings as it was not mentioned in the manual install guide. I will look into it.
The problem is basically that the port of the OVN LRP may not be in the same chassis as the VM that failed (since the CR-LRP will be where the first VM of that network will be created). The suggestion is to remove the enable-chassis-as-gw from the compute nodes to allow the VM to forward traffic via tunneling/Geneve to the chassis where the LRP resides.
I forced a similar VM onto the same chassis as the working VM, and it was able to communicate out. If we do want to keep multiple chassis as gateways, would that be addressed with the ovn-bridge-mappings?
ovs-vsctl remove open . external-ids ovn-cms-options="enable-chassis-as-gw"
ovs-vsctl remove open . external-ids ovn-bridge-mappings
ip link set br-provider-name down
ovs-vsctl del-br br-provider-name
systemctl restart ovn-controller
systemctl restart openvswitch-switch
-- Gary Molenkamp Science Technology Services Systems Administrator University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
On Tue, Jun 27, 2023 at 14:20, Gary Molenkamp <molenkam@uwo.ca> wrote:
On 2023-06-27 11:18, Roberto Bartzen Acosta wrote:
Hi Gary,
On Tue, Jun 27, 2023 at 11:47, Yatin Karel <ykarel@redhat.com> wrote:
Hi Gary,
On top what Rodolfo said On Tue, Jun 27, 2023 at 5:15 PM Gary Molenkamp <molenkam@uwo.ca> wrote:
Good morning, I'm having a problem with snat routing under OVN but I'm not sure if something is mis-configured or just my understanding of how OVN is architected is wrong.
I've built a Zed cloud, since upgraded to Antelope, using the Neutron Manual install method here: https://docs.openstack.org/neutron/latest/install/ovn/manual_install.html I'm using a multi-tenant configuration using geneve and the flat provider network is present on each hypervisor. Each hypervisor is connected to the physical provider network, along with the tenant network and is tagged as an external chassis under OVN. br-int exists, as does br-provider ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw
Any specific reason to enable gateway on compute nodes? Generally it's recommended to use controller/network nodes as gateway. What's your env(number of controllers, network, compute nodes)?
Wouldn't it be interesting to enable-chassis-as-gw on the compute nodes, just in case you want to use DVR: If that's the case, you need to map the external bridge (ovs-vsctl set open . external-ids:ovn-bridge-mappings=...) via ansible this is created automatically, but in the manual installation I didn't see any mention of it.
Our intention was to distribute the routing on our OVN cloud to take advantage of DVR as our provider network is just a tagged vlan in our physical infrastructure. This avoids requiring dedicated network node(s) and fewer bottlenecks. I had not set up any ovn-bridge-mappings as it was not mentioned in the manual install. I will look into it.
The problem is basically that the port of the OVN LRP may not be in the same chassis as the VM that failed (since the CR-LRP will be where the first VM of that network will be created). The suggestion is to remove the enable-chassis-as-gw from the compute nodes to allow the VM to forward traffic via tunneling/Geneve to the chassis where the LRP resides.
I forced a similar VM onto the same chassis as the working VM, and it was able to communicate out. If we do want to keep multiple chassis' as gateways, would that be addressed with the ovn-bridge-mappings?
Verify your ml2 config file:

cat /etc/neutron/plugins/ml2/ml2_conf.ini
[ml2_type_vlan]
network_vlan_ranges = vlan:101:200,vlan:301:400

Note the name used to map the vlan ranges: in this example it is "vlan". On the compute nodes, check whether the external bridge (usually called br-provider) exists, or create it:

ovs-vsctl --no-wait -- --may-exist add-br br-provider -- set bridge br-provider protocols=OpenFlow10,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15
ovs-vsctl --no-wait br-set-external-id br-provider bridge-id br-provider

Set the ovn-bridge-mappings using the network_vlan_ranges name and the OVS external bridge name (it is exactly the same configuration applied on the gw node):

ovs-vsctl set open . external-ids:ovn-bridge-mappings=vlan:br-provider

Don't forget to enable DVR for FIPs...:

vi /etc/neutron/plugins/ml2/ml2_conf.ini
[ovn]
enable_distributed_floating_ip = True
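After that you can confirm that the mapping was actually picked up by the local ovn-controller and published to the Southbound DB (a sketch; in your case the mapping name will be whatever you used in flat_networks/network_vlan_ranges):

ovs-vsctl get open . external-ids:ovn-bridge-mappings
ovn-sbctl list Chassis | grep -E "hostname|bridge-mappings"

The mappings should appear in the chassis record for every node that is supposed to act as a gateway.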
ovs-vsctl remove open . external-ids ovn-cms-options="enable-chassis-as-gw" ovs-vsctl remove open . external-ids ovn-bridge-mappings ip link set br-provider-name down ovs-vsctl del-br br-provider-name systemctl restart ovn-controller systemctl restart openvswitch-switch
-- Gary Molenkamp Science Technology Services Systems Administrator University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
Happy Friday afternoon. I'm still pondering a lack of connectivity in an HA OVN with each compute node acting as a potential gateway chassis.
The problem is basically that the port of the OVN LRP may not be in the same chassis as the VM that failed (since the CR-LRP will be where the first VM of that network will be created). The suggestion is to remove the enable-chassis-as-gw from the compute nodes to allow the VM to forward traffic via tunneling/Geneve to the chassis where the LRP resides.
I forced a similar VM onto the same chassis as the working VM, and it was able to communicate out. If we do want to keep multiple chassis' as gateways, would that be addressed with the ovn-bridge-mappings?
I built a small test cloud to explore this further, as I continue to see the same issue: a VM will only be able to use SNAT outbound if it is on the same chassis as the CR-LRP.

In my test cloud, I have one controller and two compute nodes. The controller only runs the OVN north and south daemons in addition to the neutron server. Each of the two compute nodes is configured as below. On a tenant network I have three VMs:
- #1: cirros VM with FIP
- #2: cirros VM running on compute node 1
- #3: cirros VM running on compute node 2

E/W traffic between VMs in the same tenant network is fine. N/S traffic is fine for the FIP. N/S traffic only works for the VM whose CR-LRP is active on the same chassis. Does anything jump out as a mistake in my understanding as to how this should be working?

Thanks as always,
Gary

on each hypervisor:

/usr/bin/ovs-vsctl set open . external-ids:ovn-remote=tcp:{{ controllerip }}:6642
/usr/bin/ovs-vsctl set open . external-ids:ovn-encap-type=geneve
/usr/bin/ovs-vsctl set open . external-ids:ovn-encap-ip={{ overlaynetip }}
/usr/bin/ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw
/usr/bin/ovs-vsctl add-br br-provider -- set bridge br-provider protocols=OpenFlow10,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15
/usr/bin/ovs-vsctl add-port br-provider {{ provider_nic }}
/usr/bin/ovs-vsctl br-set-external-id provider bridge-id br-provider
/usr/bin/ovs-vsctl set open . external-ids:ovn-bridge-mappings=provider:br-provider

plugin.ini:

[ml2]
mechanism_drivers = ovn
type_drivers = flat,geneve
tenant_network_types = geneve
extension_drivers = port_security
overlay_ip_version = 4

[ml2_type_flat]
flat_networks = provider

[ml2_type_geneve]
vni_ranges = 1:65536
max_header_size = 38

[securitygroup]
enable_security_group = True
firewall_driver = neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver

[ovn]
ovn_nb_connection = tcp:{{controllerip}}:6641
ovn_sb_connection = tcp:{{controllerip}}:6642
ovn_l3_scheduler = leastloaded
ovn_metadata_enabled = True
enable_distributed_floating_ip = true

--
Gary Molenkamp
Science Technology Services
Systems Administrator
University of Western Ontario
molenkam@uwo.ca  http://sts.sci.uwo.ca
(519) 661-2111 x86882  (519) 661-3566
For comparison, I looked at how openstack-ansible sets up OVN and I don't see any major differences, other than O-A configuring a manager for OVS:

ovs-vsctl --id @manager create Manager "target=\ ....

I don't believe this is the point of failure (but feel free to correct me if I'm wrong ;) ).

ovn-trace on both VMs' inports shows the same trace for the working VM and the non-working VM, i.e.:

ovn-trace --db=$SB --ovs default_net 'inport == "f4cbc8c7-e7bf-47f3-9fea-a1663f6eb34d" && eth.src==fa:16:3e:a6:62:8e && ip4.src == 172.31.101.168 && ip4.dst == <provider's gateway IP>'

On 2023-07-07 14:08, Gary Molenkamp wrote:
Happy Friday afternoon.
I'm still pondering a lack of connectivity in an HA OVN with each compute node acting as a potential gateway chassis.
The problem is basically that the port of the OVN LRP may not be in the same chassis as the VM that failed (since the CR-LRP will be where the first VM of that network will be created). The suggestion is to remove the enable-chassis-as-gw from the compute nodes to allow the VM to forward traffic via tunneling/Geneve to the chassis where the LRP resides.
I forced a similar VM onto the same chassis as the working VM, and it was able to communicate out. If we do want to keep multiple chassis' as gateways, would that be addressed with the ovn-bridge-mappings?
I built a small test cloud to explore this further as I continue to see the same issue: A vm will only be able to use SNAT outbound if it is on the same chassis as the CR-LRP.
In my test cloud, I have one controller, and two compute nodes. The controller only runs the north and southd in addition to the neutron server. Each of the two compute nodes is configured as below. On a tenent network I have three VMs: - #1: cirros VM with FIP - #2: cirros VM running on compute node 1 - #3: cirros VM running on compute node 2
E/W traffic between VMs in the same tenent network are fine. N/S traffic is fine for the FIP. N/S traffic only works for the VM whose CR-LRP is active on same chassis. Does anything jump out as a mistake in my understanding at to how this should be working?
Thanks as always, Gary
on each hypervisor:
/usr/bin/ovs-vsctl set open . external-ids:ovn-remote=tcp:{{ controllerip }}:6642 /usr/bin/ovs-vsctl set open . external-ids:ovn-encap-type=geneve /usr/bin/ovs-vsctl set open . external-ids:ovn-encap-ip={{ overlaynetip }} /usr/bin/ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw /usr/bin/ovs-vsctl add-br br-provider -- set bridge br-provider protocols=OpenFlow10,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15 /usr/bin/ovs-vsctl add-port br-provider {{ provider_nic }} /usr/bin/ovs-vsctl br-set-external-id provider bridge-id br-provider /usr/bin/ovs-vsctl set open . external-ids:ovn-bridge-mappings=provider:br-provider
plugin.ini: [ml2] mechanism_drivers = ovn type_drivers = flat,geneve tenant_network_types = geneve extension_drivers = port_security overlay_ip_version = 4 [ml2_type_flat] flat_networks = provider [ml2_type_geneve] vni_ranges = 1:65536 max_header_size = 38 [securitygroup] enable_security_group = True firewall_driver = neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver [ovn] ovn_nb_connection = tcp:{{controllerip}}:6641 ovn_sb_connection = tcp:{{controllerip}}:6642 ovn_l3_scheduler = leastloaded ovn_metadata_enabled = True enable_distributed_floating_ip = true
-- Gary Molenkamp Science Technology Services Systems Administrator University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
-- Gary Molenkamp Science Technology Services Systems Engineer University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
A little progress, but I may be tripping over bug https://bugs.launchpad.net/neutron/+bug/2003455

If I remove the provider bridge from the second hypervisor:

ovs-vsctl remove open . external-ids ovn-cms-options="enable-chassis-as-gw"
ovs-vsctl remove open . external-ids ovn-bridge-mappings
ip link set br-provider down
ovs-vsctl del-br br-provider

and disable enable_distributed_floating_ip, then both VMs using SNAT on each compute server work.

Turning the second chassis back on as a gateway immediately breaks the VM on the second compute server:

ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw
ovs-vsctl add-br br-provider
ovs-vsctl set open . external-ids:ovn-bridge-mappings=provider:br-provider
ovs-vsctl add-port br-provider ens256
systemctl restart ovn-controller openvswitch.service

I am running neutron 22.0.1, but maybe something related?

python3-neutron-22.0.1-1.el9s.noarch
openstack-neutron-common-22.0.1-1.el9s.noarch
openstack-neutron-22.0.1-1.el9s.noarch
openstack-neutron-ml2-22.0.1-1.el9s.noarch
openstack-neutron-openvswitch-22.0.1-1.el9s.noarch
openstack-neutron-ovn-metadata-agent-22.0.1-1.el9s.noarch

On 2023-07-12 10:21, Gary Molenkamp wrote:
For comparison, I looked at how openstack-ansible was setting up OVN and I don't see any major differences other than O-A configures a manager for ovs: ovs-vsctl --id @manager create Manager "target=\ .... I don't believe this is the point of failure (but feel free to correct me if I'm wrong ;) ).
ovn-trace on both VM's inports shows the same trace for the working VM and the non-working VM. ie:
ovn-trace --db=$SB --ovs default_net 'inport == "f4cbc8c7-e7bf-47f3-9fea-a1663f6eb34d" && eth.src==fa:16:3e:a6:62:8e && ip4.src == 172.31.101.168 && ip4.dst == <provider's gateway IP>'
On 2023-07-07 14:08, Gary Molenkamp wrote:
Happy Friday afternoon.
I'm still pondering a lack of connectivity in an HA OVN with each compute node acting as a potential gateway chassis.
The problem is basically that the port of the OVN LRP may not be in the same chassis as the VM that failed (since the CR-LRP will be where the first VM of that network will be created). The suggestion is to remove the enable-chassis-as-gw from the compute nodes to allow the VM to forward traffic via tunneling/Geneve to the chassis where the LRP resides.
I forced a similar VM onto the same chassis as the working VM, and it was able to communicate out. If we do want to keep multiple chassis' as gateways, would that be addressed with the ovn-bridge-mappings?
I built a small test cloud to explore this further as I continue to see the same issue: A vm will only be able to use SNAT outbound if it is on the same chassis as the CR-LRP.
In my test cloud, I have one controller, and two compute nodes. The controller only runs the north and southd in addition to the neutron server. Each of the two compute nodes is configured as below. On a tenent network I have three VMs: - #1: cirros VM with FIP - #2: cirros VM running on compute node 1 - #3: cirros VM running on compute node 2
E/W traffic between VMs in the same tenent network are fine. N/S traffic is fine for the FIP. N/S traffic only works for the VM whose CR-LRP is active on same chassis. Does anything jump out as a mistake in my understanding at to how this should be working?
Thanks as always, Gary
on each hypervisor:
/usr/bin/ovs-vsctl set open . external-ids:ovn-remote=tcp:{{ controllerip }}:6642 /usr/bin/ovs-vsctl set open . external-ids:ovn-encap-type=geneve /usr/bin/ovs-vsctl set open . external-ids:ovn-encap-ip={{ overlaynetip }} /usr/bin/ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw /usr/bin/ovs-vsctl add-br br-provider -- set bridge br-provider protocols=OpenFlow10,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15 /usr/bin/ovs-vsctl add-port br-provider {{ provider_nic }} /usr/bin/ovs-vsctl br-set-external-id provider bridge-id br-provider /usr/bin/ovs-vsctl set open . external-ids:ovn-bridge-mappings=provider:br-provider
plugin.ini: [ml2] mechanism_drivers = ovn type_drivers = flat,geneve tenant_network_types = geneve extension_drivers = port_security overlay_ip_version = 4 [ml2_type_flat] flat_networks = provider [ml2_type_geneve] vni_ranges = 1:65536 max_header_size = 38 [securitygroup] enable_security_group = True firewall_driver = neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver [ovn] ovn_nb_connection = tcp:{{controllerip}}:6641 ovn_sb_connection = tcp:{{controllerip}}:6642 ovn_l3_scheduler = leastloaded ovn_metadata_enabled = True enable_distributed_floating_ip = true
-- Gary Molenkamp Science Technology Services Systems Administrator University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
-- Gary Molenkamp Science Technology Services Systems Engineer University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
-- Gary Molenkamp Science Technology Services Systems Engineer University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
Hi Gary, On Wed, Jul 12, 2023 at 9:22 PM Gary Molenkamp <molenkam@uwo.ca> wrote:
A little progress, but I may be tripping over bug https://bugs.launchpad.net/neutron/+bug/2003455
That bug was mostly about vlan provider networks, but you mentioned you are using geneve and flat networks, so this might not be related.
Multiple components are involved, so it would be difficult to narrow it down here without more details; functionality-wise it should just work (I checked in my Train OVN environment and it worked fine). So I think it would be best to start with a bug report at https://bugs.launchpad.net/neutron/ with details (after reverting the env to its previous state: bridges, ovn-cms options configured, and DVR enabled). Good to include details like:

- Environment details:
  - Number of controller and compute nodes
  - Whether nodes are virtual or physical
  - Deployment tool used, Operating System
  - Neutron version
  - OVN/OVS versions
- ovn-controller logs from the compute and controller node
- OVN Northbound and Southbound DB files from the controller node and ovs conf.db from compute nodes
- Output of the resources involved:
  - openstack network agent list
  - openstack server list --long
  - openstack port list --router <router id>
- Reproduction steps along with output from the operations (both with good and bad VMs)
- Output of the below commands from controller and compute nodes:
  - iptables -L
  - netstat -i
  - ip addr show
  - ovs-vsctl show
  - ovs-vsctl list open .
If I remove the provider bridge from the second hypervisor: ovs-vsctl remove open . external-ids ovn-cms-options="enable-chassis-as-gw" ovs-vsctl remove open . external-ids ovn-bridge-mappings ip link set br-provider down ovs-vsctl del-br br-provider and disable enable_distributed_floating_ip
Then both VMs using SNAT on each compute server work.
This looks interesting. It would be good to also check the behavior when no VM has a FIP attached.
Turning the second chassis back on as a gateway immediately breaks the VM on the second compute server:
ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw ovs-vsctl add-br br-provider ovs-vsctl set open . external-ids:ovn-bridge-mappings=provider:br-provider ovs-vsctl add-port br-provider ens256 systemctl restart ovn-controller openvswitch.service
Here it would be interesting to check where exactly traffic drops using tcpdump.
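Something like this on the compute node hosting the non-working VM might help pinpoint it (a sketch; ens256 is the provider NIC from your commands, and the router/external addresses are placeholders to adjust for your environment):

# does the SNATed traffic leave via the local provider NIC?
tcpdump -nei ens256 host <router snat ip>
# or is it being tunneled to the chassis holding the cr-lrp?
tcpdump -nei <overlay nic> udp port 6081

Depending on which interface (if any) the packets and the replies show up on, you can tell whether the SNAT happens locally, gets tunneled to the gateway chassis, or is dropped before it ever leaves the node.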
I am running neutron 22.0.1 but maybe something related?
python3-neutron-22.0.1-1.el9s.noarch openstack-neutron-common-22.0.1-1.el9s.noarch openstack-neutron-22.0.1-1.el9s.noarch openstack-neutron-ml2-22.0.1-1.el9s.noarch openstack-neutron-openvswitch-22.0.1-1.el9s.noarch openstack-neutron-ovn-metadata-agent-22.0.1-1.el9s.noarch
On 2023-07-12 10:21, Gary Molenkamp wrote:
For comparison, I looked at how openstack-ansible was setting up OVN and I don't see any major differences other than O-A configures a manager for ovs: ovs-vsctl --id @manager create Manager "target=\ .... I don't believe this is the point of failure (but feel free to correct me if I'm wrong ;) ).
ovn-trace on both VM's inports shows the same trace for the working VM and the non-working VM. ie:
ovn-trace --db=$SB --ovs default_net 'inport == "f4cbc8c7-e7bf-47f3-9fea-a1663f6eb34d" && eth.src==fa:16:3e:a6:62:8e && ip4.src == 172.31.101.168 && ip4.dst == <provider's gateway IP>'
On 2023-07-07 14:08, Gary Molenkamp wrote:
Happy Friday afternoon.
I'm still pondering a lack of connectivity in an HA OVN with each compute node acting as a potential gateway chassis.
The problem is basically that the port of the OVN LRP may not be in the
same chassis as the VM that failed (since the CR-LRP will be where the first VM of that network will be created). The suggestion is to remove the enable-chassis-as-gw from the compute nodes to allow the VM to forward traffic via tunneling/Geneve to the chassis where the LRP resides.
I forced a similar VM onto the same chassis as the working VM, and it was able to communicate out. If we do want to keep multiple chassis' as gateways, would that be addressed with the ovn-bridge-mappings?
I built a small test cloud to explore this further as I continue to see the same issue: A vm will only be able to use SNAT outbound if it is on the same chassis as the CR-LRP.
In my test cloud, I have one controller, and two compute nodes. The controller only runs the north and southd in addition to the neutron server. Each of the two compute nodes is configured as below. On a tenent network I have three VMs: - #1: cirros VM with FIP - #2: cirros VM running on compute node 1 - #3: cirros VM running on compute node 2
E/W traffic between VMs in the same tenent network are fine. N/S traffic is fine for the FIP. N/S traffic only works for the VM whose CR-LRP is active on same chassis. Does anything jump out as a mistake in my understanding at to how this should be working?
Thanks as always, Gary
on each hypervisor:
/usr/bin/ovs-vsctl set open . external-ids:ovn-remote=tcp:{{ controllerip }}:6642 /usr/bin/ovs-vsctl set open . external-ids:ovn-encap-type=geneve /usr/bin/ovs-vsctl set open . external-ids:ovn-encap-ip={{ overlaynetip }} /usr/bin/ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw /usr/bin/ovs-vsctl add-br br-provider -- set bridge br-provider protocols=OpenFlow10,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15 /usr/bin/ovs-vsctl add-port br-provider {{ provider_nic }} /usr/bin/ovs-vsctl br-set-external-id provider bridge-id br-provider /usr/bin/ovs-vsctl set open . external-ids:ovn-bridge-mappings=provider:br-provider
plugin.ini: [ml2] mechanism_drivers = ovn type_drivers = flat,geneve tenant_network_types = geneve extension_drivers = port_security overlay_ip_version = 4 [ml2_type_flat] flat_networks = provider [ml2_type_geneve] vni_ranges = 1:65536 max_header_size = 38 [securitygroup] enable_security_group = True firewall_driver = neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver [ovn] ovn_nb_connection = tcp:{{controllerip}}:6641 ovn_sb_connection = tcp:{{controllerip}}:6642 ovn_l3_scheduler = leastloaded ovn_metadata_enabled = True enable_distributed_floating_ip = true
-- Gary Molenkamp Science Technology Services Systems Administrator University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
-- Gary Molenkamp Science Technology Services Systems Engineer University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
-- Gary Molenkamp Science Technology Services Systems Engineer University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
Thanks and Regards Yatin Karel
Thanks Yatin, I will put together a bug report.

I have found that if I disable enable_distributed_floating_ip, but leave the entire OVN/OVS setup as below for redundancy, then traffic flows as expected. As soon as I set enable_distributed_floating_ip to true, E/W traffic still works, but N/S traffic stops for the VMs not on the host with the CR-LRP. I can't say for sure why, as ovn-trace/flow debugging is still new to me, but the north and south DBs look correct.

Gary

On 2023-07-13 11:43, Yatin Karel wrote:
Hi Gary,
On Wed, Jul 12, 2023 at 9:22 PM Gary Molenkamp <molenkam@uwo.ca> wrote:
A little progress, but I may be tripping over bug https://bugs.launchpad.net/neutron/+bug/2003455
That bug was mostly targeting vlan provider networks but you mentioned you using geneve and flat networks so this might not be related.
Multiple components involved so it would be difficult to narrow it down here without much details as functionality wise it would have just worked(in my Train OVN environment i checked it worked fine). So I think it would be best to start with a bug report at https://bugs.launchpad.net/neutron/ with details(by reverting the env to previous state bridges, ovn-cms options configured and DVR enabled). Good to include details like:-
- Environment details:- - Number of controller, computes nodes - Nodes are virtual or physical - Deployment tool used, Operating System - Neutron version - OVN/OVS versions - Share ovn-controller logs from the compute and controller node - Share OVN Northbound and Southbound DB files from the controller node and ovs conf.db from compute nodes - Output of resources involved:- - openstack network agent list - openstack server list --long - openstack port list --router <router id> - Reproduction steps along with output from the operations(both with good and bad vms) - Output of below commands from controller and compute nodes:- - iptables -L - netstat -i - ip addr show - ovs-vsctl show - ovs-vsctl list open .
If I remove the provider bridge from the second hypervisor: ovs-vsctl remove open . external-ids ovn-cms-options="enable-chassis-as-gw" ovs-vsctl remove open . external-ids ovn-bridge-mappings ip link set br-provider down ovs-vsctl del-br br-provider and disable enable_distributed_floating_ip
Then both VMs using SNAT on each compute server work.
This looks interesting. Would be good to also check the behavior when no VM has FIP attached.
Turning the second chassis back on as a gateway immediately breaks the VM on the second compute server:
ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw ovs-vsctl add-br br-provider ovs-vsctl set open . external-ids:ovn-bridge-mappings=provider:br-provider ovs-vsctl add-port br-provider ens256 systemctl restart ovn-controller openvswitch.service
Here it would be interesting to check where exactly traffic drops using tcpdump.
I am running neutron 22.0.1 but maybe something related?
python3-neutron-22.0.1-1.el9s.noarch openstack-neutron-common-22.0.1-1.el9s.noarch openstack-neutron-22.0.1-1.el9s.noarch openstack-neutron-ml2-22.0.1-1.el9s.noarch openstack-neutron-openvswitch-22.0.1-1.el9s.noarch openstack-neutron-ovn-metadata-agent-22.0.1-1.el9s.noarch
On 2023-07-12 10:21, Gary Molenkamp wrote:
For comparison, I looked at how openstack-ansible was setting up OVN and I don't see any major differences other than O-A configures a manager for ovs: ovs-vsctl --id @manager create Manager "target=\ .... I don't believe this is the point of failure (but feel free to correct me if I'm wrong ;) ).
ovn-trace on both VM's inports shows the same trace for the working VM and the non-working VM. ie:
ovn-trace --db=$SB --ovs default_net 'inport == "f4cbc8c7-e7bf-47f3-9fea-a1663f6eb34d" && eth.src==fa:16:3e:a6:62:8e && ip4.src == 172.31.101.168 && ip4.dst == <provider's gateway IP>'
On 2023-07-07 14:08, Gary Molenkamp wrote:
Happy Friday afternoon.
I'm still pondering a lack of connectivity in an HA OVN with each compute node acting as a potential gateway chassis.
The problem is basically that the port of the OVN LRP may not be in the same chassis as the VM that failed (since the CR-LRP will be where the first VM of that network will be created). The suggestion is to remove the enable-chassis-as-gw from the compute nodes to allow the VM to forward traffic via tunneling/Geneve to the chassis where the LRP resides.
I forced a similar VM onto the same chassis as the working VM, and it was able to communicate out. If we do want to keep multiple chassis' as gateways, would that be addressed with the ovn-bridge-mappings?
I built a small test cloud to explore this further as I continue to see the same issue: A vm will only be able to use SNAT outbound if it is on the same chassis as the CR-LRP.
In my test cloud, I have one controller, and two compute nodes. The controller only runs the north and southd in addition to the neutron server. Each of the two compute nodes is configured as below. On a tenent network I have three VMs: - #1: cirros VM with FIP - #2: cirros VM running on compute node 1 - #3: cirros VM running on compute node 2
E/W traffic between VMs in the same tenent network are fine. N/S traffic is fine for the FIP. N/S traffic only works for the VM whose CR-LRP is active on same chassis. Does anything jump out as a mistake in my understanding at to how this should be working?
Thanks as always, Gary
on each hypervisor:
/usr/bin/ovs-vsctl set open . external-ids:ovn-remote=tcp:{{ controllerip }}:6642 /usr/bin/ovs-vsctl set open . external-ids:ovn-encap-type=geneve /usr/bin/ovs-vsctl set open . external-ids:ovn-encap-ip={{ overlaynetip }} /usr/bin/ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw /usr/bin/ovs-vsctl add-br br-provider -- set bridge br-provider protocols=OpenFlow10,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15 /usr/bin/ovs-vsctl add-port br-provider {{ provider_nic }} /usr/bin/ovs-vsctl br-set-external-id provider bridge-id br-provider /usr/bin/ovs-vsctl set open . external-ids:ovn-bridge-mappings=provider:br-provider
plugin.ini: [ml2] mechanism_drivers = ovn type_drivers = flat,geneve tenant_network_types = geneve extension_drivers = port_security overlay_ip_version = 4 [ml2_type_flat] flat_networks = provider [ml2_type_geneve] vni_ranges = 1:65536 max_header_size = 38 [securitygroup] enable_security_group = True firewall_driver = neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver [ovn] ovn_nb_connection = tcp:{{controllerip}}:6641 ovn_sb_connection = tcp:{{controllerip}}:6642 ovn_l3_scheduler = leastloaded ovn_metadata_enabled = True enable_distributed_floating_ip = true
-- Gary Molenkamp Science Technology Services Systems Administrator University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
-- Gary Molenkamp Science Technology Services Systems Engineer University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
-- Gary Molenkamp Science Technology Services Systems Engineer University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
Thanks and Regards Yatin Karel
-- Gary Molenkamp Science Technology Services Systems Engineer University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
Hello, might it have something to do with the firewall_driver in your plugin.ini?
plugin.ini: [ml2] mechanism_drivers = ovn type_drivers = flat,geneve tenant_network_types = geneve extension_drivers = port_security overlay_ip_version = 4 [ml2_type_flat] flat_networks = provider [ml2_type_geneve] vni_ranges = 1:65536 max_header_size = 38 [securitygroup] enable_security_group = True firewall_driver = neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver [ovn] ovn_nb_connection = tcp:{{controllerip}}:6641 ovn_sb_connection = tcp:{{controllerip}}:6642 ovn_l3_scheduler = leastloaded ovn_metadata_enabled = True enable_distributed_floating_ip = true
Marc
Thanks for the pointers, it looks like I'm starting to narrow it down. Something is still confusing me, though.
I've built a Zed cloud, since upgraded to Antelope, using the Neutron Manual install method here: https://docs.openstack.org/neutron/latest/install/ovn/manual_install.html I'm using a multi-tenant configuration using geneve and the flat provider network is present on each hypervisor. Each hypervisor is connected to the physical provider network, along with the tenant network and is tagged as an external chassis under OVN. br-int exists, as does br-provider ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw
Any specific reason to enable gateway on compute nodes? Generally it's recommended to use controller/network nodes as gateway. What's your env(number of controllers, network, compute nodes)?
Wouldn't it be interesting to enable-chassis-as-gw on the compute nodes, just in case you want to use DVR: If that's the case, you need to map the external bridge (ovs-vsctl set open . external-ids:ovn-bridge-mappings=...) via ansible this is created automatically, but in the manual installation I didn't see any mention of it. The problem is basically that the port of the OVN LRP may not be in the same chassis as the VM that failed (since the CR-LRP will be where the first VM of that network will be created). The suggestion is to remove the enable-chassis-as-gw from the compute nodes to allow the VM to forward traffic via tunneling/Geneve to the chassis where the LRP resides.
ovs-vsctl remove open . external-ids ovn-cms-options="enable-chassis-as-gw"
ovs-vsctl remove open . external-ids ovn-bridge-mappings
ip link set br-provider-name down
ovs-vsctl del-br br-provider-name
systemctl restart ovn-controller
systemctl restart openvswitch-switch
How does one support both use-case types? If I want to use DVR via each compute node, then I must create the br-provider bridge, set the chassis as a gateway and map the bridge. This seems to be breaking forwarding to the OVN LRP. The hypervisor/VM with the working LRP works but any other hypervisor is not tunneling via Geneve. Thanks as always, this is very informative. Gary -- Gary Molenkamp Science Technology Services Systems Administrator University of Western Ontario molenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
On Tue, Jun 27, 2023 at 15:22, Gary Molenkamp <molenkam@uwo.ca> wrote:
Thanks for the pointers, it looks like I'm starting to narrow it down. Something is still confusing me, though.
I've built a Zed cloud, since upgraded to Antelope, using the Neutron Manual install method here: https://docs.openstack.org/neutron/latest/install/ovn/manual_install.html I'm using a multi-tenant configuration using geneve and the flat provider network is present on each hypervisor. Each hypervisor is connected to the physical provider network, along with the tenant network and is tagged as an external chassis under OVN. br-int exists, as does br-provider ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw
Any specific reason to enable gateway on compute nodes? Generally it's recommended to use controller/network nodes as gateway. What's your env(number of controllers, network, compute nodes)?
Wouldn't it be interesting to enable-chassis-as-gw on the compute nodes, just in case you want to use DVR: If that's the case, you need to map the external bridge (ovs-vsctl set open . external-ids:ovn-bridge-mappings=...) via ansible this is created automatically, but in the manual installation I didn't see any mention of it.
The problem is basically that the port of the OVN LRP may not be in the same chassis as the VM that failed (since the CR-LRP will be where the first VM of that network will be created). The suggestion is to remove the enable-chassis-as-gw from the compute nodes to allow the VM to forward traffic via tunneling/Geneve to the chassis where the LRP resides.
ovs-vsctl remove open . external-ids ovn-cms-options="enable-chassis-as-gw" ovs-vsctl remove open . external-ids ovn-bridge-mappings ip link set br-provider-name down ovs-vsctl del-br br-provider-name systemctl restart ovn-controller systemctl restart openvswitch-switch
How does one support both use-case types?
If I want to use DVR via each compute node, then I must create the br-provider bridge, set the chassis as a gateway and map the bridge. This seems to be breaking forwarding to the OVN LRP. The hypervisor/VM with the working LRP works but any other hypervisor is not tunneling via Geneve.
https://docs.openstack.org/neutron/zed/ovn/faq/index.html

The E/W traffic is "completely distributed in all cases" for the OVN driver... It is natively supported and should work via OpenFlow / tunneling / Geneve without any issues.

The problem is that when you set the enable-chassis-as-gw flag, you enable gateway router port scheduling for a chassis that may not have an external bridge mapped (and this breaks external traffic).

You can trace the traffic on the chassis where the VM is and check where it is breaking via the datapath command:

ovs-dpctl dump-flows

But if you are facing problems with east/west traffic, please check your OVN settings (example):

ovs-vsctl list open_vswitch
- external_ids : {ovn-encap-ip="192.168.200.10", ovn-encap-type="geneve", ovn-remote="tcp:192.168.200.200:6642"}

...and make sure geneve tunnels are established between all hypervisors (example):

root@comp1:~# ovs-vsctl show
Bridge br-int
....
    Port ovn-2e4ed2-0
        Interface ovn-2e4ed2-0
            type: geneve
            options: {csum="true", key=flow, remote_ip="192.168.200.11"}
    Port ovn-fc7744-0
        Interface ovn-fc7744-0
            type: geneve
            options: {csum="true", key=flow, remote_ip="192.168.200.30"}
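For the datapath check, something along these lines (a sketch; the addresses and port numbers are placeholders, filter on the failing VM to keep the output readable):

# generate traffic from the non-working VM, then on its hypervisor:
ovs-dpctl dump-flows | grep <vm ip>

# or trace a synthetic packet through the local OpenFlow pipeline on br-int
ovs-appctl ofproto/trace br-int in_port=<vm ofport>,ip,dl_src=<vm mac>,dl_dst=<router mac>,nw_src=<vm ip>,nw_dst=<external ip>

If the flows show the traffic being sent to a Geneve tunnel port, the packet is being handed off to the chassis that holds the CR-LRP; if they show it going out the local patch port towards br-provider, the SNAT is expected to happen locally.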
Thanks as always, this is very informative.
Gary
-- Gary Molenkamp Science Technology Services Systems Administrator University of Western Ontariomolenkam@uwo.ca http://sts.sci.uwo.ca (519) 661-2111 x86882 (519) 661-3566
On 2023-06-27 15:02, Roberto Bartzen Acosta wrote:
On Tue, Jun 27, 2023 at 15:22, Gary Molenkamp <molenkam@uwo.ca> wrote:
Thanks for the pointers, it looks like I'm starting to narrow it down. Something is still confusing me, though.
I've built a Zed cloud, since upgraded to Antelope, using the Neutron Manual install method here: https://docs.openstack.org/neutron/latest/install/ovn/manual_install.html I'm using a multi-tenant configuration using geneve and the flat provider network is present on each hypervisor. Each hypervisor is connected to the physical provider network, along with the tenant network and is tagged as an external chassis under OVN. br-int exists, as does br-provider ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw
Any specific reason to enable gateway on compute nodes? Generally it's recommended to use controller/network nodes as gateway. What's your env(number of controllers, network, compute nodes)?
Wouldn't it be interesting to enable-chassis-as-gw on the compute nodes, just in case you want to use DVR: If that's the case, you need to map the external bridge (ovs-vsctl set open . external-ids:ovn-bridge-mappings=...) via ansible this is created automatically, but in the manual installation I didn't see any mention of it. The problem is basically that the port of the OVN LRP may not be in the same chassis as the VM that failed (since the CR-LRP will be where the first VM of that network will be created). The suggestion is to remove the enable-chassis-as-gw from the compute nodes to allow the VM to forward traffic via tunneling/Geneve to the chassis where the LRP resides.
ovs-vsctl remove open . external-ids ovn-cms-options="enable-chassis-as-gw"
ovs-vsctl remove open . external-ids ovn-bridge-mappings
ip link set br-provider-name down
ovs-vsctl del-br br-provider-name
systemctl restart ovn-controller
systemctl restart openvswitch-switch
How does one support both use-case types?
If I want to use DVR via each compute node, then I must create the br-provider bridge, set the chassis as a gateway and map the bridge. This seems to be breaking forwarding to the OVN LRP. The hypervisor/VM with the working LRP works but any other hypervisor is not tunneling via Geneve.
https://docs.openstack.org/neutron/zed/ovn/faq/index.html The E/W traffic is "completely distributed in all cases." for OVN driver... It is natively supported and should work via openflow / tunneling / Geneve without any issues.
The problem is that when you set the enable-chassis-as-gw flag you enable gateway router port scheduling for a chassis that may not have an external bridge mapped (and this breaks external traffic).
E/W traffic looks good and each compute shows forwarding connections to the other computes. Each compute has the proper external bridge mapped, i.e.:

external_ids : {hostname=compute05.cloud.sci.uwo.ca, ovn-bridge-mappings="provider:br-provider", ovn-cms-options=enable-chassis-as-gw, ovn-encap-ip="192.168.0.105", ovn-encap-type=geneve, ovn-remote="tcp:172.31.102.100:6642", rundir="/var/run/openvswitch", system-id="8e0fa17c-e480-4b60-9015-bd8833412561"}

Likewise, all geneve tunnels between the compute nodes are established.

--
Gary Molenkamp
Science Technology Services
Systems Administrator
University of Western Ontario
molenkam@uwo.ca  http://sts.sci.uwo.ca
(519) 661-2111 x86882  (519) 661-3566