For comparison, I looked at how openstack-ansible was setting up OVN and I don't see any major differences other than O-A configures a manager for ovs:
      ovs-vsctl --id @manager create Manager "target=\ ....
I don't believe this is the point of failure (but feel free to correct me if I'm wrong ;) ).

ovn-trace on both VMs' inports shows the same trace for the working VM and the non-working VM, i.e.:

ovn-trace --db=$SB --ovs default_net 'inport == "f4cbc8c7-e7bf-47f3-9fea-a1663f6eb34d" && eth.src==fa:16:3e:a6:62:8e && ip4.src == 172.31.101.168 && ip4.dst == <provider's gateway IP>'



On 2023-07-07 14:08, Gary Molenkamp wrote:
Happy Friday afternoon.

I'm still pondering a lack of connectivity in an HA OVN with each compute node acting as a potential gateway chassis.

The problem is basically that the port of the OVN LRP may not be in the same chassis as the VM that failed (since the CR-LRP will be where the first VM of that network will be created). The suggestion is to remove the enable-chassis-as-gw from the compute nodes to allow the VM to forward traffic via tunneling/Geneve to the chassis where the LRP resides.
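
If one wanted to try that suggestion, the option would have to be dropped from ovn-cms-options without disturbing anything else set there. A minimal sketch in POSIX shell follows. strip_cms_option is a hypothetical helper name, and availability-zones=az1 only stands in for some other option that might be present:

```shell
# Hypothetical helper: remove one entry from a comma-separated
# ovn-cms-options value and print what is left.
strip_cms_option() {
    # $1 = current option string, $2 = option to drop
    out=""
    old_ifs=$IFS
    IFS=,
    for opt in $1; do
        if [ "$opt" != "$2" ]; then
            # Append with a comma only when something is already kept.
            out="${out:+$out,}$opt"
        fi
    done
    IFS=$old_ifs
    printf '%s\n' "$out"
}

# Example value only; prints: availability-zones=az1
strip_cms_option "enable-chassis-as-gw,availability-zones=az1" enable-chassis-as-gw
```

On a compute node the result could then be written back with "ovs-vsctl set open . external-ids:ovn-cms-options=<new value>". When nothing is left, the key could be dropped with "ovs-vsctl remove open . external-ids ovn-cms-options".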


I forced a similar VM onto the same chassis as the working VM, and it was able to communicate out. If we do want to keep multiple chassis as gateways, would that be addressed with ovn-bridge-mappings?


I built a small test cloud to explore this further, as I continue to see the same issue: a VM can only use SNAT outbound if it is on the same chassis as the CR-LRP.
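
To see which chassis currently holds the CR-LRP, the southbound Port_Binding table can be queried for chassisredirect ports. The following is a rough sketch, not a tested procedure. cr_lrp_bindings and the OVN_SBCTL override are hypothetical names added for illustration, $SB is the southbound DB address as in the ovn-trace command, and 192.0.2.10 is a placeholder:

```shell
# Hypothetical wrapper: list every chassisredirect port and the chassis
# it is bound to. OVN_SBCTL can be overridden for a dry run.
cr_lrp_bindings() {
    ${OVN_SBCTL:-ovn-sbctl} --db="$SB" --columns=logical_port,chassis \
        find Port_Binding type=chassisredirect
}

# Dry run in a subshell: print the arguments instead of contacting the DB.
SB=tcp:192.0.2.10:6642   # placeholder address
( OVN_SBCTL=echo; cr_lrp_bindings )
```

Run for real (without the override), the chassis column comes back as a UUID, which "ovn-sbctl list Chassis" should map to a hostname.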

In my test cloud, I have one controller and two compute nodes. The controller only runs the north and southd in addition to the neutron server. Each of the two compute nodes is configured as below. On a tenant network I have three VMs:
    - #1:  cirros VM with FIP
    - #2:  cirros VM running on compute node 1
    - #3:  cirros VM running on compute node 2

E/W traffic between VMs in the same tenant network is fine. N/S traffic is fine for the FIP. N/S traffic only works for the VM whose CR-LRP is active on the same chassis. Does anything jump out as a mistake in my understanding as to how this should be working?

Thanks as always,
Gary


on each hypervisor:

/usr/bin/ovs-vsctl set open . external-ids:ovn-remote=tcp:{{ controllerip }}:6642
/usr/bin/ovs-vsctl set open . external-ids:ovn-encap-type=geneve
/usr/bin/ovs-vsctl set open . external-ids:ovn-encap-ip={{ overlaynetip }}
/usr/bin/ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw
/usr/bin/ovs-vsctl add-br br-provider -- set bridge br-provider protocols=OpenFlow10,OpenFlow12,OpenFlow13,OpenFlow14,OpenFlow15
/usr/bin/ovs-vsctl add-port br-provider {{ provider_nic }}
/usr/bin/ovs-vsctl br-set-external-id provider bridge-id br-provider
/usr/bin/ovs-vsctl set open . external-ids:ovn-bridge-mappings=provider:br-provider

plugin.ini:
[ml2]
mechanism_drivers = ovn
type_drivers = flat,geneve
tenant_network_types = geneve
extension_drivers = port_security
overlay_ip_version = 4
[ml2_type_flat]
flat_networks = provider
[ml2_type_geneve]
vni_ranges = 1:65536
max_header_size = 38
[securitygroup]
enable_security_group = True
firewall_driver = neutron.agent.linux.iptables_firewall.OVSHybridIptablesFirewallDriver
[ovn]
ovn_nb_connection = tcp:{{controllerip}}:6641
ovn_sb_connection = tcp:{{controllerip}}:6642
ovn_l3_scheduler = leastloaded
ovn_metadata_enabled = True
enable_distributed_floating_ip = true
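
As a sanity check on the two configs above, the physical network named in flat_networks should also appear in each chassis's ovn-bridge-mappings. The sketch below looks up the bridge for a given physical network in a mappings string. bridge_for_physnet is a hypothetical helper name, and the string is passed in by hand rather than read from OVS:

```shell
# Hypothetical helper: print the bridge mapped to a physical network in an
# ovn-bridge-mappings string ("physnet:bridge[,physnet:bridge...]").
# Returns non-zero when the physical network is not mapped.
bridge_for_physnet() {
    # $1 = mappings string, $2 = physical network name
    old_ifs=$IFS
    IFS=,
    for pair in $1; do
        if [ "${pair%%:*}" = "$2" ]; then
            IFS=$old_ifs
            printf '%s\n' "${pair#*:}"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# With the values from this setup; prints: br-provider
bridge_for_physnet "provider:br-provider" provider
```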




-- 
Gary Molenkamp			Science Technology Services
Systems Administrator		University of Western Ontario
molenkam@uwo.ca                 http://sts.sci.uwo.ca
(519) 661-2111 x86882		(519) 661-3566
