Hello Gary:

If you have 2 VMs in the same network, both without FIPs, and one is working but not the other, I would compare the Neutron and OVN resources of both ports 1:1 (I guess each VM has a single port). I would start with the OVN NAT entries. You should also check the routing table inside the VM.
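
The 1:1 comparison can be sketched as a field-by-field diff of the two port records. This is only an illustration: it assumes both ports were exported as JSON (for example with `openstack port show <port> -f json`), the helper name `diff_records` is made up, and the sample records and host names below are hypothetical, not taken from Gary's cloud.

```python
import json


def diff_records(a, b, ignore=()):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    skipped = set(ignore)
    diff = {}
    for key in sorted(set(a) | set(b)):
        if key in skipped:
            continue
        left, right = a.get(key), b.get(key)
        if left != right:
            diff[key] = (left, right)
    return diff


# Hypothetical, trimmed-down port records (assumption: real ones have many
# more fields and would be loaded from files instead of inline strings).
working_port = json.loads("""{
  "id": "port-of-working-vm",
  "status": "ACTIVE",
  "binding_host_id": "compute-a",
  "binding_vif_type": "ovs",
  "port_security_enabled": true,
  "device_owner": "compute:nova"
}""")
faulty_port = json.loads("""{
  "id": "port-of-faulty-vm",
  "status": "ACTIVE",
  "binding_host_id": "compute-b",
  "binding_vif_type": "ovs",
  "port_security_enabled": true,
  "device_owner": "compute:nova"
}""")

# Fields that always differ between two ports are ignored up front.
differences = diff_records(working_port, faulty_port, ignore=("id",))
for field, (good, bad) in differences.items():
    print(f"{field}: working={good!r} faulty={bad!r}")
```

In this made-up sample only the binding host differs, which is expected when the VMs run on different hypervisors. Any other field that shows up in the diff is a candidate to look at. The same helper can be applied to the OVN side if those rows are also dumped as JSON.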

Apart from that, you should also trace the VM traffic to find out where it is dropped. Maybe the traffic leaves the GW port correctly but the reply never gets back (in that case, check your underlying network configuration). Or, as you suggested, the SNAT is not working for this specific port.
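
The order of checks above can be written down as a small decision helper. It is a rough sketch, not a tool: it assumes you have already captured traffic (for example with tcpdump) at two points, the VM port on its hypervisor and the provider interface of the chassis hosting the router gateway port, and you feed in what you saw. The function name and the three flags are hypothetical.

```python
CHECK_VM_ROUTING = "packet never leaves the VM: check the routing table inside the VM"
CHECK_SNAT = "packet does not leave the gateway with the SNAT address: check the OVN NAT entries"
CHECK_UNDERLAY = "request leaves but no reply returns: check the underlying network configuration"
CHECK_RETURN_PATH = "reply reaches the gateway: check the return path to the VM"


def locate_drop(seen_on_vm_port, seen_snatted_on_gw, reply_seen_on_gw):
    """Map capture observations to the next thing worth checking.

    seen_on_vm_port:     the request was seen on the VM port of the hypervisor
    seen_snatted_on_gw:  the request was seen leaving the gateway chassis with
                         the router's external IP as source address
    reply_seen_on_gw:    a reply was seen arriving at the gateway chassis
    """
    if not seen_on_vm_port:
        return CHECK_VM_ROUTING
    if not seen_snatted_on_gw:
        return CHECK_SNAT
    if not reply_seen_on_gw:
        return CHECK_UNDERLAY
    return CHECK_RETURN_PATH


# Example: the request goes out translated, but nothing comes back.
print(locate_drop(True, True, False))
```

The checks are ordered from the VM outwards, so the first capture point where the packet is missing tells you which component to inspect.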

Regards.

On Tue, Jun 27, 2023 at 1:38 PM Gary Molenkamp <molenkam@uwo.ca> wrote:
Good morning,

I'm having a problem with SNAT routing under OVN, but I'm not sure
whether something is misconfigured or my understanding of how OVN is
architected is wrong.

I've built a Zed cloud, since upgraded to Antelope, using the Neutron
manual install method here:
https://docs.openstack.org/neutron/latest/install/ovn/manual_install.html
I'm using a multi-tenant configuration with Geneve, and the flat
provider network is present on each hypervisor. Each hypervisor is
connected to the physical provider network, along with the tenant
network, and is tagged as an external chassis under OVN.
         br-int exists, as does br-provider
         ovs-vsctl set open . external-ids:ovn-cms-options=enable-chassis-as-gw

In most cases, distributed FIP-based connectivity works without issue,
but VMs without a FIP are not always able to use the SNAT services of
the tenant network router.
Scenario:
     Internal network named cs3319:  with subnet 172.31.100.0/23
     Has a router named cs3319_router with external gateway set (snat
enabled)

     This network has 3 vms:
         - #1 has a FIP and can be accessed externally
         - #2 has no FIP, can be accessed via #1 and can access
external resources via SNAT (e.g. OS repos, DNS, etc.)
         - #3 has no FIP, can be accessed via #1 but has no external
SNAT connectivity

From what I can tell, the chassis config is correct: compute05 is the
hypervisor, and the faulty VM has a port binding on it:

ovn-sbctl show
...
Chassis "8e0fa17c-e480-4b60-9015-bd8833412561"
     hostname: compute05.cloud.sci.uwo.ca
     Encap geneve
         ip: "192.168.0.105"
         options: {csum="true"}
     Port_Binding "7a5257eb-caea-45bf-b48c-620c5dff4b39"
     Port_Binding "50e16602-78e6-429b-8c2f-e7e838ece1b4"
     Port_Binding "f121c9f4-c3fe-4ea9-b754-a809be95a3fd"

The router has the candidate gateways, and the snat set:

ovn-nbctl show  92df19a7-4ebe-43ea-b233-f4e9f5a46e7c
router 92df19a7-4ebe-43ea-b233-f4e9f5a46e7c (neutron-389439b5-07f8-44b6-a35b-c76651b48be5) (aka cs3319_public_router)
     port lrp-44ae1753-845e-4822-9e3d-a41e0469e257
         mac: "fa:16:3e:9a:db:d8"
         networks: ["129.100.21.94/22"]
         gateway chassis: [5c039d38-70b2-4ee6-9df1-596f82c68106 99facd23-ad17-4b68-a8c2-1ff6da15ac5f 1694116c-6d30-4c31-b5ea-0f411878316e 2a4bbaf9-228a-462e-8970-0cdbf59086e6 9332c61b-93e1-4a70-9547-701a014bfd98]
     port lrp-509bba37-fa06-42d6-9210-2342045490db
         mac: "fa:16:3e:ff:0f:3b"
         networks: ["172.31.100.1/23"]
     nat 11e0565a-4695-4f67-b4ee-101f1b1b9a4f
         external ip: "129.100.21.94"
         logical ip: "172.31.100.0/23"
         type: "snat"
     nat 21e4be02-d81c-46e8-8fa8-3f94edb4aed1
         external ip: "129.100.21.87"
         logical ip: "172.31.100.49"
         type: "dnat_and_snat"

Each network agent on the hypervisors shows the OVN controller up:
      OVN Controller Gateway agent | compute05.cloud.sci.uwo.ca |                   | :-)   | UP    | ovn-controller

The OVS vswitch on the hypervisor looks correct as far as I can tell,
and the BFD status of the OVN tunnel ports shows all of them forwarding
to the other hypervisors, e.g.:
    Port ovn-2a4bba-0
             Interface ovn-2a4bba-0
                 type: geneve
                 options: {csum="true", key=flow, remote_ip="192.168.0.106"}
                 bfd_status: {diagnostic="No Diagnostic", flap_count="1", forwarding="true", remote_diagnostic="No Diagnostic", remote_state=up, state=up}


Any advice on where to look would be appreciated.

PS.  Version info:
     Neutron 22.0.0-1
     OVN 22.12

    neutron options:
       enable_distributed_floating_ip = true
       ovn_l3_scheduler = leastloaded



Thanks
Gary



--
Gary Molenkamp                  Science Technology Services
Systems/Cloud Administrator     University of Western Ontario
molenkam@uwo.ca                  http://sts.sci.uwo.ca
(519) 661-2111 x86882           (519) 661-3566