On Tue, Aug 24, 2021 at 4:24 PM Clark Boylan <cboylan@sapwetik.org> wrote:
On Tue, Aug 24, 2021, at 11:21 AM, Goutham Pacha Ravi wrote:
Hi,
Bubbling up this issue - I reported a launchpad bug with some more debug information: https://bugs.launchpad.net/bugs/1939627
I’m confused about how/why only the Manila gates are hitting this issue. If you’re reading - do you have any tests elsewhere that set up a nova instance on a devstack and ping/communicate with the internet/outside world? If yes, I’d love to compare configuration with your setup.
It has been a while since I looked at this stuff and the OVN switch may have changed it, but we have historically intentionally avoided external connectivity for the nested devstack cloud in upstream CI. Instead we expect the test jobs to be self-contained. On multinode jobs we use vxlan to set up an entire L2 network, with very simple L3 routing, that is independent of the host system networking. This allows tempest to talk to the test instances on the nested cloud. But those nested cloud instances cannot get off the host instance. This is important because it helps keep things like DHCP requests from leaking out into the world.
Even if you configured floating IPs on the inner instances, those IPs wouldn't be routable to the host instance in the clouds that we host jobs in. To make this work without a bunch of support from the hosting environment, you would need to NAT between the host instance's IP and the inner nested devstack instances so that traffic exiting the test instances has a path to the outside world that can receive the return traffic.
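As a rough illustration of the NAT described above (not the rule devstack itself installs), a source-NAT rule on the host could be built along these lines. The helper name masquerade_rule, the range 172.24.4.0/24 and the interface eth0 are assumptions chosen for the example; the sketch only constructs the iptables command and does not run it.

```python
import ipaddress
import shlex
from typing import List


def masquerade_rule(inner_cidr: str, host_iface: str) -> List[str]:
    """Build an iptables command that source-NATs traffic from inner_cidr
    as it leaves through host_iface, so replies can find their way back."""
    # Validate the CIDR here so a typo fails early instead of inside iptables.
    network = ipaddress.ip_network(inner_cidr, strict=True)
    return [
        "iptables", "-t", "nat", "-A", "POSTROUTING",
        "-s", str(network),   # traffic originating from the nested instances
        "-o", host_iface,     # leaving via the host's outward-facing interface
        "-j", "MASQUERADE",   # rewrite the source address to the host's own IP
    ]


if __name__ == "__main__":
    # Hypothetical values; substitute the external range and interface in use.
    cmd = masquerade_rule("172.24.4.0/24", "eth0")
    print(" ".join(shlex.quote(part) for part in cmd))
```

Running the printed command would need root on the host, and IP forwarding would also have to be enabled for the return traffic to reach the inner instances.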
Thanks Clark; the tests we're running are either outside CI - i.e., with local devstack, or on third-party CI systems where the guest VMs would need to access storage that's external to the devstack. These are tempest tests that have no reason to reach out to the external world. I didn't know the reason behind this design, so this insight is useful to me!
In https://bugs.launchpad.net/bugs/1939627 you are creating an external network and floating IPs, but once that traffic leaves the host system there must be a return path for any responses, and I suspect that is what is missing? It is odd that this would coincide with the OVN change. Maybe IP ranges were updated with the OVN change and any hosting support that enabled routing of those IPs is no longer valid as a result?
I was scratching my head and reading the OVN integration code as well - but neutron internals aren't my strong suit. slaweq and ralonsoh were able to root-cause this to missing NAT rule/s in the devstack ovn-agent setup. \o/
Thank you so much for your help,
Goutham