Hi Jan,

If I understand correctly, the issue you are facing is that with ovs-dvr the floating IPs are implemented in the SNAT namespace on the network node, creating a congestion point under high traffic. You are looking for a way to implement floating IPs that are distributed across your deployment rather than concentrated on the network nodes. Is that correct?

If so, I think what you are looking for is distributed floating IPs with OVN [1]. I will let the OVN experts confirm this. I have put a rough config sketch at the bottom of this mail, below your quoted message.

Michael

[1] https://docs.openstack.org/networking-ovn/latest/admin/refarch/refarch.html#...

On Fri, May 6, 2022 at 6:19 AM Jan Horstmann <J.Horstmann@mittwald.de> wrote:
Hello!
When we initially deployed openstack we thought that using distributed virtual routing with ml2/ovs-dvr would give us the ability to automatically scale our network capacity with the number of hypervisors we use. Our main workloads are kubernetes clusters, which receive ingress traffic via octavia loadbalancers (configured to use the amphora driver). So the idea was that we could increase the number of loadbalancers to spread the traffic over more and more compute nodes. This would imply that any volume-based (distributed) denial of service attack on a single loadbalancer would only saturate a single compute node and leave the rest of the system functional.
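For illustration, by "increase the number of loadbalancers" I simply mean creating additional amphora loadbalancers and pointing further ingress endpoints at them, roughly along these lines (names and ids are placeholders):

    openstack loadbalancer create --name ingress-lb-2 --vip-subnet-id <tenant-subnet-id>
    openstack loadbalancer listener create --name tcp-443 --protocol TCP --protocol-port 443 ingress-lb-2

Since the amphorae are just virtual machines scheduled across our compute nodes, this is where the scaling assumption comes from.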
We have recently learned that, no matter the loadbalancer topology, a virtual IP is created for it by octavia. This virtual IP, and probably all virtual IPs in openstack, is reserved by an unbound and disabled port and then set as an allowed address pair on the port of any server which might hold it. Up to this point our initial assumption should still be true, as the server actually holding the virtual IP would reply to any ARP requests, so traffic should be routed to the node hosting the octavia amphora's virtual machine.

However, we are using our main provider network as a floating IP pool and do not allow direct port creation. When a floating IP is attached to the virtual IP it is assigned to the SNAT router namespace on a network node. Naturally, in high traffic or even (distributed) denial of service situations the network node might become a bottleneck, which is exactly the situation we thought we could avoid by using distributed virtual routing in the first place.
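For illustration, this is roughly how it looks from the CLI and on a network node (ids are placeholders):

    # the virtual IP is reserved by an unbound, disabled port and mirrored
    # as an allowed address pair on the amphora's port
    openstack port show <vip-port-id> -c fixed_ips -c status -c device_owner
    openstack port show <amphora-port-id> -c allowed_address_pairs

    # after attaching a floating IP to the virtual IP it shows up in the
    # centralized snat namespace on the network node rather than in a
    # fip/qrouter namespace on the compute node
    ip netns exec snat-<router-id> ip -4 addr show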
This leads me down a rabbit hole of questions which I hope someone might be able to help with:
Is the assessment above correct or am I missing something?
If it is correct, do we have any other options than vertically scaling our network nodes to handle traffic? Do other ml2 drivers (e.g. OVN) handle this scenario differently?
If our network nodes need to handle most of the traffic anyway, do we still have any advantage in using distributed virtual routing? Especially when considering the increased complexity compared to a non-distributed setup?
Has anyone ever explored non-virtual-IP-based high availability options, e.g. BGP multipathing, in a distributed virtual routing scenario?
Any input is highly appreciated.

Regards,
Jan
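Config sketch mentioned at the top of this mail. As far as I remember, distributed floating IPs are toggled by a single option in the OVN mechanism driver configuration, something along these lines (please verify against the reference architecture doc [1] before relying on it):

    # ml2_conf.ini on the hosts running neutron-server
    [ovn]
    enable_distributed_floating_ip = True

With that set, the floating IP NAT should be performed on the compute node hosting the port instead of on a central gateway node. Whether this also covers the allowed-address-pair VIP case is something I will leave to the OVN experts.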