[neutron][openvswitch] NAT + AFS = Sadness
Hi All,

CERN may be the only other people doing anything with AFS hopefully this
catches someone's eye who will just know....

The AFS filesystem uses a rather "unique" connection-oriented UDP protocol
https://docs.kernel.org/networking/rxrpc.html

This is where all our user home directories live so that's a key requirement
for us. In my production cloud (mitaka sadly) we use directly routable fixed
IPs by default and all is well. In test (epoxy) we want to move away from the
public v4 default IPs and lean more on floating IPs for public addressing.

Testing this I found impossibly poor performance for AFS behind NAT (with or
without a floating IP in front) on both Mitaka and Epoxy (both using
openvswitch and DVR). This isn't a general NAT issue, as I and others can use
AFS from our homes behind consumer NAT boxes of various sorts, and standalone
VMs outside OpenStack behind iptables-based NAT are also fine.

tcpdumping the connection on the fileserver sees strings of ICMP unreachable
messages from the OpenStack router address (or floating IP if attached)
mid-conversation. My guess is that the NAT is not maintaining a stable enough
port mapping because of the generally true assumption that UDP is stateless
and will work itself out.

I'm not really deeply familiar with using openvswitch for NAT as seems to be
happening here. Is there something I can look for in the flow rules to see
what's going on? Can I tune anything to make the mappings more stable
(assuming that's the issue) or am I just going to need to keep the old
architecture that works?

Thanks,
-Jon

-- 
Jonathan Proulx (he/him)
Sr. Technical Architect
The Infrastructure Group
MIT CSAIL
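[A minimal sketch of where that NAT state can be inspected, assuming an
ML2/OVS deployment where the Neutron router namespace does conntrack-based
NAT; the router UUID below is a placeholder and 7000/udp is the standard
OpenAFS fileserver port:

    # Find the router namespace on the node handling the traffic:
    sudo ip netns list | grep qrouter

    # List the live UDP NAT mappings toward the fileservers (OpenAFS
    # fileservers listen on 7000/udp):
    sudo ip netns exec qrouter-<router-uuid> conntrack -L -p udp --dport 7000

    # If NAT is instead done with OVS-native conntrack, the state
    # lives in the datapath:
    sudo ovs-appctl dpctl/dump-conntrack | grep 7000

    # Kernel timeouts governing how long an idle UDP mapping survives;
    # unusually short values here could explain unstable mappings:
    sysctl net.netfilter.nf_conntrack_udp_timeout
    sysctl net.netfilter.nf_conntrack_udp_timeout_stream
]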
On 2025-11-17 09:44:20 -0500 (-0500), Jonathan Proulx wrote:
> CERN may be the only other people doing anything with AFS hopefully this
> catches someone's eye who will just know.... [...]
The OpenDev Collaboratory (in which OpenStack itself is developed) relies
heavily on AFS and has for around a decade run OpenAFS clients and servers on
OpenStack Nova instances. Sadly I don't have an answer for your dilemma, but
am similarly interested.

In our case, our OpenAFS servers all have interfaces on directly-connected
provider networks. The only cases where we do AFS through floating IP are
OpenAFS clients. While I haven't observed performance degradation related
specifically to this, because we distribute our servers and clients globally
the latency is fairly debilitating to throughput already.

We've also been evaluating kAFS off and on for our AFS clients, but doubt it
deals with these specific protocol challenges any better than OpenAFS. I've
heard AuriStor (a commercial fork) has implemented some extensions to/revision
of the protocol in order to make it more efficient over the modern Internet,
so it might be worth looking into if being strictly open source isn't a
requirement in your environment (in OpenDev we only use open source software,
so we don't have direct experience with it).

-- 
Jeremy Stanley
Hi Jeremy

On Mon, Nov 17, 2025 at 04:27:03PM +0000, Jeremy Stanley wrote:
:On 2025-11-17 09:44:20 -0500 (-0500), Jonathan Proulx wrote:
:> CERN may be the only other people doing anything with AFS hopefully this
:> catches someone's eye who will just know....
:[...]
:
:The OpenDev Collaboratory (in which OpenStack itself is developed) relies
:heavily on AFS and has for around a decade run OpenAFS clients and servers on
:OpenStack Nova instances. Sadly I don't have an answer for your dilemma, but
:am similarly interested.

How did I not know this, probably I forgot :)

:In our case, our OpenAFS servers all have interfaces on directly-connected
:provider networks. The only cases where we do AFS through floating IP are
:OpenAFS clients. While I haven't observed performance degradation related
:specifically to this, because we distribute our servers and clients globally
:the latency is fairly debilitating to throughput already.

This is encouraging; it suggests it's some configuration issue on my end. So
you do have clients with "private" network addresses using floating IP that
access your AFS cell normally?

For a long time we've used provider networks for our fixed IPs and not used
floating IPs, and all has been well. Using a private network with a neutron
router doing NAT, I can log in on the VNC console and things seem normal.
When I attach the floating IP everything breaks.

AFS isn't great with IP changes, but this stays broken across soft reboots so
I don't *think* it's an issue of connecting via the NAT IP first then trying
to talk through the floating IP.

I'm currently trying to map out the actual packet flow and going a bit
cross-eyed with the redundant network nodes and the DVR, but I'll get
there...

-Jon

-- 
Jonathan Proulx (he/him)
Sr. Technical Architect
The Infrastructure Group
MIT CSAIL
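[One way to map that packet flow, sketched under the assumption of standard
AFS ports (7000/udp fileserver, 7001/udp client callback) and a placeholder
router UUID; run it in each copy of the qrouter namespace (network nodes and
hypervisor) and compare which copy sees which direction:

    # Repeat on every node that has a copy of the namespace:
    sudo ip netns exec qrouter-<router-uuid> \
        tcpdump -ni any 'udp port 7000 or udp port 7001 or icmp'

    # The policy routing inside the namespace shows which path
    # (centralized SNAT vs. distributed floating IP) a given source
    # address takes:
    sudo ip netns exec qrouter-<router-uuid> ip rule show
]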
On 2025-11-17 14:23:23 -0500 (-0500), Jonathan Proulx wrote: [...]
> So you do have clients with "private" network addresses using floating IP
> that access your AFS cell normally? [...]
Yes, currently one of the public cloud providers who donates resources to
OpenDev has, for reasons I don't understand, mandated that only user-defined
RFC-1918 networks can be attached to server instances, so that all Internet
access from server instances requires NAT. We have OpenAFS 1.8.13 currently
running on Ubuntu 24.04 LTS virtual machine instances in three regions there,
all connecting to the Internet through floating IPs, which is how they
communicate with our Internet-connected AFS fileservers (all of which are in
other cloud providers).

We haven't done anything special for these, though it's probably worth
pointing out that they aren't performing write operations into AFS, and are
merely unauthenticated Web front-ends anonymously serving content from
read-only AFS replica volumes. It's possible that our limited usage pattern
in this case is what's saving us from encountering the problems you're
having.

-- 
Jeremy Stanley
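[One mechanism that could make the usage pattern matter: the fileserver
initiates callback-break RPCs back to the cache manager on 7001/udp, so a
client behind NAT depends on that inbound mapping persisting, and a
read-mostly anonymous workload presumably triggers far fewer callbacks. A
small sketch for checking this on a client, assuming standard OpenAFS
tooling:

    # Watch for inbound callback traffic from the fileservers (the
    # cache manager listens on 7001/udp):
    sudo tcpdump -ni any udp port 7001

    # Ask the cache manager whether it still believes all the
    # fileservers are reachable:
    fs checkservers
]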
On Mon, Nov 17, 2025 at 07:51:15PM +0000, Jeremy Stanley wrote:
:On 2025-11-17 14:23:23 -0500 (-0500), Jonathan Proulx wrote:
:[...]
:> So you do have clients with "private" network addresses using floating IP
:> that access your AFS cell normally?
:[...]
:
:Yes, currently one of the public cloud providers who donates resources to
:OpenDev has, for reasons I don't understand, mandated that only user-defined
:RFC-1918 networks can be attached to server instances, so that all Internet
:access from server instances requires NAT. We have OpenAFS 1.8.13 currently
:running on Ubuntu 24.04 LTS virtual machine instances in three regions
:there, all connecting to the Internet through floating IPs, which is how
:they communicate with our Internet-connected AFS fileservers (all of which
:are in other cloud providers).

OK, either I'm subtly screwing up my neutron config in a way only AFS notices
(quite possible) or they're using a different neutron plugin. I replicated
that setup, which is very near our own, except the servers are one hop away
in the same room, and as soon as I add the floating IP I lose connection to
all the fileservers.

The qrouter netns for the router on the private network exists on the
hypervisor and all three network nodes (in test), and when it's working I see
inbound traffic from the fileserver in this netns on one of the controllers
but the outbound traffic in the hypervisor's qrouter netns. When I set the
floating IP all traffic appears in the hypervisor's netns, which honestly
with DVR is what I expected.

So this is somewhat multiply odd; I would expect the first case with
asymmetric paths to be more broken than the second. Also, Friday when I first
hit this I was seeing ICMP unreachables on the fileserver (and not inside the
VM), but right now it seems like there are no errors and packets match on
both ends of the connection, it's just not working...

For now I'm going to presume it's me and sift through my neutron configs to
see if I'm setting conflicting or otherwise nonsensical options.

-Jon

-- 
Jonathan Proulx (he/him)
Sr. Technical Architect
The Infrastructure Group
MIT CSAIL
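[For the floating-IP case, a sketch of what to inspect on the hypervisor,
assuming a stock DVR layout where the floating IP hangs off the qrouter
namespace and reaches the external network through a separate fip namespace
over an rfp-/fpr- veth pair; all UUIDs are placeholders:

    # Policy routing that steers the fixed IP's traffic onto the
    # distributed floating-IP path:
    sudo ip netns exec qrouter-<router-uuid> ip rule show

    # The qrouter side of the veth pair toward the fip namespace:
    sudo ip netns exec qrouter-<router-uuid> ip addr show | grep rfp

    # Capture on the fip-namespace side to confirm the translated
    # packets actually leave with the floating IP as source:
    sudo ip netns exec fip-<ext-net-uuid> tcpdump -ni any udp port 7000

    # Compare conntrack state over time; new entries appearing
    # mid-conversation would mean the mapping is being dropped and
    # rebuilt:
    sudo ip netns exec qrouter-<router-uuid> conntrack -L -p udp
]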