Controller node openstack LXC containers clashing on IP subnets
Hey Sarah,
Your containers are using the same IP range as your physical network (both are on 10.0.1.x), and that's causing chaos: when you try to connect to a container (like your repo server), the network gets confused and HAProxy throws a 503 Service Unavailable error.
Basically the IPs are clashing, and traffic is getting tangled up.
Just leave your physical network alone and shift your containers to a different IP range, like 192.168.0.x - this way everyone stays in their own lane, the conflicts disappear, and you're good to go.
Best wishes,
Kerem ÇELİKER
On Mon, Mar 17, 2025 at 20:23 Sarah Thompson <plodger@gmail.com> wrote:
Hi all,
I've got much further now, and am now tracking down networking issues. I'm pretty sure everything is now installing, but I'm seeing a systematic issue.
My test network that I'm running the openstack instances on is 10.0.1.x -- the VMs all have fixed IP addresses, can talk to each other, etc, nothing weird going on.
haproxy is installing and running, but throwing 503 errors. Digging into this, it seems that there are some issues with the network configs of at least some of the LXC containers. The one I'm seeing that's preventing Ansible from completing the infrastructure setup is the repo server container. If I attach to the container, the keepalive comes back correctly. Externally to the container, the HTTP connection is rejected.
Looking into the reasons, it looks like there are 3 networks visible from inside the container:
ens18: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet 10.0.1.38 netmask 255.255.255.0 broadcast 10.0.1.255 inet6 fe80::216:3eff:feab:ab45 prefixlen 64 scopeid 0x20<link> ether 00:16:3e:ab:ab:45 txqueuelen 1000 (Ethernet) RX packets 96 bytes 6952 (6.9 KB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 16 bytes 1236 (1.2 KB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet 10.0.3.113 netmask 255.255.255.0 broadcast 10.0.3.255 inet6 fe80::216:3eff:fe45:64d6 prefixlen 64 scopeid 0x20<link> ether 00:16:3e:45:64:d6 txqueuelen 1000 (Ethernet) RX packets 351 bytes 425285 (425.2 KB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 263 bytes 19114 (19.1 KB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 inet 127.0.0.1 netmask 255.0.0.0 inet6 ::1 prefixlen 128 scopeid 0x10<host> loop txqueuelen 1000 (Local Loopback) RX packets 2939 bytes 208134 (208.1 KB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 2939 bytes 208134 (208.1 KB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
The 10.0.1.38 address seems to be the problem. I think this is an internally routed subnet, not the actual physical subnet (note the lack of a gateway address, and 10.0.1.38 is definitely not being allocated via the DHCP server). Looking at some other containers, I'm seeing 10.0.2.x and 10.0.3.x there, so obviously this is being allocated on the fly either by lxc or the container(s).
TL;DR: The internal 10.0.1.x is clashing with the physical 10.0.1.x network, which is almost certainly why the keepalive is failing.
Does anyone have any idea how to fix the configuration to use some other CIDR block for this? I'd like to avoid the extreme pain of remapping my physical network (this is a way more complicated problem than a few test nodes, unfortunately!)
Thank you in advance, Sarah Thompson
-- [s]
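For reference, the 10.0.3.x address on eth0 looks like the stock LXC NAT bridge (lxcbr0) network, 10.0.3.0/24, which openstack-ansible's lxc_hosts role manages independently of the management address on ens18. If that NAT range ever needed moving as well, a user_variables.yml override along these lines should do it (a sketch only; the lxc_net_* variable names are taken from the lxc_hosts role defaults and the replacement range is made up, so verify both against your release):

# Sketch only - confirm these variable names in the lxc_hosts role defaults
# for your openstack-ansible release; the 192.168.200.x range is an example.
lxc_net_address: 192.168.200.1                      # new lxcbr0 bridge address
lxc_net_netmask: 255.255.255.0
lxc_net_dhcp_range: 192.168.200.2,192.168.200.253   # DHCP pool handed to containers on eth0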
On 17/03/2025 17:42, Kerem Celiker wrote:
Hey Sarah,
Your containers are using the same IP range as your physical network (both are on 10.0.1.x), and that's causing chaos: when you try to connect to a container (like your repo server), the network gets confused and HAProxy throws a 503 Service Unavailable error.
Basically the IPs are clashing, and traffic is getting tangled up.
Just leave your physical network alone and shift your containers to a different IP range, like 192.168.0.x - this way everyone stays in their own lane, the conflicts disappear, and you're good to go.
For openstack-ansible the default behaviour is to use addresses from the openstack management network for both the physical hosts and the LXC containers, as described throughout our documentation.
It is possible to use separate network ranges for the physical hosts, but this requires special configuration to work correctly, as described here:
https://docs.openstack.org/openstack-ansible/latest/reference/inventory/conf...
Regards, Jonathan.
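To illustrate that default, the management-network portion of openstack_user_config.yml usually looks roughly like the sketch below; the bridge name and CIDR here are placeholders rather than values from this thread:

# Illustrative sketch - substitute the real bridge name and management CIDR.
cidr_networks:
  container: 10.0.1.0/24          # management network shared by hosts and containers

global_overrides:
  management_bridge: "br-mgmt"
  provider_networks:
    - network:
        container_bridge: "br-mgmt"
        container_type: "veth"
        container_interface: "eth1"
        ip_from_q: "container"    # container IPs are drawn from cidr_networks.container
        type: "raw"
        group_binds:
          - all_containers
          - hosts
        is_container_address: true   # this address becomes the container's management IP

With is_container_address set, each container gets its management IP on eth1 from that pool, alongside the host's own address on the br-mgmt bridge.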
OK, got it. I'm reconfiguring the network; hopefully better luck next time! Thanks all!
On Mon, Mar 17, 2025 at 6:13 PM Jonathan Rosser <jonathan.rosser@rd.bbc.co.uk> wrote:
On 17/03/2025 17:42, Kerem Celiker wrote:
Hey Sarah,
Your containers are using the same IP range as your physical network (both are on 10.0.1.x), and that's causing chaos: when you try to connect to a container (like your repo server), the network gets confused and HAProxy throws a 503 Service Unavailable error.
Basically the IPs are clashing, and traffic is getting tangled up.
Just leave your physical network alone and shift your containers to a different IP range, like 192.168.0.x - this way everyone stays in their own lane, the conflicts disappear, and you're good to go.
For openstack-ansible the default behaviour is to use addresses from the openstack management network for both the physical hosts and the LXC containers, as described throughout our documentation.
It is possible to use separate network ranges for the physical hosts, but this requires special configuration to work correctly, as described here:
https://docs.openstack.org/openstack-ansible/latest/reference/inventory/conf...
Regards, Jonathan.
-- [s]
Hi Sarah,
Does anyone have any idea how to fix the configuration to use some other CIDR block for this? I'd like to avoid the extreme pain of remapping my physical network (this is a way more complicated problem than a few test nodes, unfortunately!)
I suspect that the additional addresses in 10.0.1.x are being allocated by the Ansible dynamic inventory.
If you already have a 10.0.1.x network for your VM IPs, and you are using that as the openstack management network, then some part of that network must be set aside for openstack-ansible to allocate IPs from.
Here is an example of defining the network ranges (cidr_networks), and the addresses which are *not* available for allocation by the openstack-ansible dynamic inventory:
https://github.com/openstack/openstack-ansible/blob/29ce380dd8c0d2533ef3a4a4...
Each host needs an IP address on the openstack management network, as does each LXC container.
If you are able to share your openstack_user_config.yml, perhaps at paste.opendev.org, then it would be possible to provide some pointers. Also, please do join the #openstack-ansible IRC channel if possible and you can get some more interactive help.
Regards, Jonathan.
Thank you in advance, Sarah Thompson
-- [s]
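To make the cidr_networks / used_ips settings Jonathan describes concrete, here is a minimal sketch for a 10.0.1.0/24 management network; the reserved ranges are invented for illustration, so list whatever the gateway, existing hosts/VMs and any DHCP pool actually occupy:

# Sketch only - the reserved addresses below are examples, not the real layout.
cidr_networks:
  container: 10.0.1.0/24

used_ips:
  # Anything listed here is never handed out by the dynamic inventory,
  # so containers cannot collide with addresses already in use on the network.
  - "10.0.1.1"                 # gateway
  - "10.0.1.2,10.0.1.100"      # existing hosts, VMs and DHCP pool (example range)

With the in-use addresses reserved like this, the dynamic inventory only assigns container IPs from the remaining part of the range, which avoids exactly the clash described above.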
participants (3)
- Jonathan Rosser
- Kerem Celiker
- Sarah Thompson