Hey Sarah,

Your containers are using the same IP range as your physical network (both are on 10.0.1.x), and that's what's causing the chaos: when something tries to reach a container (like your repo server), traffic gets routed to the wrong place, the health check fails, and HAProxy returns 503 Service Unavailable.

Basically the IPs are clashing, and traffic is getting tangled up.

Leave your physical network alone and shift the containers to a different IP range (something like 192.168.x.x) instead. That way everything stays in its own lane, the conflict disappears, and you're good to go.
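
If this is an openstack-ansible deployment (it looks like one, given the repo server and haproxy roles), the container addresses usually come from two places, so the change would look roughly like the sketch below -- file paths and variable names assume a stock deployment, so adjust to whatever your setup actually uses:

    # /etc/openstack_deploy/openstack_user_config.yml
    # management network the containers' ens-style interfaces are assigned
    # from -- move it off 10.0.1.x
    cidr_networks:
      container: 192.168.10.0/24

    # /etc/openstack_deploy/user_variables.yml
    # lxcbr0 (the containers' eth0) -- only change this if that range also
    # clashes with something on your side
    lxc_net_address: 192.168.20.1
    lxc_net_netmask: 255.255.255.0
    lxc_net_dhcp_range: 192.168.20.2,192.168.20.253

After changing the CIDRs you'll want to destroy and rebuild the affected containers (and re-run the setup playbooks) rather than trying to re-address them in place.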


Best regards,
Kerem ÇELİKER


On Mon, Mar 17, 2025 at 20:23 Sarah Thompson <plodger@gmail.com> wrote:
Hi all,

I've got much further now and am tracking down networking issues. I'm pretty sure everything is installing correctly, but I'm seeing a systematic issue.

My test network that I'm running the openstack instances on is 10.0.1.x -- the VMs all have fixed IP addresses, can talk to each other, etc, nothing weird going on. 

haproxy is installing and running, but throwing 503 errors. Digging into this, it seems that there are some issues with the network configs of at least some of the LXC containers. The one I'm seeing that's preventing Ansible from completing the infrastructure setup is the repo server container. If I attach to the container, the keepalive comes back correctly. Externally to the container, the HTTP connection is rejected.
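
(Concretely, the check I mean is along these lines, with <port> being whatever the haproxy config lists for the repo backend:

    # from the haproxy node -- connection refused, so haproxy marks the backend down
    curl -I http://10.0.1.38:<port>/
    # from inside the container (lxc-attach -n <repo-container-name>) -- responds fine
    curl -I http://localhost:<port>/
)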

Looking into the reasons, it looks like there are 3 networks visible from inside the container:

ens18: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.1.38  netmask 255.255.255.0  broadcast 10.0.1.255
        inet6 fe80::216:3eff:feab:ab45  prefixlen 64  scopeid 0x20<link>
        ether 00:16:3e:ab:ab:45  txqueuelen 1000  (Ethernet)
        RX packets 96  bytes 6952 (6.9 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 16  bytes 1236 (1.2 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.0.3.113  netmask 255.255.255.0  broadcast 10.0.3.255
        inet6 fe80::216:3eff:fe45:64d6  prefixlen 64  scopeid 0x20<link>
        ether 00:16:3e:45:64:d6  txqueuelen 1000  (Ethernet)
        RX packets 351  bytes 425285 (425.2 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 263  bytes 19114 (19.1 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 2939  bytes 208134 (208.1 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2939  bytes 208134 (208.1 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

The 10.0.1.38 address seems to be the problem. I think this is an internally routed subnet, not the actual physical subnet (note the lack of a gateway address, and 10.0.1.38 is definitely not being allocated by the DHCP server). Looking at some other containers, I'm seeing 10.0.2.x and 10.0.3.x there, so this is clearly being allocated on the fly either by lxc or the container(s).
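
(If it helps anyone reproduce this, the lxc side of things can be inspected on the host with something like the following, assuming the standard dnsmasq-backed lxc bridge:

    ip addr show lxcbr0                 # which subnet the lxc bridge itself is on
    grep -v '^#' /etc/default/lxc-net   # LXC_ADDR / LXC_NETWORK / LXC_DHCP_RANGE
    lxc-ls -f                           # addresses lxc thinks each container has
)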

TL;DR: The internal 10.0.1.x is clashing with the physical 10.0.1.x network, which is almost certainly why the keepalive is failing.

Does anyone have any idea how to fix the configuration to use some other CIDR block for this? I'd like to avoid the extreme pain of remapping my physical network (that's a much more complicated problem than a few test nodes, unfortunately!).

Thank you in advance,
Sarah Thompson

--
[s]