Hi folks,
This is detailed context on the discussions we've had recently about
multi-conductor in metal3. I'm sending it to potentially interested
people and cc'ing the openstack ML. The metal3 side is
https://github.com/dtantsur/ironic-operator/issues/3 (although it also
covers a few other issues).
In metal3 we have Ironic without Neutron (or really any other OpenStack
components). So far, we have only supported a single Ironic instance in
a cluster, but we want to move towards a multi-conductor setup where
every Kubernetes control plane node will host an Ironic (so, 3 Ironics
in a normal deployment). The biggest problem stems from DHCP for the
provisioning network.
With Neutron, Ironic can configure it to direct a node to boot from a
specific iPXE server. With Metal3, we have a static DHCP configuration.
If we keep it the way it is, it will always direct a node to boot from
the iPXE server on the same machine as dnsmasq (regardless of how many
dnsmasq replicas we run). This has to be changed. I see several options,
none of which are straightforward.
(1) iPXE-level fix: boot config API
This is what https://bugs.launchpad.net/ironic/+bug/2044561 proposes. If
we stop using static iPXE scripts and start going through the Ironic API
instead, Ironic can reach out to the correct conductor internally,
always serving the right script.
(2) dnsmasq DHCP implementation
Ironic now has a DHCP implementation that does not depend on Neutron and
manages dnsmasq through DHCP host and options files:
https://opendev.org/openstack/ironic/src/branch/master/ironic/dhcp/dnsmasq.py.
This could be half of the solution. Since Metal3 does not support
auto-discovery of new nodes, we can run 3 copies of dnsmasq with
dhcp-ignore=tag:!known and rely on this DHCP provider to configure only 1
dnsmasq for each node. The other 2 will ignore the requests.
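To illustrate the "only 1 dnsmasq knows each node" idea, here is a rough
sketch of how host entries could be partitioned between replicas. It is not
what the Ironic provider does today: the function name, the replica names
and the shape of the assignments mapping are all made up for the example.
The only thing taken from dnsmasq is the host-file line format (the part to
the right of "=" in dhcp-host).

```python
def render_hostsdir(assignments, replica):
    """Render dnsmasq host-file lines for a single dnsmasq replica.

    assignments is a hypothetical mapping: MAC -> (tag, owning replica).
    Only MACs owned by `replica` are emitted, so the other replicas never
    mark that host as "known" and ignore its DHCP requests.
    """
    lines = []
    for mac, (tag, owner) in sorted(assignments.items()):
        if owner == replica:
            # Same syntax as the right-hand side of a dhcp-host option.
            lines.append(f"{mac.lower()},set:{tag}\n")
    return "".join(lines)


# Hypothetical data: three nodes spread over three replicas.
ASSIGNMENTS = {
    "52:54:00:AA:00:01": ("node1", "ironic-0"),
    "52:54:00:AA:00:02": ("node2", "ironic-1"),
    "52:54:00:AA:00:03": ("node3", "ironic-0"),
}
```

With this data, render_hostsdir(ASSIGNMENTS, "ironic-0") yields entries for
node1 and node3 only, and "ironic-2" gets an empty file.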
The problem is in the DHCP range. As far as I understand, the 3 dnsmasqs
must be configured with disjoint ranges to prevent leases from clashing.
This seems non-trivial with Kubernetes which expects to create identical
pods. Adam has suggested using annotations from inside ironic-operator,
which could work if we find a way to read them inside the pod after it
has started (which seems to pose a chicken-and-egg problem).
Other workarounds are possible if we go down this path. For example,
ironic-operator could split the DHCP range, create a mapping
hostIP:subrange,hostIP2:subrange2 and pass it to the pods. Each pod
would then pick its subrange based on its IP address. Huge downside: every
time a Kubernetes control plane node is replaced or changes its IP, the
whole Ironic installation will get restarted.
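A minimal sketch of the range-splitting workaround, just to make the idea
concrete. Nothing like this exists in ironic-operator today; the function
name and the example addresses are invented, and the sketch assumes a
single contiguous range.

```python
import ipaddress


def split_dhcp_range(first, last, host_ips):
    """Split [first, last] into disjoint subranges, one per host IP.

    Returns a mapping hostIP -> (subrange_start, subrange_end). Hosts are
    sorted so that every pod computes the same mapping from the same input.
    """
    start = int(ipaddress.ip_address(first))
    end = int(ipaddress.ip_address(last))
    hosts = sorted(host_ips, key=lambda ip: int(ipaddress.ip_address(ip)))
    total = end - start + 1
    if not hosts or total < len(hosts):
        raise ValueError("DHCP range is too small for the number of hosts")

    size, extra = divmod(total, len(hosts))
    mapping = {}
    cursor = start
    for index, host in enumerate(hosts):
        # The first `extra` hosts get one more address each.
        count = size + (1 if index < extra else 0)
        mapping[host] = (
            str(ipaddress.ip_address(cursor)),
            str(ipaddress.ip_address(cursor + count - 1)),
        )
        cursor += count
    return mapping


# Each pod would look up its own host IP in the mapping (example values).
mapping = split_dhcp_range(
    "172.22.0.10", "172.22.0.39",
    ["192.168.1.3", "192.168.1.1", "192.168.1.2"],
)
my_subrange = mapping["192.168.1.2"]
```

The sketch also shows the downside: the mapping is a function of the full
host list, so any change to that list changes the input every pod depends
on.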
(2.1) Pre-created leases
Another Metal3 contributor has suggested requiring an IPAM
implementation (metal3 actually has one) that will create leases
centrally, allowing us to use the same DHCP range. The only problem: how
to pass them to dnsmasq at runtime without restarting all pods? We could
probably have a small sidecar container next to dnsmasq that watches the
Kubernetes API for new leases in IPAM. Or...
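As a sketch of what such a sidecar might write, assuming it has already
fetched the leases from IPAM as a MAC -> IPv4 mapping (the watching part is
left out, and all names below are hypothetical). My understanding is that
dnsmasq picks up new or changed files in a dhcp-hostsdir directory without
a restart, but that entries are only ever added this way, so removals would
still need a SIGHUP. This needs verification before we rely on it.

```python
import os


def render_lease_entries(leases):
    """Render MAC -> IPv4 leases as dnsmasq host-file lines ("mac,ip")."""
    return "".join(
        f"{mac.lower()},{ip}\n" for mac, ip in sorted(leases.items())
    )


def write_hosts_file(hostsdir, name, leases):
    """Write the rendered leases to <hostsdir>/<name> and return the path.

    hostsdir is assumed to be the directory passed to dnsmasq via
    dhcp-hostsdir; the file name is arbitrary.
    """
    path = os.path.join(hostsdir, name)
    with open(path, "w") as hosts_file:
        hosts_file.write(render_lease_entries(leases))
    return path
```

For example, render_lease_entries({"52:54:00:AA:00:01": "172.22.0.10"})
produces a single line that pins that MAC to that address.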
(2.2) Dynamic DHCP and Kea
The same contributor is working on Kea support for their deployment:
https://hackmd.io/@7RNIJtmvSIeNpSOMTW9oqA/SkWhyr7Hp. There seem to be
limitations in the free version, which caused them to write a tiny
downstream plugin:
https://gitlab.com/Orange-OpenSource/kanod/kanod-kea/-/blob/controlled-leases/kea-plugin/no_create.cc?ref_type=heads.
Changing the DHCP server is a dramatic step, and I'm not sure our small
community can afford it. It may also pose problems for downstreams
like OpenShift, requiring Metal3 to support both servers in the long run.
Any thoughts are welcome.
Dmitry