Back in the Diablo-to-Havana timeframe I helped maintain a custom-built DHCP server for an OpenStack distribution. It only had one job to do, it did that job pretty well, and going custom seemed to make sense at the time. That was before IPv6 became a prerequisite, and in environments far less complex than those seen today; even back then there were plenty of corner cases and problems that required hot patches, and eventually we ditched the custom code for dnsmasq. These days I wouldn't want to support a DHCP server that isn't widely used.

Kea is new, but it's from ISC, which developed the original old-school ISC DHCP server that Kea replaces. I would definitely feel comfortable going with Kea, maybe more so than with dnsmasq, considering that Kea is designed from the ground up for a multi-server environment. It adds a dependency on a MySQL or Postgres database, but that's a simple ask when you have a Kubernetes cluster at your disposal.
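
For reference, pointing Kea at an external lease database is a small stanza in its JSON configuration, roughly like this (the host and credentials are placeholders):

    "Dhcp4": {
        "lease-database": {
            "type": "mysql",
            "name": "kea",
            "host": "mariadb.example.svc",
            "user": "kea",
            "password": "secret"
        }
    }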

Changing DHCP servers is painful, but I would vote we go for 2.2 (dynamic DHCP and Kea) and prepare for the future. 

Dan Sneddon  |  Senior Principal OpenStack Engineer  |  dsneddon@redhat.com

On Nov 29, 2023, at 7:30 AM, Dmitry Tantsur <dtantsur@protonmail.com> wrote:

Hi folks,

This is detailed context on the discussions we've had recently about
multi-conductor in metal3. I'm sending it to potentially interested
people and cc'ing the openstack ML. The metal3 side is
https://github.com/dtantsur/ironic-operator/issues/3 (although that
covers a few other issues as well).

In metal3 we have Ironic without Neutron (or really any other OpenStack
components). So far, we have only supported a single Ironic instance in
a cluster, but we want to move towards a multi-conductor setup where
every Kubernetes control plane node will host an Ironic (so, 3 Ironics
in a normal deployment). The biggest problem stems from DHCP for the
provisioning network.

With Neutron, Ironic can configure it to direct a node to boot from a
specific iPXE server. With Metal3, we have a static DHCP configuration.
If we keep it the way it is, it will always direct a node to boot from
the iPXE server on the same machine as dnsmasq (regardless of how many
dnsmasq replicas we run). This has to be changed. I see several options,
none of which are straightforward.
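
To illustrate, the static configuration bakes one script URL into the
config, roughly like this (addresses are placeholders):

    dhcp-range=172.22.0.10,172.22.0.100
    # classify iPXE clients (they send DHCP option 175)
    dhcp-match=set:ipxe,175
    # plain PXE firmware chainloads the iPXE binary first
    dhcp-boot=tag:!ipxe,undionly.kpxe
    # iPXE clients fetch their script from one fixed host
    dhcp-boot=tag:ipxe,http://172.22.0.2:8080/boot.ipxe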

(1) iPXE-level fix: boot config API

This is what https://bugs.launchpad.net/ironic/+bug/2044561 proposes. If
we stop using static iPXE scripts and start going through the Ironic API
instead, Ironic can reach out to the correct conductor internally,
always serving the right script.
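
A hedged sketch of the idea; this is not the proposed Ironic API, and
the node-to-conductor lookup is faked with a dict standing in for
Ironic's hash ring (addresses and ports are placeholders):

    from flask import Flask, abort, redirect, request

    app = Flask(__name__)

    # stand-in for Ironic's hash-ring mapping of nodes to conductors
    NODE_TO_CONDUCTOR = {"52:54:00:aa:bb:cc": "172.22.0.3"}

    @app.route("/boot.ipxe")
    def boot_script():
        mac = request.args.get("mac", "").lower()
        conductor = NODE_TO_CONDUCTOR.get(mac)
        if conductor is None:
            abort(404)
        # send the client to the script on its own conductor
        return redirect(f"http://{conductor}:8080/boot.ipxe", code=302)

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=6385)

The DHCP side would then hand every node the same URL, with iPXE
filling in the MAC, e.g. chain http://api:6385/boot.ipxe?mac=${net0/mac}.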

(2) dnsmasq DHCP implementation

Ironic now has a non-Neutron DHCP implementation that manages dnsmasq
through its host and options files:
https://opendev.org/openstack/ironic/src/branch/master/ironic/dhcp/dnsmasq.py.
This could be half of the solution. Since Metal3 does not support new
node auto-discovery, we can run 3 copies of dnsmasq with
dhcp-ignore=tag:!known and rely on this DHCP provider to configure
only one of the dnsmasq instances for each node. The other two will
ignore the node's requests.
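
Concretely, the dnsmasq side could look roughly like this (paths and
addresses are placeholders; the exact file contents are whatever the
provider writes):

    dhcp-range=172.22.0.10,172.22.0.100
    # only answer clients that have a matching host entry
    dhcp-ignore=tag:!known
    # per-node files maintained by the Ironic dnsmasq DHCP provider
    dhcp-hostsdir=/var/lib/dnsmasq/hostsdir
    dhcp-optsdir=/var/lib/dnsmasq/optsdir

A hostsdir file is one dhcp-host line, e.g. 52:54:00:aa:bb:cc,set:ironic;
only the dnsmasq whose local provider wrote such a file treats the node
as known.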

The problem is the DHCP range. As far as I understand, the 3 dnsmasqs
must be configured with disjoint ranges to prevent leases from clashing.
This seems non-trivial on Kubernetes, which expects to create identical
pods. Adam has suggested using annotations from inside ironic-operator,
which could work if we find a way to read them inside the pod after it
has started (which seems to pose a chicken-and-egg problem).

Other workarounds are possible if we go down this path. E.g.
ironic-operator could split the DHCP range into a mapping
(hostIP:subrange,hostIP2:subrange2) and pass it to the pods; each pod
would then pick its subrange based on its host IP (a rough sketch
follows below). Huge downside: every time a Kubernetes control plane
node is replaced or changes its IP, the whole Ironic installation will
get restarted.
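
A minimal sketch of the pod-side logic, assuming the operator passes
the mapping in a DHCP_SUBRANGES environment variable and the pod gets
its host IP from the downward API (both names are made up):

    import os

    def pick_subrange(mapping: str, host_ip: str) -> str:
        # mapping: "10.0.0.1:172.22.0.10-172.22.0.40,10.0.0.2:..."
        for entry in mapping.split(","):
            ip, _, subrange = entry.partition(":")
            if ip == host_ip:
                start, _, end = subrange.partition("-")
                return f"dhcp-range={start},{end}"
        raise SystemExit(f"no DHCP subrange for host {host_ip}")

    if __name__ == "__main__":
        # output appended to dnsmasq.conf before dnsmasq starts
        print(pick_subrange(os.environ["DHCP_SUBRANGES"],
                            os.environ["HOST_IP"]))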

(2.1) Pre-created leases

Another Metal3 contributor has suggested requiring an IPAM
implementation (metal3 actually has one) that would create leases
centrally, allowing us to use the same DHCP range. The only problem: how
to pass them to dnsmasq at runtime without restarting all pods? We could
probably have a small sidecar container next to dnsmasq that watches the
Kubernetes API for new leases in IPAM.
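
A hedged sketch of such a sidecar, assuming the kubernetes Python
client and metal3's ip-address-manager CRDs; the MAC annotation key is
entirely made up:

    import pathlib
    from kubernetes import client, config, watch

    HOSTSDIR = pathlib.Path("/var/lib/dnsmasq/hostsdir")

    config.load_incluster_config()
    api = client.CustomObjectsApi()

    for event in watch.Watch().stream(api.list_cluster_custom_object,
                                      group="ipam.metal3.io",
                                      version="v1alpha1",
                                      plural="ipaddresses"):
        obj = event["object"]
        ip = obj["spec"]["address"]
        # assumes something ties the lease to a MAC, e.g. an annotation
        mac = obj["metadata"].get("annotations", {}).get("metal3.io/mac")
        if not mac:
            continue
        path = HOSTSDIR / mac
        if event["type"] == "DELETED":
            # dnsmasq may need a SIGHUP to forget a deleted entry
            path.unlink(missing_ok=True)
        else:
            # one dhcp-host line: this MAC gets exactly this address
            path.write_text(f"{mac},{ip}\n")

Newer dnsmasq versions watch dhcp-hostsdir with inotify, so new leases
would take effect without restarting the pod. Or...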

(2.2) Dynamic DHCP and Kea

The same contributor is working on Kea support for their deployment:
https://hackmd.io/@7RNIJtmvSIeNpSOMTW9oqA/SkWhyr7Hp. There seem to be
limitations in the free version, which caused them to write a tiny
downstream plugin:
https://gitlab.com/Orange-OpenSource/kanod/kanod-kea/-/blob/controlled-leases/kea-plugin/no_create.cc?ref_type=heads.

Changing the DHCP server is a dramatic step; I'm not sure our small
community can afford it. It may also pose problems for downstreams
like OpenShift, requiring Metal3 to support both servers in the long run.

Any thoughts are welcome.

Dmitry