Hi folks,

This is detailed context on the discussions we've had recently about multi-conductor in Metal3. I'm sending it to potentially interested people and cc'ing the openstack ML. The Metal3 side is https://github.com/dtantsur/ironic-operator/issues/3 (although that covers a few other issues as well).

In Metal3 we have Ironic without Neutron (or really any other OpenStack components). So far, we have only supported a single Ironic instance in a cluster, but we want to move towards a multi-conductor setup where every Kubernetes control plane node will host an Ironic (so, 3 Ironics in a normal deployment).

The biggest problem stems from DHCP for the provisioning network. With Neutron, Ironic can configure it to direct a node to boot from a specific iPXE server. With Metal3, we have a static DHCP configuration. If we keep it the way it is, it will always direct a node to boot from the iPXE server on the same machine as dnsmasq (regardless of how many dnsmasq replicas we run). This has to be changed.

I see several options, none of which are straightforward.

(1) iPXE-level fix: boot config API

This is what https://bugs.launchpad.net/ironic/+bug/2044561 proposes. If we stop using static iPXE scripts and start going through the Ironic API instead, Ironic can reach out to the correct conductor internally, always serving the right script.

(2) dnsmasq DHCP implementation

Ironic now has a DHCP implementation that works without Neutron by managing dnsmasq through DHCP host and options files: https://opendev.org/openstack/ironic/src/branch/master/ironic/dhcp/dnsmasq.p.... This could be half of the solution. Since Metal3 does not support auto-discovery of new nodes, we can run 3 copies of dnsmasq with dhcp-ignore=!known and rely on this DHCP provider to configure only 1 dnsmasq for each node. The other 2 will ignore the requests.

The problem is in the DHCP range. As far as I understand, the 3 dnsmasqs must be configured with disjoint ranges to prevent leases from clashing.
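
For illustration only, here is a rough sketch of what carving one provisioning range into disjoint subranges could look like. The function name, the example addresses and the idea of keying the result by control plane host IP are assumptions made up for this example; this is not existing Metal3 or Ironic code.

```python
import ipaddress


def split_dhcp_range(start, end, host_ips):
    """Split the inclusive range [start, end] into contiguous,
    non-overlapping subranges, one per host IP (hypothetical helper)."""
    if not host_ips:
        raise ValueError("at least one host IP is required")
    first = int(ipaddress.ip_address(start))
    last = int(ipaddress.ip_address(end))
    total = last - first + 1
    if total < len(host_ips):
        raise ValueError("DHCP range is too small for this many hosts")

    # Sort so that the result does not depend on the input order.
    hosts = sorted(host_ips, key=ipaddress.ip_address)
    base, extra = divmod(total, len(hosts))

    mapping = {}
    cursor = first
    for index, host in enumerate(hosts):
        # The first `extra` hosts get one additional address each.
        size = base + (1 if index < extra else 0)
        mapping[host] = (
            str(ipaddress.ip_address(cursor)),
            str(ipaddress.ip_address(cursor + size - 1)),
        )
        cursor += size
    return mapping


if __name__ == "__main__":
    # Example addresses are invented for the sketch.
    result = split_dhcp_range(
        "172.22.0.10", "172.22.0.100",
        ["192.168.1.3", "192.168.1.1", "192.168.1.2"],
    )
    for host, (low, high) in result.items():
        print(f"{host}:{low}-{high}")
```

Sorting the host IPs keeps the assignment deterministic, so anything that computes the split from the same inputs gets the same answer. Note that the split depends on the full set of host IPs, so any change to that set shifts the subranges.
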
Configuring disjoint ranges seems non-trivial with Kubernetes, which expects to create identical pods. Adam has suggested using annotations from inside ironic-operator, which could work if we find a way to read them inside the pod after it has started (which seems to pose a chicken-and-egg problem).

Other workarounds are possible if we go down this path. E.g. ironic-operator could split the DHCP range, create a mapping hostIP:subrange,hostIP2:subrange2 and pass it to the pods. Then each pod will pick its subrange based on its IP address. Huge downside: every time a Kubernetes control plane node is replaced or changes its IP, the whole Ironic installation will get restarted.

(2.1) Pre-created leases

Another Metal3 contributor has suggested requiring an IPAM implementation (Metal3 actually has one) that will create leases centrally, allowing us to use the same DHCP range everywhere. The only problem: how do we pass the leases to dnsmasq at runtime without restarting all pods? We could probably have a small sidecar container next to dnsmasq that watches the Kubernetes API for new leases in IPAM. Or...

(2.2) Dynamic DHCP and Kea

The same contributor is working on Kea support for their deployment: https://hackmd.io/@7RNIJtmvSIeNpSOMTW9oqA/SkWhyr7Hp. There seem to be limitations in the free version, which caused them to write a tiny downstream plugin: https://gitlab.com/Orange-OpenSource/kanod/kanod-kea/-/blob/controlled-lease.... Changing the DHCP server is a dramatic step, and I'm not sure our small community can afford it. It may also pose problems for downstreams like OpenShift, requiring Metal3 to support both servers in the long run.

Any thoughts are welcome.

Dmitry