On 30/11/23 04:22, Dmitry Tantsur wrote:
Hi folks,
This is detailed context on the discussions we've had recently about multi-conductor in metal3. I'm sending it to potentially interested people and cc'ing the openstack ML. The metal3 side is https://github.com/dtantsur/ironic-operator/issues/3 (although that covers a few more different issues).
In metal3 we have Ironic without Neutron (or really any other OpenStack components). So far, we have only supported a single Ironic instance in a cluster, but we want to move towards a multi-conductor setup where every Kubernetes control plane node will host an Ironic (so, 3 Ironics in a normal deployment). The biggest problem stems from DHCP for the provisioning network.
With Neutron, Ironic can configure it to direct a node to boot from a specific iPXE server. With Metal3, we have a static DHCP configuration. If we keep it the way it is, it will always direct a node to boot from the iPXE server on the same machine as dnsmasq (regardless of how many dnsmasq replicas we run). This has to be changed. I see several options, none of which are straightforward.
(1) iPXE-level fix: boot config API
This is what https://bugs.launchpad.net/ironic/+bug/2044561 proposes. If we stop using static iPXE scripts and start going through the Ironic API instead, Ironic can reach out to the correct conductor internally, always serving the right script.
(2) dnsmasq DHCP implementation
Ironic now has a non-Neutron managed DHCP implementation through DHCP host and options files: https://opendev.org/openstack/ironic/src/branch/master/ironic/dhcp/dnsmasq.p.... This could be half of the solution. Since Metal3 does not support new node auto-discovery, we can run 3 copies of dnsmasq with dhcp-ignore=tag:!known and rely on this DHCP provider to configure only 1 dnsmasq for each node. The other 2 will ignore the requests.
The problem is in the DHCP range. As far as I understand, the 3 dnsmasqs must be configured with disjoint ranges to prevent leases from clashing. This seems non-trivial with Kubernetes, which expects to create identical pods. Adam has suggested using annotations from inside ironic-operator, which could work if we find a way to read them inside the pod after it has started (which seems to pose a chicken-and-egg problem).
The openstack-k8s-operators/ironic-operator runs conductors in a StatefulSet, and arguably the pods really are stateful given the node-specific assets that get written out. StatefulSets have index-based pod names, which provides a pod index that can be keyed off for pod-specific config. I see the ironic pods are running in either a DaemonSet or a Deployment. Maybe these could be consolidated into a single StatefulSet with a nodeSelector to replicate whatever the DaemonSet is doing?
As for the dnsmasq driver, the per-host dhcp-option entries are tagged with a UUID unique to that node. If I'm reading the docs correctly [1], the driver could be modified to add a second tag which is unique to the conductor (maybe even the conductor host name). Then a disjoint dhcp-range entry can be tagged with that same conductor tag. You could even write the same set of dhcp-range entries to all pods, and the correct range for a node would be driven entirely by tags (in which case the StatefulSet suggestion above can be ignored).
Finally, we may as well have the discussion now about how there are 2 operators called ironic-operator with distinct and valid use cases. I am very much looking forward to using dtantsur/ironic-operator, but I'm sure I won't be alone in finding it profoundly confusing to know which is which once it moves out of the dtantsur GitHub into (presumably) metal3-io. Could dtantsur/ironic-operator possibly be renamed? I know it sucks because the current name describes exactly what it does. The name of openstack-k8s-operators/ironic-operator is bound by other operators in that project, and is too far down the productization roadmap to be renamed at this stage.
[1] https://thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html
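To make the conductor-tag idea described above a bit more concrete, here is a minimal sketch of what the generated dnsmasq entries might look like. It is not the Ironic dnsmasq driver's code: the function names, the "conductor-" tag prefix and the exact layout of the host entry are assumptions for illustration, and it assumes dnsmasq accepts more than one set: tag on a dhcp-host entry and honours tag: on dhcp-range as the man page describes.

```python
def conductor_tag(conductor_host: str) -> str:
    """Build a tag name from a conductor host name (hypothetical naming scheme)."""
    # Keep the tag to letters, digits and '-' so it is safe inside a config line.
    safe = "".join(c if c.isalnum() else "-" for c in conductor_host)
    return f"conductor-{safe}"


def range_lines(ranges: dict[str, tuple[str, str]]) -> list[str]:
    """One dhcp-range line per conductor, restricted to requests carrying its tag.

    The same list can be written to every dnsmasq pod.
    """
    return [
        f"dhcp-range=tag:{conductor_tag(host)},{start},{end}"
        for host, (start, end) in sorted(ranges.items())
    ]


def host_line(mac: str, node_uuid: str, conductor_host: str) -> str:
    """Per-node hosts-file entry that sets both the node tag and the conductor tag."""
    return f"{mac},set:{node_uuid},set:{conductor_tag(conductor_host)}"


if __name__ == "__main__":
    # Example values only; addresses and names are made up.
    ranges = {
        "ironic-0": ("192.168.0.10", "192.168.0.39"),
        "ironic-1": ("192.168.0.40", "192.168.0.69"),
    }
    print("\n".join(range_lines(ranges)))
    print(host_line("52:54:00:aa:bb:cc", "node-uuid-1", "ironic-1"))
```

With this layout only the host entry differs per node, so the range a node draws from follows whichever conductor tag its entry sets.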
Other workarounds are possible if we go down this path. E.g. ironic-operator could split the DHCP range, create a mapping hostIP:subrange,hostIP2:subrange2 and pass it to the pods. Then each pod will pick its subrange based on its IP address. Huge downside: every time a Kubernetes control plane node is replaced or changes its IP, the whole Ironic installation will get restarted.
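As a rough illustration of the range-splitting workaround, the sketch below divides one DHCP range into disjoint, contiguous subranges keyed by host IP. The function name and the even-split policy are assumptions, not something ironic-operator does today.

```python
import ipaddress


def split_range(start: str, end: str, host_ips: list[str]) -> dict[str, tuple[str, str]]:
    """Split [start, end] into contiguous, disjoint subranges, one per host IP."""
    first = int(ipaddress.ip_address(start))
    last = int(ipaddress.ip_address(end))
    total = last - first + 1
    # Sort hosts numerically so the result does not depend on input order.
    hosts = sorted(host_ips, key=ipaddress.ip_address)
    if not hosts or total < len(hosts):
        raise ValueError("range too small for the number of hosts")
    size, extra = divmod(total, len(hosts))
    mapping = {}
    cursor = first
    for index, host in enumerate(hosts):
        # The first `extra` hosts get one additional address each.
        count = size + (1 if index < extra else 0)
        mapping[host] = (
            str(ipaddress.ip_address(cursor)),
            str(ipaddress.ip_address(cursor + count - 1)),
        )
        cursor += count
    return mapping


if __name__ == "__main__":
    # Example values only.
    for host, subrange in split_range(
        "192.168.0.10", "192.168.0.99", ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    ).items():
        print(host, subrange)
```

Each pod would look up its own IP in the mapping. Because the split depends on the full set of host IPs, any change to that set changes the mapping for everyone, which is exactly the restart downside mentioned above.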
(2.1) Pre-created leases
Another Metal3 contributor has suggested requiring an IPAM implementation (metal3 actually has one) that will create leases centrally, allowing us to use the same DHCP range. The only problem: how to pass them to dnsmasq at runtime without restarting all pods? We could probably have a small sidecar container next to dnsmasq that watches the Kubernetes API for new leases in IPAM. Or...
(2.2) Dynamic DHCP and Kea
The same contributor is working on Kea support for their deployment: https://hackmd.io/@7RNIJtmvSIeNpSOMTW9oqA/SkWhyr7Hp. There seem to be limitations in the free version, which caused them to write a tiny downstream plugin: https://gitlab.com/Orange-OpenSource/kanod/kanod-kea/-/blob/controlled-lease....
Changing the DHCP server is a dramatic step, and I'm not sure our small community can afford it. It may also pose problems for downstreams like OpenShift, requiring Metal3 to support both servers in the long run.
Any thoughts are welcome.
Dmitry