[ironic] multi-conductor architecture, standalone and DHCP
Hi folks,
This is detailed context on the discussions we've had recently about multi-conductor in metal3. I'm sending it to potentially interested people and cc'ing the openstack ML. The metal3 side is https://github.com/dtantsur/ironic-operator/issues/3 (although that covers a few more different issues).
In metal3 we have Ironic without Neutron (or really any other OpenStack components). So far, we have only supported a single Ironic instance in a cluster, but we want to move towards a multi-conductor setup where every Kubernetes control plane node will host an Ironic (so, 3 Ironics in a normal deployment). The biggest problem stems from DHCP for the provisioning network.
With Neutron, Ironic can configure it to direct a node to boot from a specific iPXE server. With Metal3, we have a static DHCP configuration. If we keep it the way it is, it will always direct a node to boot from the iPXE server on the same machine as dnsmasq (regardless of how many dnsmasq replicas we run). This has to be changed. I see several options, none of which are straightforward.
(1) iPXE-level fix: boot config API
This is what https://bugs.launchpad.net/ironic/+bug/2044561 proposes. If we stop using static iPXE scripts and start going through the Ironic API instead, Ironic can reach out to the correct conductor internally, always serving the right script.
(2) dnsmasq DHCP implementation
Ironic now has a non-neutron managed DHCP implementation through DHCP host and options files: https://opendev.org/openstack/ironic/src/branch/master/ironic/dhcp/dnsmasq.p.... This could be half of the solution. Since Metal3 does not support new node auto-discovery, we can run 3 copies of dnsmasq with dhcp-ignore=!known and rely on this DHCP provider to configure only 1 dnsmasq for each node. The other 2 will ignore the requests.
The problem is in the DHCP range. As far as I understand, the 3 dnsmasqs must be configured with disjoint ranges to prevent leases from clashing. This seems non-trivial with Kubernetes, which expects to create identical pods. Adam has suggested using annotations from inside ironic-operator, which could work if we find a way to read them inside the pod after it has started (which seems to pose a chicken-and-egg problem).
Other workarounds are possible if we go down this path. E.g. ironic-operator could split the DHCP range, create a mapping hostIP:subrange,hostIP2:subrange2 and pass it to the pods. Then each pod will pick its subrange based on its IP address. Huge downside: every time a Kubernetes control plane node is replaced or changes its IP, the whole Ironic installation will get restarted.
(2.1) Pre-created leases
Another Metal3 contributor has suggested requiring an IPAM implementation (metal3 actually has one) that will create leases centrally, allowing us to use the same DHCP range. The only problem: how to pass them to dnsmasq at runtime without restarting all pods? We could probably have a small sidecar container next to dnsmasq that watches the Kubernetes API for new leases in IPAM. Or...
(2.2) Dynamic DHCP and Kea
The same contributor is working on Kea support for their deployment: https://hackmd.io/@7RNIJtmvSIeNpSOMTW9oqA/SkWhyr7Hp. There seem to be limitations in the free version, which caused them to write a tiny downstream plugin: https://gitlab.com/Orange-OpenSource/kanod/kanod-kea/-/blob/controlled-lease....
Changing the DHCP server is a dramatic step, and I'm not sure our small community can afford it. It may also pose problems for downstreams like OpenShift, requiring Metal3 to support both servers in the long run.
Any thoughts are welcome.
Dmitry
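The hostIP:subrange workaround from the message above can be made concrete with a small sketch. This is only an illustration, not anything ironic-operator does today: the function names, the host IPs and the 172.22.0.10-172.22.0.99 range are all assumptions chosen for the example.

```python
import ipaddress


def split_range(first: str, last: str, parts: int) -> list[tuple[str, str]]:
    """Split the inclusive range [first, last] into `parts` disjoint subranges."""
    lo = int(ipaddress.ip_address(first))
    hi = int(ipaddress.ip_address(last))
    total = hi - lo + 1
    if parts < 1 or total < parts:
        raise ValueError("range is too small for the requested number of parts")
    size, extra = divmod(total, parts)
    subranges = []
    start = lo
    for i in range(parts):
        # The first `extra` subranges receive one additional address.
        end = start + size + (1 if i < extra else 0) - 1
        subranges.append((str(ipaddress.ip_address(start)),
                          str(ipaddress.ip_address(end))))
        start = end + 1
    return subranges


def build_mapping(host_ips: list[str], first: str, last: str) -> dict[str, tuple[str, str]]:
    """Assign one subrange per host IP; sorting keeps the result deterministic."""
    ordered = sorted(host_ips, key=ipaddress.ip_address)
    return dict(zip(ordered, split_range(first, last, len(ordered))))


# Hypothetical control plane host IPs and provisioning range.
mapping = build_mapping(
    ["192.168.111.22", "192.168.111.20", "192.168.111.21"],
    "172.22.0.10", "172.22.0.99",
)
for host_ip, (start, end) in mapping.items():
    print(f"{host_ip}:{start}-{end}")
```

Because the mapping is derived from the full list of host IPs, any change to that list produces a different mapping for every pod, which is exactly the restart downside described above.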
On 30/11/23 04:22, Dmitry Tantsur wrote:
Hi folks,
This is detailed context on the discussions we've had recently about multi-conductor in metal3. I'm sending it to potentially interested people and cc'ing the openstack ML. The metal3 side is https://github.com/dtantsur/ironic-operator/issues/3 (although that covers a few more different issues).
In metal3 we have Ironic without Neutron (or really any other OpenStack components). So far, we have only supported a single Ironic instance in a cluster, but we want to move towards a multi-conductor setup where every Kubernetes control plane node will host an Ironic (so, 3 Ironics in a normal deployment). The biggest problem stems from DHCP for the provisioning network.
With Neutron, Ironic can configure it to direct a node to boot from a specific iPXE server. With Metal3, we have a static DHCP configuration. If we keep it the way it is, it will always direct a node to boot from the iPXE server on the same machine as dnsmasq (regardless of how many dnsmasq replicas we run). This has to be changed. I see several options, none of which are straightforward.
(1) iPXE-level fix: boot config API
This is what https://bugs.launchpad.net/ironic/+bug/2044561 proposes. If we stop using static iPXE scripts and start going through the Ironic API instead, Ironic can reach out to the correct conductor internally, always serving the right script.
(2) dnsmasq DHCP implementation
Ironic now has a non-neutron managed DHCP implementation through DHCP host and options files: https://opendev.org/openstack/ironic/src/branch/master/ironic/dhcp/dnsmasq.p.... This could be a half of the solution. Since Metal3 does not support new node auto-discovery, we can run 3 copies of dnsmasq with dhcp-ignore=!known and rely on this DHCP provider to configure only 1 dnsmasq for each node. The other 2 will ignore the requests.
The problem is in the DHCP range. As far as I understand, the 3 dnsmasqs must be configured with disjoint ranges to prevent leases from clashing. This seems non-trivial with Kubernetes which expects to create identical pods. Adam has suggested using annotations from inside ironic-operator, which could work if we find a way to read them inside the pod after it has started (which seems to pose a chicken-and-egg problem).
The openstack-k8s-operators/ironic-operator runs conductors in a stateful set, and arguably the pods really are stateful given the node-specific assets that get written out. Stateful sets have index-based pod names, which provides a pod index that can be keyed off for pod-specific config. I see ironic pods are running either in a DaemonSet or a Deployment. Maybe these could consolidate to only a StatefulSet with a nodeSelector to replicate whatever the DaemonSet is doing?
As for the dnsmasq driver, the per-host dhcp-option entries are tagged with a uuid unique to that node. If I'm reading the docs correctly[1] the driver could be modified to add a second tag which is unique to the conductor (maybe even the conductor host name). Then a disjoint dhcp-range entry can be tagged with that same conductor tag. You could even write the same multiple dhcp-range entries to all pods and the correct range for a node will entirely be driven by tags (and the above about StatefulSet can be ignored).
Finally, we may as well have the discussion now about how there are 2 operators called ironic-operator with distinct and valid use cases. I am very much looking forward to using dtantsur/ironic-operator, but I'm sure I won't be alone in finding it profoundly confusing to know which is which once it moves out of the dtantsur github into (presumably) metal3-io. Could dtantsur/ironic-operator possibly be renamed? I know it sucks because the current name describes exactly what it does. The name of openstack-k8s-operators/ironic-operator is bound by other operators in that project, and is too far down the productization roadmap to be renamed at this stage.
[1] https://thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html
Hi,
On 12/1/23 00:11, Steve Baker wrote:
On 30/11/23 04:22, Dmitry Tantsur wrote:
Hi folks,
<snip>
(2) dnsmasq DHCP implementation
Ironic now has a non-neutron managed DHCP implementation through DHCP host and options files: https://opendev.org/openstack/ironic/src/branch/master/ironic/dhcp/dnsmasq.p.... This could be a half of the solution. Since Metal3 does not support new node auto-discovery, we can run 3 copies of dnsmasq with dhcp-ignore=!known and rely on this DHCP provider to configure only 1 dnsmasq for each node. The other 2 will ignore the requests.
The problem is in the DHCP range. As far as I understand, the 3 dnsmasqs must be configured with disjoint ranges to prevent leases from clashing. This seems non-trivial with Kubernetes which expects to create identical pods. Adam has suggested using annotations from inside ironic-operator, which could work if we find a way to read them inside the pod after it has started (which seems to pose a chicken-and-egg problem).
The openstack-k8s-operators/ironic-operator runs conductors in a stateful set, and arguably the pods really are stateful given the node specific assets that get written out. Stateful sets have index based pod names, which provides a pod index that can be keyed off for pod specific config. I see ironic pods are running either in a DaemonSet or a Deployment. Maybe these could consolidate to only a StatefulSet with a nodeSelector to replicate whatever the DaemonSet is doing?
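As a rough sketch of keying pod-specific config off the StatefulSet index: StatefulSet pods are named <statefulset-name>-<ordinal>, so the ordinal can be parsed from the pod hostname and used to pick a range. The pod name "ironic-1", the SUBRANGES list and both helper functions are hypothetical and only serve the illustration.

```python
import re


def pod_ordinal(hostname: str) -> int:
    """Return the trailing ordinal of a StatefulSet pod name such as 'ironic-2'."""
    match = re.search(r"-(\d+)$", hostname)
    if match is None:
        raise ValueError(f"{hostname!r} does not end in a StatefulSet ordinal")
    return int(match.group(1))


# Hypothetical per-replica ranges; the same list would be given to every pod.
SUBRANGES = [
    ("172.22.0.10", "172.22.0.39"),
    ("172.22.0.40", "172.22.0.69"),
    ("172.22.0.70", "172.22.0.99"),
]


def dhcp_range_for(hostname: str) -> str:
    """Build the dnsmasq dhcp-range line for the pod with this hostname."""
    start, end = SUBRANGES[pod_ordinal(hostname)]
    return f"dhcp-range={start},{end}"


# Inside a real pod the name would come from the pod's hostname instead.
print(dhcp_range_for("ironic-1"))
```

With this approach all pods can share one identical configuration, and only the hostname decides which range a given dnsmasq serves.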
I've considered DaemonSets, but the Kubernetes documentation is, as always, hello-world level, and I have some concerns about the DaemonSet limitations. Most importantly, the headless service. I'm using a service already to achieve easy load-balancing. With a headless service, I would lose it, potentially causing BMO to always access the 1st pod (depending on how Go's HTTP client is implemented). From just reading the docs, I'm also not sure if I can use volumes of type EmptyDir (i.e. not persistent). We use them quite actively now.
As for the dnsmasq driver, the per-host dhcp-option entries are tagged with a uuid unique to that node. If I'm reading the docs correctly[1] the driver could be modified to add a second tag which is unique to the conductor (maybe even the conductor host name). Then a disjoint dhcp-range entry can be tagged with that same conductor tag. You could even write the same multiple dhcp-range entries to all pods and the correct range for a node will entirely be driven by tags (and the above about StatefulSet can be ignored).
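To illustrate the tag idea, here is a minimal sketch that renders the kind of dnsmasq lines it implies: a host entry that sets both a per-node tag and a per-conductor tag, and dhcp-range entries that match on the conductor tag. The set:/tag: syntax is taken from the dnsmasq man page; the helper names, tags, MAC address and ranges are made up, and this is not what the Ironic dnsmasq driver currently writes. It also assumes dnsmasq applies host tags before selecting a range, which would need to be verified.

```python
def host_entry(mac: str, node_tag: str, conductor_tag: str) -> str:
    """Render a dhcp-host line that sets a per-node and a per-conductor tag."""
    return f"dhcp-host={mac},set:{node_tag},set:{conductor_tag}"


def range_entry(conductor_tag: str, start: str, end: str) -> str:
    """Render a dhcp-range line that only applies when the conductor tag is set."""
    return f"dhcp-range=tag:{conductor_tag},{start},{end}"


# Hypothetical conductor tags and disjoint ranges, identical on every pod.
RANGES = {
    "conductor0": ("172.22.0.10", "172.22.0.39"),
    "conductor1": ("172.22.0.40", "172.22.0.69"),
    "conductor2": ("172.22.0.70", "172.22.0.99"),
}

lines = [range_entry(tag, start, end) for tag, (start, end) in RANGES.items()]
# A node managed by the second conductor; the MAC and node tag are placeholders.
lines.append(host_entry("52:54:00:12:34:56", "node1", "conductor1"))
print("\n".join(lines))
```

Since the range lines are the same everywhere, only the per-node host entry written by the owning conductor determines which range a node's address comes from.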
Finally, we may as well have the discussion now about how there are 2 operators called ironic-operator with distinct and valid use cases. I am very much looking forward to using dtantsur/ironic-operator but sure I won't be alone in finding it profoundly confusing to know which is what once it moves out of the dtantsur github into (presumably) metal3-io. Could dtantsur/ironic-operator possibly be renamed? I know it sucks because the current name describes exactly what it does. The name of openstack-k8s-operators/ironic-operator is bound by other operators in that project, and is too far down the productization roadmap to be renamed at this stage.
I hear your pain, but I also don't know a good way out. I don't think we have any hope of ever converging the two operators - the use cases and the underlying implementations are too different. I also don't have good ideas for a name (baremetal-operator could be one, but it's already a thing we have despite it not being an operator at all, sigh). Suggestions welcome, but I'll be against something non-descriptive or potentially misleading.
The only chance is to expand its scope to also install and manage baremetal-operator. Then we can call it metal3-operator. But the last time I suggested that, the community did not like the idea of managing BMO. I'll try again.
Dmitry
[1] https://thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html
participants (3)
- Dan Sneddon
- Dmitry Tantsur
- Steve Baker