Greetings everyone! Last week Ironic had a wildly successful PTG! We averaged around fifteen attendees most days, and drew upwards of twenty-two for most of Wednesday while we discussed a number of related topics around networking. Overall, networking aspects seem to be drawing the most current interest. This is partly the result of a change in the market ecosystem, where vendors are less focused on SDN integrations, and partly due to deliberate limitations placed inside of Ironic when the original networking multi-tenancy effort was executed back in 2015-2016, in order to limit scope creep while still meeting the majority of infrastructure operator requirements. Other major topics were Eventlet and some operator feedback on quirks of features, which highlighted some possible bugs and areas for improvement. We also got into some areas which have been a bit contentious in the past, but we reached some reasonable compromises which allowed us to find better paths forward.

Monday:

We largely discussed where we were at and where we were going. To highlight: ironic-lib has now been retired. Redfish based graphical console support merged, and we identified some more work that is likely needed. Bootable container support *and* support for artifacts from an OCI container registry were also viewed as completed during the cycle, which ultimately improves the options available to infrastructure operators who are operating in mixed environments or with mixed requirements outside of the classical "everything is a VM in OpenStack" context. We've improved the linting across the majority of project repositories. Out of band inspection rules likely need more work around ensuring data structures are what we expect, but otherwise we are re-affirming our deprecation of ironic-inspector as a standalone project. Work to support Kea as a DHCP backend has paused for now. The plan exists, but realistically the contributor working in that area has a different focus at the moment, which is okay! Container based IPA steps also didn't make it into the Epoxy cycle release, but are already almost done in Flamingo. Schema validation and OpenAPI work is also still underway, and we broadly expect this to merge during Flamingo. In-band disk encryption as part of the ``direct`` deploy interface didn't make any forward progress due to shifting contributor priorities. As a note, the bootable container work did extend this as a possible option, however it remains unclear if that will solve the overall need for the contributor who proposed adding support for encrypted volumes to the ``direct`` interface. Efforts to find an alternative to TinyCore usage in CI also stalled out. It turns out building a super-low-memory ramdisk is less of a trivial problem than hoped, and other fixes to the build jobs improved general reliability in the meantime, so it is less pressing. We also didn't make any progress on Project Mercury, as it was likely framed to try and keep too many options open for operators, and we also lost contributor velocity to some CVE work this past cycle.

Having discussed what we achieved, we were able to focus on the key aspects contributors know they need to focus on this coming cycle. Networking was the biggest topic in this area, promptly followed by eventlet removal. There was also consensus that we need to ensure we're working together a bit better, while not explicitly blocking any one aspect, as there is power in user/operator choice as well.
Some possible interest was also expressed around extending network device firmware upgrades, further delineating/improving metrics, and possibly supporting more of a "push" firmware update model, which more vendors are adopting.

We then shifted to CI, and discussed challenges and the path forward. Generally, there is some interest in trying to make some of our job executions a bit more selective, and also to dial back our reliance upon integration scenario jobs. Overall, it is a broad area of work which will require further discussion to frame an ideal future state. In this we also discussed CentOS Stream 10, Python versions, and ultimately possible paths forward, at minimum highlighting other options which may enable maintenance activities to ensure these cases keep working. Later in the day, outside of the PTG schedule, we shifted to CI knowledge sharing to help spread overall context among newer members of the team.

Tuesday!

We started Tuesday with the topic of supporting a use case desired by some operators: "allocating" baremetal nodes into a state which, to Ironic, signifies the node has been deployed. The use case behind that is a bit odd, more for research and academic cases where other tools may be desired to meet very specific requirements. The discussion yielded an operator in the scientific space who could benefit, so some ideas and a possible path forward were identified.

Then we got into the world of eventlet, with a mini-retrospective of the challenges and identification of future steps. What we thought would be simple turned out to be the hardest problem. Ironic Python Agent, originally identified as a good candidate for an early eventlet migration, has turned out to be extremely difficult due to its need to spawn its own WSGI server late in the process. Upon further reflection, Ironic itself has this issue as well, because we actually have two places a WSGI server might be spun up: ironic-api (obviously) and the conductor when using json-rpc. It was clear that solving that technical problem is the first step in our migration. We found some examples of gunicorn used in this way; that may be what we look to for an initial prototype. A rough, purely illustrative sketch of the underlying "WSGI server in a plain thread" shape is included at the end of Tuesday's notes below.

We then spoke about node/hardware monitoring patterns that operators are using and interested in. Redfish based hardware supports sending a notification to a web service on a threshold violation, so we discussed adding a listener into Ironic that could receive these and integrate them into the current hardware event notifications. We agreed that a spec would be necessary, so we will craft one and proceed from there. Further improvements around making it easier for operators to separate monitoring/events of the services from the hardware itself were also discussed, but oslo.messaging does not allow for this type of split routing today. We spoke about looking further into ironic-prometheus-exporter to see how it could be configured to potentially support this separation.

To wrap up Tuesday, we drifted into discussion of the serial console support which Ironic has today. In essence, there is quite a bit of room for possible improvement there, and some operators who are not using Nova wouldn't mind seeing this area get attention moving forward. The broad idea is that we might explore creation of a new interface which enables SSH connections to be proxied through to the IPMI Serial-over-LAN capabilities in a BMC. This distinctly requires IPMI, since there really is not a great answer to serial console/interface access with Redfish. Ultimately this may also mean we extend the redfish hardware type to support an IPMI-based console interface as well, which seems weird, but the discussion was in broad agreement that this space has unique challenges where this makes sense. Some initial ideas were entered into a spec document for discussion and refinement, as some of the interested parties are also on the perimeter of the Ironic project. A rough sketch of the kind of per-node shim such a proxy might exec follows below.
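To make the serial console idea above slightly more concrete, here is a minimal, purely illustrative sketch of the kind of per-node shim an SSH-based console proxy could exec as a forced command: it simply replaces itself with an ``ipmitool sol activate`` session against the node's BMC. The environment variable names (``CONSOLE_BMC_*``) and the overall shape are assumptions for illustration only; nothing here is taken from the spec draft, and a real interface would need to handle credential management, ``sol deactivate`` on exit, timeouts, and session lifecycle::

    #!/usr/bin/env python3
    # Illustrative sketch only: a per-node "forced command" an SSH console
    # proxy could exec to attach the caller to a BMC Serial-over-LAN session.
    # BMC address/credentials come from hypothetical environment variables;
    # a real implementation would pull them from the node's driver_info and
    # manage session setup/teardown properly.
    import os
    import sys


    def main():
        bmc_addr = os.environ.get("CONSOLE_BMC_ADDRESS")
        bmc_user = os.environ.get("CONSOLE_BMC_USERNAME")
        bmc_pass_file = os.environ.get("CONSOLE_BMC_PASSWORD_FILE")
        if not all([bmc_addr, bmc_user, bmc_pass_file]):
            sys.exit("console shim: missing BMC connection details")

        # Replace this process with an interactive SOL session; the SSH
        # server's allocated pty carries the console traffic to the user.
        os.execvp("ipmitool", [
            "ipmitool", "-I", "lanplus",
            "-H", bmc_addr,
            "-U", bmc_user,
            "-f", bmc_pass_file,
            "sol", "activate",
        ])


    if __name__ == "__main__":
        main()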
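Circling back to the eventlet/WSGI point from earlier in Tuesday's notes: purely to illustrate the shape of the problem (spawning a WSGI server late in an already-running process without eventlet), below is a small, self-contained sketch using only the standard library's ``wsgiref`` and ``threading``. This is not the approach Ironic has settled on, and a gunicorn-based prototype would look quite different; it simply shows that the building blocks exist outside of green threads::

    # Illustrative only: starting a WSGI server late in a long-running
    # process without eventlet, using a plain OS thread and the stdlib.
    import threading
    import time
    from wsgiref.simple_server import make_server


    def app(environ, start_response):
        # Trivial WSGI application standing in for a JSON-RPC/REST endpoint.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"alive\n"]


    def start_wsgi_in_thread(host="127.0.0.1", port=8089):
        # make_server uses blocking sockets; no green concurrency involved.
        server = make_server(host, port, app)
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        return server, thread


    if __name__ == "__main__":
        server, _ = start_wsgi_in_thread()
        print("WSGI server listening on http://127.0.0.1:8089/")
        # The "main" part of the process keeps doing its own work here;
        # server.shutdown() stops the listener cleanly when we are done.
        time.sleep(5)
        server.shutdown()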
Wednesday!

Wednesday was our deep dive into networking! Our very first topic was focused on improving the capabilities around the intentional limitations which were encoded in Ironic's early multi-tenant networking work. The early work was mainly focused on ensuring tenant level isolation onto separate L2 domains, such as VLANs. In large part, this boils down to the mapping, scheduling, and ultimately the dynamic assembly of a port group (bonded interfaces). We came into this session with a specification document from a discussion which occurred a few weeks prior; there was general consensus on this approach, and discussion highlighted a few other intertwined aspects which need to be accounted for.

We then dove into ACL support for Networking Generic Switch. Overall, consensus was reached. This discussion also highlighted a number of aspects we likely need to keep in mind, which may best be described as an overall need for additional documentation. There are also some further networking discussions which may need to be revisited. Ultimately, this is very much an "integrated" use case which was only appealing to about half of the attendees, which is typical given the variety of use cases Ironic solves and is able to support.

Then we reached the topic of standalone switch management. This quickly became a retrospective into why we didn't make progress with Mercury, and then drove into the real minimal requirements which would need to be taken into account. Some of the discussion revolved around questions which were semi-answered in the Mercury plan itself: trying to provide enough overlap to be generally usable in more than one use model/case while also avoiding fragmentation. This discussion also crossed over into DPUs and how to support them, in large part because the overall model is very different. Some discussion also shifted toward possibly merging some of the ideas and tools together. Ultimately this topic requires much more discussion, and to make solid progress on networking we need to enable ourselves to move in two directions at once while not blocking each other. With this in mind, we're forming two sub-teams which will focus on each area with their own advocates/champions, the goal being to report back to the community each week in a quick-update style of engagement.

Thursday!

Thursday marked our final day of PTG for Ironic. We started by wrapping up some of the networking discussions, in large part focused on how to make progress. We used this time to identify our high level organizational plans, determine broad interest levels, and ultimately plan outreach. One key aspect which was highlighted is that the initial primary focus for some will likely be making progress on eventlet before shifting gears into networking.

We then shifted gears to discussing interest in, and requirements around, mapping storage volumes to hosts via DPU devices, which somewhat crossed over into the networking topics. The broad idea was presented along with a potential need to frame compound drivers around facilitating broadly different configuration actions.
For example, invoking commands directly via SSH, or updating a CRD in a distinct OpenShift cluster. Overall, the discussion yielded that invoking a CRD update was maybe not worthwhile, given the flux those areas can experience along with competing requirements. Ultimately this is an area in early discussion and would require investment in time and hardware to move forward, but there were really no objections to doing so if we could model and extend what already exists in a useful and applicable way.

We then drifted back into networking with our next topic: bridging distinctly different types of fabric together. This was raised much more as a question, to see if there was an interest, need, and/or requirement. In essence, this is similar to the original l2gw project. A huge highlight of embracing or extending toward a more ideal model is that, if this existed, it might not be necessary for VXLAN to be considered on the physical side.

Discussions then shifted to improving servicing interactions. Servicing functionality was added to Ironic during one of the recent development cycles to enable firmware upgrades to take place on deployed nodes. In discussion, it was highlighted that in the current model this could take five or more reboots to complete, as the current mechanism is largely modeled on the use of agent heartbeats. The discussion shifted to how we could avoid this, and the obvious answer was "add a periodic". Then concerns over adding more periodic queries and jobs shifted the discussion into how we can address that challenge. Ultimately, this may end up with us in a better place for tasks which need to follow up on state changes in hardware and revisit the state before proceeding with the next step to execute.

As the final topic, we dove into deploy steps. This really boiled down to "how do I know what will occur" when I ask Ironic to do something, along with "how do I know what occurred". Consensus was reached that the key aspect to highlight back to the user is "what was done" or "what do we expect to be done" in terms of steps. An idea was raised to record this into node history, which is a historical record of actions/errors for a node that has existed in Ironic for a number of cycles now, and is most useful if you're trying to figure out prior events without going to the logs. Consensus was that doing it this way would be a "quick win".

At some point, Ironic contributors also sat down outside of the PTG schedule for a review jam, to review/discuss and merge existing items sitting awaiting reviews.

Overall, the week was extremely productive! Thanks to everyone who took part. Additional thanks to everyone who collaborated on this summary, and everyone who helped get our project priorities published for this work cycle [3]. Please remember to follow up if you have action items.

And onward to Flamingo!

- The Ironic Team

[0]: https://etherpad.opendev.org/p/ironic-ptg-april-2025
[1]: https://review.opendev.org/c/openstack/ironic-specs/+/945642
[2]: https://review.opendev.org/c/openstack/ironic-specs/+/946723
[3]: https://specs.openstack.org/openstack/ironic-specs/priorities/2025-2-workite...