Greetings everyone! Last week Ironic had a wildly successful PTG! We averaged around fifteen attendees most days, and drew upwards of twenty-two for most of Wednesday while we discussed a number of related topics around networking. Overall, networking aspects seem to be drawing the most current interest. This is partly the result of a change in the market ecosystem, where vendors are less focused on SDN integrations, and partly due to deliberate limitations placed inside of Ironic when the original networking multi-tenancy effort was executed back in 2015-2016, in order to limit scope creep while still meeting the majority of infrastructure operator requirements. Other major topics were Eventlet and some operator feedback on quirks of features, which highlighted some possible bugs and areas for improvement. We also got into some areas which have been a bit contentious in the past, but we reached some reasonable compromises which allowed us to find better paths forward.

Monday:

We largely discussed where we were at and where we were going. To highlight: ironic-lib has now been retired. Redfish based graphical console support merged, and we identified some more work that is likely needed. Bootable container support *and* support for artifacts from an OCI container registry were also viewed as completed during the cycle, which ultimately improves the options available to infrastructure operators who are operating in mixed environments or with mixed requirements outside of the classical "everything is a VM in OpenStack" context. We've improved the linting across the majority of project repositories. Out of band inspection rules likely need more work around ensuring data structures are what we expect, but otherwise we are re-affirming our deprecation of ironic-inspector as a standalone project. Work to support Kea as a DHCP backend has paused for now. The plan exists, but realistically the contributor working in that area has a different focus at the moment, which is okay! Container based IPA steps also didn't make it into the Epoxy cycle release, but are already almost done in Flamingo. Schema validation and OpenAPI work is also still underway, and we broadly expect this to merge during Flamingo. In-band disk encryption as part of the ``direct`` deploy interface didn't make any forward progress due to shifting contributor priorities. As a note, the bootable container work did extend this as a possible option, however it remains unclear if that will solve the overall need for the contributor who proposed adding support for encrypted volumes to the ``direct`` interface. Efforts to find an alternative to TinyCore usage in CI also stalled out. It turns out building a super-low-memory ramdisk is less of a trivial problem than hoped, and other fixes to the build jobs improved general reliability in the meantime, so it is less pressing. We also didn't make any progress on Project Mercury, as it was likely framed to try and keep too many options open for operators, and we also lost contributor velocity to some CVE work this past cycle.

Having discussed what we achieved, we were able to focus on the key aspects contributors know they need to focus on this coming cycle. Networking was the biggest topic in this area, promptly followed by eventlet removal. There was also consensus that we need to ensure we're working together a bit better, while not explicitly blocking any one aspect, as there is power in user/operator choice as well.
Some possible interest was also expressed around extending network device firmware upgrades, further delineating/improving metrics, and possibly supporting more of a "push" firmware update model, which more vendors are adopting.

We then shifted to CI, and discussed challenges and the path forward. Generally, there is some interest in trying to make some of our job executions a bit more selective, and also to dial back our reliance upon integration scenario jobs. Overall, it is a broad area of work which will require further discussion to frame an ideal future state. In this we also discussed CentOS Stream 10, Python versions, and ultimately possible paths forward, at minimum highlighting other options which may enable maintenance activities to ensure these cases keep working. Later in the day, outside of the PTG schedule, we shifted to CI knowledge sharing to help spread overall context among newer members of the team.

Tuesday!

We started Tuesday with the topic of supporting a use case desired by some operators: "allocating" baremetal nodes into a state which, to Ironic, signifies the node has been deployed. The use case behind that is a bit odd, more for research and academic cases where other tools may be desired to meet very specific requirements. The discussion yielded an operator in the scientific space who could benefit, so some ideas and a possible path forward were identified.

Then we got into the world of eventlet, with a mini-retrospective of the challenges and identification of future steps. What we thought would be simple turned out to be the hardest problem. Ironic Python Agent, originally identified as a good candidate for an early eventlet migration, has turned out to be extremely difficult due to its need to spawn its own WSGI server late in the process. Upon further reflection, Ironic itself has this issue as well, because we actually have two places a WSGI server might be spun up: ironic-api (obviously) and the conductor when using json-rpc. It was clear that solving that technical problem is the first step in our migration. We found some examples of gunicorn used in this way; that may be what we look to for an initial prototype. A rough, purely illustrative sketch of the underlying "WSGI server in a plain thread" shape is included at the end of Tuesday's notes below.

We then spoke about node/hardware monitoring patterns that operators are using and interested in. Redfish based hardware supports sending a notification to a web service on a threshold violation, so we discussed adding a listener into Ironic that could receive these and integrate them into the current hardware event notifications. We agreed that a spec would be necessary, so we will craft one and proceed from there. Further improvements around making it easier for operators to separate monitoring/events of the services from the hardware itself were also discussed, but oslo.messaging does not allow for this type of split routing today. We spoke about looking further into ironic-prometheus-exporter to see how it could be configured to potentially support this separation.

To wrap up Tuesday, we drifted into discussion of the serial console support which Ironic has today. In essence, there is quite a bit of room for possible improvement there, and some operators who are not using Nova wouldn't mind seeing this area get attention moving forward. The broad idea is that we might explore creation of a new interface which enables SSH connections to be proxied through to the IPMI Serial-over-LAN capabilities in a BMC. This distinctly requires IPMI, since there really is not a great answer to serial console/interface access with Redfish. Ultimately this may also mean we extend the redfish hardware type to support an IPMI-based console interface as well, which seems weird, but the discussion was in broad agreement that this space has unique challenges where this makes sense. Some initial ideas were entered into a spec document for discussion and refinement, as some of the interested parties are also on the perimeter of the Ironic project. A rough sketch of the kind of per-node shim such a proxy might exec follows below.
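To make the serial console idea above slightly more concrete, here is a minimal, purely illustrative sketch of the kind of per-node shim an SSH-based console proxy could exec as a forced command: it simply replaces itself with an ``ipmitool sol activate`` session against the node's BMC. The environment variable names (``CONSOLE_BMC_*``) and the overall shape are assumptions for illustration only; nothing here is taken from the spec draft, and a real interface would need to handle credential management, ``sol deactivate`` on exit, timeouts, and session lifecycle::

    #!/usr/bin/env python3
    # Illustrative sketch only: a per-node "forced command" an SSH console
    # proxy could exec to attach the caller to a BMC Serial-over-LAN session.
    # BMC address/credentials come from hypothetical environment variables;
    # a real implementation would pull them from the node's driver_info and
    # manage session setup/teardown properly.
    import os
    import sys


    def main():
        bmc_addr = os.environ.get("CONSOLE_BMC_ADDRESS")
        bmc_user = os.environ.get("CONSOLE_BMC_USERNAME")
        bmc_pass_file = os.environ.get("CONSOLE_BMC_PASSWORD_FILE")
        if not all([bmc_addr, bmc_user, bmc_pass_file]):
            sys.exit("console shim: missing BMC connection details")

        # Replace this process with an interactive SOL session; the SSH
        # server's allocated pty carries the console traffic to the user.
        os.execvp("ipmitool", [
            "ipmitool", "-I", "lanplus",
            "-H", bmc_addr,
            "-U", bmc_user,
            "-f", bmc_pass_file,
            "sol", "activate",
        ])


    if __name__ == "__main__":
        main()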
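Circling back to the eventlet/WSGI point from earlier in Tuesday's notes: purely to illustrate the shape of the problem (spawning a WSGI server late in an already-running process without eventlet), below is a small, self-contained sketch using only the standard library's ``wsgiref`` and ``threading``. This is not the approach Ironic has settled on, and a gunicorn-based prototype would look quite different; it simply shows that the building blocks exist outside of green threads::

    # Illustrative only: starting a WSGI server late in a long-running
    # process without eventlet, using a plain OS thread and the stdlib.
    import threading
    import time
    from wsgiref.simple_server import make_server


    def app(environ, start_response):
        # Trivial WSGI application standing in for a JSON-RPC/REST endpoint.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"alive\n"]


    def start_wsgi_in_thread(host="127.0.0.1", port=8089):
        # make_server uses blocking sockets; no green concurrency involved.
        server = make_server(host, port, app)
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        return server, thread


    if __name__ == "__main__":
        server, _ = start_wsgi_in_thread()
        print("WSGI server listening on http://127.0.0.1:8089/")
        # The "main" part of the process keeps doing its own work here;
        # server.shutdown() stops the listener cleanly when we are done.
        time.sleep(5)
        server.shutdown()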
Wednesday!

Wednesday was our deep dive into networking! Our very first topic was focused on improving the capabilities around the intentional limitations which were encoded in Ironic's early multi-tenant networking work. The early work was mainly focused on ensuring tenant level isolation onto separate L2 domains, such as VLANs. In large part, this boils down to the mapping, scheduling, and ultimately the dynamic assembly of a port group (bonded interfaces). We came into this session with a specification document from a discussion which occurred a few weeks prior; there was general consensus on this approach, and discussion highlighted a few other intertwined aspects which need to be accounted for.

We then dove into ACL support for Networking Generic Switch. Overall, consensus was reached. This discussion also highlighted a number of aspects we likely need to keep in mind, which may best be described as an overall need for additional documentation. There are also some further networking discussions which may need to be revisited. Ultimately, this is very much an "integrated" use case which was only appealing to about half of the attendees, which is typical given the variety of use cases Ironic solves and is able to support.

Then we reached the topic of standalone switch management. This quickly became a retrospective into why we didn't make progress with Mercury, and then drove into the real minimal requirements which would need to be taken into account. Some of the discussion revolved around questions which were semi-answered in the Mercury plan itself: trying to provide enough overlap to be generally usable in more than one use model/case while also avoiding fragmentation. This discussion also crossed over into DPUs and how to support them, in large part because the overall model is very different. Some discussion also shifted toward possibly merging some of the ideas and tools together. Ultimately this topic requires much more discussion, and to make solid progress on networking we need to enable ourselves to move in two directions at once while not blocking each other. With this in mind, we're forming two sub-teams which will focus on each area with their own advocates/champions, the goal being to report back to the community each week in a quick-update style of engagement.

Thursday!

Thursday marked our final day of PTG for Ironic. We started by wrapping up some of the networking discussions, in large part focused on how to make progress. We used this time to identify our high level organizational plans, determine broad interest levels, and ultimately plan outreach. One key aspect which was highlighted is that the initial primary focus for some will likely be making progress on eventlet before shifting gears into networking.

We then shifted gears to discussing interest in, and requirements around, mapping storage volumes to hosts via DPU devices, which somewhat crossed over into the networking topics. The broad idea was presented along with a potential need to frame compound drivers around facilitating broadly different configuration actions.
For example, invoking commands directly via SSH, or updating a CRD in a distinct OpenShift cluster. Overall, the discussion yielded that invoking a CRD update was maybe not worthwhile, given the flux those areas can experience along with competing requirements. Ultimately this is an area in early discussion and would require investment in time and hardware to move forward, but there were really no objections to doing so if we could model and extend what already exists in a useful and applicable way.

We then drifted back into networking with our next topic: bridging distinctly different types of fabric together. This was raised much more as a question, to see if there was an interest, need, and/or requirement. In essence, this is similar to the original l2gw project. A huge highlight of embracing or extending toward a more ideal model is that, if this existed, it might not be necessary for VXLAN to be considered on the physical side.

Discussions then shifted to improving servicing interactions. Servicing functionality was added to Ironic during one of the recent development cycles to enable firmware upgrades to take place on deployed nodes. In discussion, it was highlighted that in the current model this could take five or more reboots to complete, as the current mechanism is largely modeled on the use of agent heartbeats. The discussion shifted to how we could avoid this, and the obvious answer was "add a periodic". Then concerns over adding more periodic queries and jobs shifted the discussion into how we can address that challenge. Ultimately, this may end up with us in a better place for tasks which need to follow up on state changes in hardware and revisit the state before proceeding with the next step to execute.

As the final topic, we dove into deploy steps. This really boiled down to "how do I know what will occur" when I ask Ironic to do something, along with "how do I know what occurred". Consensus was reached that the key aspect to highlight back to the user is "what was done" or "what do we expect to be done" in terms of steps. An idea was raised to record this into node history, which is a historical record of actions/errors for a node that has existed in Ironic for a number of cycles now, and is most useful if you're trying to figure out prior events without going to the logs. Consensus was that doing it this way would be a "quick win".

At some point, Ironic contributors also sat down outside of the PTG schedule for a review jam, to review/discuss and merge existing items sitting awaiting reviews.

Overall, the week was extremely productive! Thanks to everyone who took part. Additional thanks to everyone who collaborated on this summary, and everyone who helped get our project priorities published for this work cycle [3]. Please remember to follow up if you have action items.

And onward to Flamingo!

- The Ironic Team

[0]: https://etherpad.opendev.org/p/ironic-ptg-april-2025
[1]: https://review.opendev.org/c/openstack/ironic-specs/+/945642
[2]: https://review.opendev.org/c/openstack/ironic-specs/+/946723
[3]: https://specs.openstack.org/openstack/ironic-specs/priorities/2025-2-workite...