[ironic][nova][ptg] Cross-project session around nova-compute startup

14 Mar 2026

Hey all,

A few releases back we changed the model for nova-compute HA in 
Ironic: we brought it in line with other nova drivers, where the 
loss of a single nova-compute service takes down a portion of the cloud, 
but HA is achieved by having other portions of the cloud still available.

This model has serious limitations in an Ironic world, including but not 
limited to:
* Use cases around rebuilding into the same physical machine can never 
be isolated from an outage (including upgrades)
* Even with a reasonable number of nodes managed, nova-compute processes 
managing Ironic nodes can take a long, long time to get online. We're 
talking 15-20 minutes with reports of even worse in some scenarios.

I'd like to improve this story. It's operationally painful to eat a 
multi-minute outage for a deployment of code to a server -- and 
given my use case of in-place rebuilds, there's no amount of aggregation 
of nodes which can get us clear of this issue. It's also clear that 
going back to a model which would run multiple nova-compute processes to 
manage the same machines is too incompatible with the general nova 
model; so instead I think we can try to tackle a major pain point: the 
slow startup.

Ideas I have thought about to address this include:
* Improving how resources are added to placement for Ironic; doing it 
incrementally from Ironic may reduce the number of calls needed at startup
* Improving placement API interfaces to perhaps allow bulk updates, so 
we aren't doing "N" calls for "N" ironic nodes at startup, as I believe 
we currently do
* Some method of "handoff" between a staged nova-compute service and one 
that is shutting down, so we can avoid the outage in properly 
coordinated update scenarios. (I'm not sure this is possible)

One thing that's clear to me: I don't have the knowledge to pursue this 
alone. My hope is that we can work together at the PTG to determine one 
or more good, potential directions, then put together a spec during the 
Hibiscus cycle which could be implemented for "I" (not Icehouse :D).

As long as we have a commitment from nova cores to perform reviews and 
help with the design, I should be able to provide resources (myself + 
maybe others) from the GR-OSS team to help implement it.

Thanks!
Jay Faulkner
Open Source Developer
G-Research Open Source Software
