[ironic][nova][ptg] Cross-project session around nova-compute startup
Hey all,

A few releases back we swapped up the model for nova-compute HA in Ironic -- mainly, moved it in line with other nova drivers where the loss of a single nova-compute service takes down a portion of the cloud, but HA is achieved by having other portions of the cloud still available.

This model has serious limitations in an Ironic world; including but not limited to:

* Use cases around rebuilding into the same physical machine can never be isolated from an outage (including upgrades)
* Even with a reasonable number of nodes managed, nova-compute processes managing Ironic nodes can take a long, long time to get online. We're talking 15-20 minutes with reports of even worse in some scenarios.

I'd like to improve this story. It's operationally painful to eat a multi-minute long outage for a deployment of code to a server -- and given my use case of in-place rebuilds, there's no amount of aggregation of nodes which can get us clear of this issue.

It's also clear that going back to a model which would run multiple nova-compute processes to manage the same machines is too incompatible with the general nova model; so instead I think we can try to tackle a major pain point: the slow startup.

Ideas I have thought about to address this include:

* Improving how resources are added to placement for Ironic, doing it incrementally from Ironic may reduce the number of calls needed at startup
* Improving placement API interfaces to perhaps allow bulk updates, so we aren't doing "N" calls for "N" ironic nodes at startup, as I believe we currently do
* Some method of "handoff" between a staged nova-compute service and one that is shutting down, so we can avoid the outage in properly coordinated update scenarios. (I'm not sure this is possible)

One thing that's clear to me: I don't have the knowledge to pursue this alone.
My hope is that we can work together at the PTG to determine one or more good potential directions, then put together a spec during the Hibiscus cycle which could be implemented for "I" (not Icehouse :D). As long as we have a commitment from nova cores to perform reviews and help with the design, I should be able to provide resources (myself + maybe others) from the GR-OSS team to help implement it.

Thanks!
Jay Faulkner
Open Source Developer
G-Research Open Source Software