[dev][nova] Suspected inconsistency in cleanup of failed batch schedule
Hello everyone,

TL;DR If one of the batch-scheduled instances fails at the manager level, do we want all others to fail and be cleaned, or only the one causing the issue?

Background: When implementing a fix that would delete Placement allocations for VMs that were never correctly scheduled (https://review.opendev.org/c/openstack/nova/+/968446) a reviewer correctly recognised that _cleanup_build_artifacts, which in turn calls my cleanup function, is itself called 3 times (all in schedule_and_build_instances):

1. When dealing with going over quota on recheck – that's my interest, and it happens before dispatching the build requests to the computes.
2 and 3. During the schedule loop, where an exception causes _cleanup_build_artifacts to run after some instances may already have been sent to be built on a compute using build_and_run_instance.

The issue comes from the fact that in all cases _cleanup_build_artifacts is provided with the list of ALL scheduled instances, not just the failing one, so any of its effects may be performed on already scheduled instances if it was called from case 2 or 3.

Then, which version is correct? Do we want all instances to fail and be cleaned in a bulk schedule when at least one of them failed, or do we want to handle the failing one gracefully and proceed with as many of the rest as we can? Version 1 seems to be the intended course of events, as this was the first use of the cleanup function, which was later borrowed by other authors in subsequent patches.

The code as it is is inconsistent with both versions: we don't cancel and clean instances after one of them fails, since by that time we have already created a side effect by sending them to computes (cases 2 and 3). We are also not consistent with the 'graceful failure' case, as in all 3 cases we send all instances to the cleanup function, which, among other things, sets the ERROR state on them.
I definitely plan to create a patch straightening things out, as this is the blocker for the bug fix I am trying to introduce to the cleanup function, but I need to know in which direction to proceed: prevent instances from being built until we have confirmed that all of them passed the checks which could trigger the cleanup, OR leave the current dispatch flow as is, but only call cleanup on the failing instances. With or without my patch, I believe the current code is internally inconsistent. Unfortunately, the cleanup function has no docstring and I couldn't find any comment on which behaviour is the expected one.

Kind regards
Dominik Danelski
On 26/02/2026 13:06, Dominik Danelski wrote:
> Hello everyone,
> TL;DR If one of the batch-scheduled instances fails at the manager level, do we want all others to fail and be cleaned, or only the one causing the issue?

For errors that happen after we have dispatched the server to the computes, I would expect this behaviour to be controlled by the min and max instance counts in the request. If you have min=2 and max=4 and 2 succeeded, then regardless of whether the other two failed, we should not delete the 2 that succeeded. If we do not reach the min number requested, the overall request has failed and it is more correct to clean up the full set.

> Background: When implementing a fix that would delete Placement allocations for VMs that were never correctly scheduled (https://review.opendev.org/c/openstack/nova/+/968446) a reviewer correctly recognised that _cleanup_build_artifacts, which in turn calls my cleanup function, is itself called 3 times (all in schedule_and_build_instances):
> 1. When dealing with going over quota on recheck – that's my interest, and it happens before dispatching the build requests to the computes.
> 2 and 3. During the schedule loop, where an exception causes _cleanup_build_artifacts to run after some instances may already have been sent to be built on a compute using build_and_run_instance.
>
> The issue comes from the fact that in all cases _cleanup_build_artifacts is provided with the list of ALL scheduled instances, not just the failing one, so any of its effects may be performed on already scheduled instances if it was called from case 2 or 3.
>
> Then, which version is correct? Do we want all instances to fail and be cleaned in a bulk schedule when at least one of them failed, or do we want to handle the failing one gracefully and proceed with as many of the rest as we can? Version 1 seems to be the intended course of events, as this was the first use of the cleanup function, which was later borrowed by other authors in subsequent patches.
>
> The code as it is is inconsistent with both versions: we don't cancel and clean instances after one of them fails, since by that time we have already created a side effect by sending them to computes (cases 2 and 3). We are also not consistent with the 'graceful failure' case, as in all 3 cases we send all instances to the cleanup function, which, among other things, sets the ERROR state on them.
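The inconsistency being discussed can be illustrated with a heavily simplified, hypothetical model of the dispatch loop (names are loosely inspired by schedule_and_build_instances, but this is NOT the actual nova code):

```python
def check_before_dispatch(inst):
    # Stand-in for the per-instance checks (cases 2 and 3) that can raise
    # mid-loop in the real schedule loop.
    if inst.get("bad"):
        raise RuntimeError("scheduling check failed for %s" % inst["name"])

def schedule_and_build(instances, dispatch, cleanup):
    """Simplified model: dispatch instances one by one; on failure, run
    cleanup with the FULL instance list, as the current code does."""
    dispatched = []
    for inst in instances:
        try:
            check_before_dispatch(inst)
        except Exception:
            # Current behaviour under discussion: cleanup receives ALL
            # instances, including those already sent to a compute.
            cleanup(instances)
            raise
        dispatch(inst)
        dispatched.append(inst)
    return dispatched
```

Running this with a batch of three instances where the second one fails shows that the first, already-dispatched instance is still handed to the cleanup callback, which is the side effect the thread is about.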
> I definitely plan to create a patch straightening things out,

Changes to API behaviour generally require a spec to define the semantics and a new microversion to opt into them.
Looking at each place: https://github.com/openstack/nova/blob/889e3d83f68110b55ff082e04db0c4e4f84ff... we appear to clean up all instances if we fail the quota check before we call the computes to build the instances. If we are over quota before we call the compute to create any instances, it looks correct to clean up all the instances.

In the second case, at that point we are in the loop over requested instances; we still have not called the compute for the current instance, but we may be on the 2nd or a subsequent iteration of the loop: https://github.com/openstack/nova/blob/889e3d83f68110b55ff082e04db0c4e4f84ff...

The third place is also in the loop body: https://github.com/openstack/nova/blob/889e3d83f68110b55ff082e04db0c4e4f84ff...

For 2 and 3, we re-raise the exception that triggered the failure regardless of what list of instances we pass. A possible way to modify this logic would be to construct a set of pending_instances before the loop and remove the dispatched instances from pending_instances here, at the end of the loop: https://github.com/openstack/nova/blob/889e3d83f68110b55ff082e04db0c4e4f84ff... then modify 2 and 3 to pass pending_instances (which will contain the current instance that failed) to _cleanup_build_artifacts. This, however, does change the semantics of the API on failure.

It's actually more complicated than that, since _cleanup_build_artifacts takes not just a list of instances but also build_requests and request_specs, and internally we zip those together, so we would have to filter those as well, since they are correlated on the index and need to remain the same length.

There are limited exceptions to this, but we need to effectively maintain the existing cleanup behaviour: https://docs.openstack.org/nova/latest/contributor/microversions.html#when-d... The API is not returning a 500 today, and we didn't silently fail to do what we were asked.
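The "pending set" idea described above could be sketched roughly as follows (hypothetical names, NOT an actual nova patch), including the caveat that the three index-correlated lists must be filtered together:

```python
def check_before_dispatch(inst):
    # Stand-in for the mid-loop checks (cases 2 and 3) that can raise.
    if inst.get("bad"):
        raise RuntimeError("check failed for %s" % inst["name"])

def build_loop(instances, build_requests, request_specs, dispatch, cleanup):
    """Sketch: instances, build_requests and request_specs are correlated
    by index, so they are filtered together to stay the same length."""
    pending = set(range(len(instances)))  # indices not yet dispatched
    for i, inst in enumerate(instances):
        try:
            check_before_dispatch(inst)
        except Exception:
            # Clean only instances never sent to a compute (including the
            # one that just failed), keeping the correlated lists aligned.
            idx = sorted(pending)
            cleanup([instances[j] for j in idx],
                    [build_requests[j] for j in idx],
                    [request_specs[j] for j in idx])
            raise
        dispatch(inst)
        pending.discard(i)  # dispatched at the end of the iteration
```

With three instances where the second fails, the cleanup callback now sees only the failed instance and the never-dispatched one, with the matching build requests and request specs, while the already-dispatched first instance is left alone.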
My inclination is to say this is borderline, but changing the cleanup semantics is likely not something that should be done as a bugfix. At least I don't think it's really in the scope of https://bugs.launchpad.net/nova/+bug/2132020. The original bug you described there is that allocations for VMs that failed to spawn due to quota issues were not freed. If the instance goes to ERROR and has an empty instance.host, it should not have any allocation in Placement. Expanding the scope to modify multi-create cleanup on partial failure is a very big extension of the scope of that initial bug. _cleanup_build_artifacts is written today to clean up all the instances in the build request; if we want to make it do a partial cleanup, that should be a separate change, landed before the bug you originally wanted to fix.
> as it is the blocker for the bug fix I try to introduce to the cleanup function, but I need to know in what direction to proceed: prevent instances from being built until we have confirmed that all of them passed the checks which could trigger the cleanup OR leave the current dispatch flow as is, but only call cleanup on failing instances.
"Leave the current dispatch flow as is, but only call cleanup on failing instances" would probably be the more correct approach if we do modify this.
> With or without my patch, I believe the current code is internally inconsistent. Unfortunately, the cleanup function does not have its docstring and I couldn't find any comment on which behaviour is the expected one.
As I said initially, if we have min=2, max=4 and 2 boot successfully, I would expect there to be 4 nova server rows in the DB: 2 related to active VMs and 2 in ERROR, either in cell0 or a real cell, again depending on where the failure happened. If only 1 VM booted and all the rest had errors on the compute, I would expect all of them to be cleaned up.

The documentation for multi-create alludes to the point you are raising:

```
There is a second kind of create call which can build multiple servers at once. This supports all the same parameters as create with a few additional attributes specific to multiple create.

Error handling for multiple create is not as consistent as for single server create, and there is no guarantee that all the servers will be built. This call should generally be avoided in favor of clients doing direct individual server creates.
```

https://docs.openstack.org/api-ref/compute/#create-multiple-servers

min defaults to 1 if not provided, and max defaults to min if not provided, so this API degrades to the standard single-instance flow. Failures that happen in the conductor today, however, all seem to clean up all the VMs, so whether that is the intuitive behaviour or not, that is the existing behaviour.
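The min/max rule described above could be made concrete with a small decision helper (hypothetical, just to express the semantics; not nova code):

```python
def instances_to_clean(succeeded, failed, min_count=1):
    """Decide which instances to clean up after a multi-create attempt.

    Expresses the rule discussed in this thread: if at least min_count
    instances booted, keep the successes and clean only the failures;
    if we fell below min_count, the whole request failed and the full
    set is cleaned. min_count defaults to 1, matching the API default,
    so a single-instance request degrades to the usual behaviour.
    """
    if len(succeeded) >= min_count:
        return list(failed)                    # partial success: keep what booted
    return list(succeeded) + list(failed)      # below min: clean the full set
```

For example, with min=2 and two successful boots, only the failed instances would be cleaned; with only one success, everything would be.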
> Kind regards
> Dominik Danelski
On 2/26/26 22:14, Sean Mooney wrote:
> "leave the current dispatch flow as is, but only call cleanup on failing instances." would probably be the more correct approach if we do modify this.
Thank you for the detailed response. Considering all you wrote here, do you think that moving the call to delete_allocation_for_instance() out of the general _cleanup_build_artifacts() and directly into the exception handling for going over quota in schedule_and_build_instances() could be the right call? This wouldn't complicate that function by much (only one call to delete_allocation_for_instance plus exception handling, 6 lines in total), but would already fix the error without having to bump the API version.

This would solve the problem of orphaned allocations for the most common of the 3 cases, the only one caused by user action rather than backend failure (fill_provider_mapping and ARQ binding failures). At the same time, a note explaining this logic and the potential improvement (leaving dispatched instances out of cleanup, as you said) could be left for the time when a bigger rework warrants a spec and maybe an API bump, for example if cleanup were to be expanded even further.

Dominik
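The narrower fix proposed here might look roughly like the following sketch (hypothetical structure, NOT the actual nova diff; the check_quota, delete_allocation and cleanup callables stand in for the real quota recheck, the Placement report client call, and _cleanup_build_artifacts):

```python
class OverQuota(Exception):
    """Stand-in for the over-quota exception raised on recheck."""
    pass

def handle_quota_recheck(instances, check_quota, delete_allocation, cleanup):
    """Sketch: on the over-quota recheck (case 1, before any compute has
    been contacted), free each instance's Placement allocation directly
    in the exception handler, then fall through to the existing bulk
    cleanup unchanged."""
    try:
        check_quota(instances)
    except OverQuota:
        for inst in instances:
            try:
                # Free the allocation a never-dispatched instance holds.
                delete_allocation(inst)
            except Exception:
                pass  # best effort: the bulk cleanup must still run
        cleanup(instances)  # existing cleanup path, left as-is
        raise
```

Since this branch runs before any build request is dispatched, cleaning all instances here stays consistent with the existing semantics, which is why it avoids the API-behaviour questions raised for cases 2 and 3.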
participants (2)
- Dominik Danelski
- Sean Mooney