Re: [dev][nova] Suspected inconsistency in cleanup of failed batch schedule

27 Feb 2026

      On 2/26/26 22:14, Sean Mooney wrote:
...
On 26/02/2026 13:06, Dominik Danelski wrote:
changes to api behavior generally require a spec to define the 
semantics and
a new micorverions to opt into one.
there are limited exceptions to this but we need to effectively
maintain the existing cleanup behavior.
https://docs.openstack.org/nova/latest/contributor/microversions.html#when-d...
the api is not retrunign a 500 today and we didnt fail silently to do 
what were were asked.
my inclination is to say this is borderline but changing the clean up 
semantic is likely not
something that should be done as a bugfix.
at least i dont think its really in the scope of 
https://bugs.launchpad.net/nova/+bug/2132020
the originally bug you described there is that allocation for vms that 
failed to spwan due to quota issues
were not freeded. i the instance goes to error and has an empty 
instance.host it should not have any
allocation in placement.
expanding the scope to modify multi create cleanup on partial failure 
is a very big extention to the scope
of that initial bug. _cleanup_build_artifacts is written today to 
clean up all the instance in teh build request.
if we want to make it do a partial cleanup that should be a sperate 
change first before the bug you wnated to orgianlly
fix.
...
  as it is
the blocker for the bug fix I try to introduce to the cleanup function,
but I need to know in what direction to proceed: prevent instances from
being built until we have confirmed that all of them passed the checks
which could trigger the cleanup OR leave the current dispatch flow as
is, but only call cleanup on failing instances.
"leave the current dispatch flow as is, but only call cleanup on 
failing instances."
would probably be the more correct apporch if we do modify this.
Thank you for the detailed response.

Considering all you wrote here, do you think that moving the call to 
delete_allocation_for_instance() out of the general 
_cleanup_build_artifacts() and moving it directly into the exception 
handling for going over quota in schedule_and_build_instances() could be 
the right call? This wouldn't complicate this function by a lot (only 
one call to delete_allocation_for_instance + exception handling, 6 lines 
total), but would already fix the error without having to bump the API 
version. This would solve the problem of orphaned allocations for the 
most common case of the 3, the only one caused by user action rather 
than backend failure (fill_provider_mapping and ARQ binding failures).

At the same time a note explaining this logic and the potential 
improvement (leaving out dispatched instances out of cleanup, as you 
said) could be left for the time that a bigger rework warranting a spec 
and maybe an API bump, for example if cleanup was to be expanded even 
further.

Dominik