[dev][nova] Suspected inconsistency in cleanup of failed batch schedule
Hello everyone,

TL;DR If one of the batch-scheduled instances fails at the manager level, do we want all others to fail and be cleaned, or only the one causing the issue?

Background: When implementing a fix that would delete Placement allocations for VMs that were never correctly scheduled (https://review.opendev.org/c/openstack/nova/+/968446) a reviewer correctly recognised that _cleanup_build_artifacts, which in turn calls my cleanup function, is itself called 3 times (all in schedule_and_build_instances):

1. When dealing with going over quota on recheck – that's my interest, and it happens before dispatching the build requests to the computes.
2 and 3. During the schedule loop, where an exception causes _cleanup_build_artifacts to run after some instances may already have been sent to be built on a compute using build_and_run_instance.

The issue comes from the fact that in all cases _cleanup_build_artifacts is provided with the list of ALL scheduled instances, not just the failing one, so any of its effects may be performed on already scheduled instances if it was called from case 2 or 3.

Then, which version is correct? Do we want all instances to fail and be cleaned in a bulk schedule when at least one of them failed, or do we want to handle the failing one gracefully and proceed with as many of the rest as we can? Version 1 seems to be the intended course of events, as this was the first use of the cleanup function, which was later borrowed by other authors in subsequent patches.

The code as it is is inconsistent with both versions: we don't cancel and clean instances after one of them fails, since by that time we have already created a side effect by sending them to computes (cases 2 and 3). We are also not consistent with the 'graceful failure' case, as in all 3 cases we send all instances to the cleanup function, which, among other things, sets the ERROR state on them.
I definitely plan to create a patch straightening things out, as this is the blocker for the bug fix I am trying to introduce to the cleanup function, but I need to know in which direction to proceed: prevent instances from being built until we have confirmed that all of them passed the checks which could trigger the cleanup, OR leave the current dispatch flow as is, but only call cleanup on the failing instances. With or without my patch, I believe the current code is internally inconsistent. Unfortunately, the cleanup function has no docstring and I couldn't find any comment on which behaviour is the expected one.

Kind regards
Dominik Danelski
On 26/02/2026 13:06, Dominik Danelski wrote:
> Hello everyone,
> TL;DR If one of the batch-scheduled instances fails at the manager level, do we want all others to fail and be cleaned, or only the one causing the issue?

For errors that happen after we have dispatched the server to the computes, I would expect this behaviour to be controlled by the min and max instance counts in the request. If you have min=2 and max=4 and 2 succeeded, then regardless of whether the other two failed, we should not delete the 2 that succeeded. If we do not reach the min number requested, the overall request has failed and it is more correct to clean up the full set.

> Background: When implementing a fix that would delete Placement allocations for VMs that were never correctly scheduled (https://review.opendev.org/c/openstack/nova/+/968446) a reviewer correctly recognised that _cleanup_build_artifacts, which in turn calls my cleanup function, is itself called 3 times (all in schedule_and_build_instances):
> 1. When dealing with going over quota on recheck – that's my interest, and it happens before dispatching the build requests to the computes.
> 2 and 3. During the schedule loop, where an exception causes _cleanup_build_artifacts to run after some instances may already have been sent to be built on a compute using build_and_run_instance.
>
> The issue comes from the fact that in all cases _cleanup_build_artifacts is provided with the list of ALL scheduled instances, not just the failing one, so any of its effects may be performed on already scheduled instances if it was called from case 2 or 3.
>
> Then, which version is correct? Do we want all instances to fail and be cleaned in a bulk schedule when at least one of them failed, or do we want to handle the failing one gracefully and proceed with as many of the rest as we can? Version 1 seems to be the intended course of events, as this was the first use of the cleanup function, which was later borrowed by other authors in subsequent patches.
>
> The code as it is is inconsistent with both versions: we don't cancel and clean instances after one of them fails, since by that time we have already created a side effect by sending them to computes (cases 2 and 3). We are also not consistent with the 'graceful failure' case, as in all 3 cases we send all instances to the cleanup function, which, among other things, sets the ERROR state on them.
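The inconsistency being discussed can be illustrated with a heavily simplified, hypothetical model of the dispatch loop (names are loosely inspired by schedule_and_build_instances, but this is NOT the actual nova code):

```python
def check_before_dispatch(inst):
    # Stand-in for the per-instance checks (cases 2 and 3) that can raise
    # mid-loop in the real schedule loop.
    if inst.get("bad"):
        raise RuntimeError("scheduling check failed for %s" % inst["name"])

def schedule_and_build(instances, dispatch, cleanup):
    """Simplified model: dispatch instances one by one; on failure, run
    cleanup with the FULL instance list, as the current code does."""
    dispatched = []
    for inst in instances:
        try:
            check_before_dispatch(inst)
        except Exception:
            # Current behaviour under discussion: cleanup receives ALL
            # instances, including those already sent to a compute.
            cleanup(instances)
            raise
        dispatch(inst)
        dispatched.append(inst)
    return dispatched
```

Running this with a batch of three instances where the second one fails shows that the first, already-dispatched instance is still handed to the cleanup callback, which is the side effect the thread is about.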
> I definitely plan to create a patch straightening things out,

Changes to API behaviour generally require a spec to define the semantics and a new microversion to opt into them.
Looking at each place: https://github.com/openstack/nova/blob/889e3d83f68110b55ff082e04db0c4e4f84ff... we appear to clean up all instances if we fail the quota check before we call the computes to build the instances. If we are over quota before we call the compute to create any instances, it looks correct to clean up all the instances.

In the second case, at that point we are in the loop over requested instances; we still have not called the compute for the current instance, but we may be on the 2nd or a subsequent iteration of the loop: https://github.com/openstack/nova/blob/889e3d83f68110b55ff082e04db0c4e4f84ff...

The third place is also in the loop body: https://github.com/openstack/nova/blob/889e3d83f68110b55ff082e04db0c4e4f84ff...

For 2 and 3, we re-raise the exception that triggered the failure regardless of what list of instances we pass. A possible way to modify this logic would be to construct a set of pending_instances before the loop and remove the dispatched instances from pending_instances here, at the end of the loop: https://github.com/openstack/nova/blob/889e3d83f68110b55ff082e04db0c4e4f84ff... then modify 2 and 3 to pass pending_instances (which will contain the current instance that failed) to _cleanup_build_artifacts. This, however, does change the semantics of the API on failure.

It's actually more complicated than that, since _cleanup_build_artifacts takes not just a list of instances but also build_requests and request_specs, and internally we zip those together, so we would have to filter those as well, since they are correlated on the index and need to remain the same length.

There are limited exceptions to this, but we need to effectively maintain the existing cleanup behaviour: https://docs.openstack.org/nova/latest/contributor/microversions.html#when-d... The API is not returning a 500 today, and we didn't silently fail to do what we were asked.
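The "pending set" idea described above could be sketched roughly as follows (hypothetical names, NOT an actual nova patch), including the caveat that the three index-correlated lists must be filtered together:

```python
def check_before_dispatch(inst):
    # Stand-in for the mid-loop checks (cases 2 and 3) that can raise.
    if inst.get("bad"):
        raise RuntimeError("check failed for %s" % inst["name"])

def build_loop(instances, build_requests, request_specs, dispatch, cleanup):
    """Sketch: instances, build_requests and request_specs are correlated
    by index, so they are filtered together to stay the same length."""
    pending = set(range(len(instances)))  # indices not yet dispatched
    for i, inst in enumerate(instances):
        try:
            check_before_dispatch(inst)
        except Exception:
            # Clean only instances never sent to a compute (including the
            # one that just failed), keeping the correlated lists aligned.
            idx = sorted(pending)
            cleanup([instances[j] for j in idx],
                    [build_requests[j] for j in idx],
                    [request_specs[j] for j in idx])
            raise
        dispatch(inst)
        pending.discard(i)  # dispatched at the end of the iteration
```

With three instances where the second fails, the cleanup callback now sees only the failed instance and the never-dispatched one, with the matching build requests and request specs, while the already-dispatched first instance is left alone.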
My inclination is to say this is borderline, but changing the cleanup semantics is likely not something that should be done as a bugfix. At least I don't think it's really in the scope of https://bugs.launchpad.net/nova/+bug/2132020. The original bug you described there is that allocations for VMs that failed to spawn due to quota issues were not freed. If the instance goes to ERROR and has an empty instance.host, it should not have any allocation in Placement. Expanding the scope to modify multi-create cleanup on partial failure is a very big extension of the scope of that initial bug. _cleanup_build_artifacts is written today to clean up all the instances in the build request; if we want to make it do a partial cleanup, that should be a separate change, landed before the bug you originally wanted to fix.
> as it is the blocker for the bug fix I try to introduce to the cleanup function, but I need to know in what direction to proceed: prevent instances from being built until we have confirmed that all of them passed the checks which could trigger the cleanup OR leave the current dispatch flow as is, but only call cleanup on failing instances.
"Leave the current dispatch flow as is, but only call cleanup on failing instances" would probably be the more correct approach if we do modify this.
> With or without my patch, I believe the current code is internally inconsistent. Unfortunately, the cleanup function does not have its docstring and I couldn't find any comment on which behaviour is the expected one.
As I said initially, if we have min=2, max=4 and 2 boot successfully, I would expect there to be 4 nova server rows in the DB: 2 related to active VMs and 2 in ERROR, either in cell0 or a real cell, again depending on where the failure happened. If only 1 VM booted and all the rest had errors on the compute, I would expect all of them to be cleaned up.

The documentation for multi-create alludes to the point you are raising:

```
There is a second kind of create call which can build multiple servers at once. This supports all the same parameters as create with a few additional attributes specific to multiple create.

Error handling for multiple create is not as consistent as for single server create, and there is no guarantee that all the servers will be built. This call should generally be avoided in favor of clients doing direct individual server creates.
```

https://docs.openstack.org/api-ref/compute/#create-multiple-servers

min defaults to 1 if not provided, and max defaults to min if not provided, so this API degrades to the standard single-instance flow. Failures that happen in the conductor today, however, all seem to clean up all the VMs, so whether that is the intuitive behaviour or not, that is the existing behaviour.
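The min/max rule described above could be made concrete with a small decision helper (hypothetical, just to express the semantics; not nova code):

```python
def instances_to_clean(succeeded, failed, min_count=1):
    """Decide which instances to clean up after a multi-create attempt.

    Expresses the rule discussed in this thread: if at least min_count
    instances booted, keep the successes and clean only the failures;
    if we fell below min_count, the whole request failed and the full
    set is cleaned. min_count defaults to 1, matching the API default,
    so a single-instance request degrades to the usual behaviour.
    """
    if len(succeeded) >= min_count:
        return list(failed)                    # partial success: keep what booted
    return list(succeeded) + list(failed)      # below min: clean the full set
```

For example, with min=2 and two successful boots, only the failed instances would be cleaned; with only one success, everything would be.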
> Kind regards
> Dominik Danelski
On 2/26/26 22:14, Sean Mooney wrote:
> "leave the current dispatch flow as is, but only call cleanup on failing instances." would probably be the more correct approach if we do modify this.
Thank you for the detailed response. Considering all you wrote here, do you think that moving the call to delete_allocation_for_instance() out of the general _cleanup_build_artifacts() and directly into the exception handling for going over quota in schedule_and_build_instances() could be the right call? This wouldn't complicate that function by much (only one call to delete_allocation_for_instance plus exception handling, 6 lines in total), but would already fix the error without having to bump the API version.

This would solve the problem of orphaned allocations for the most common of the 3 cases, the only one caused by user action rather than backend failure (fill_provider_mapping and ARQ binding failures). At the same time, a note explaining this logic and the potential improvement (leaving dispatched instances out of cleanup, as you said) could be left for the time when a bigger rework warrants a spec and maybe an API bump, for example if cleanup were to be expanded even further.

Dominik
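The narrower fix proposed here might look roughly like the following sketch (hypothetical structure, NOT the actual nova diff; the check_quota, delete_allocation and cleanup callables stand in for the real quota recheck, the Placement report client call, and _cleanup_build_artifacts):

```python
class OverQuota(Exception):
    """Stand-in for the over-quota exception raised on recheck."""
    pass

def handle_quota_recheck(instances, check_quota, delete_allocation, cleanup):
    """Sketch: on the over-quota recheck (case 1, before any compute has
    been contacted), free each instance's Placement allocation directly
    in the exception handler, then fall through to the existing bulk
    cleanup unchanged."""
    try:
        check_quota(instances)
    except OverQuota:
        for inst in instances:
            try:
                # Free the allocation a never-dispatched instance holds.
                delete_allocation(inst)
            except Exception:
                pass  # best effort: the bulk cleanup must still run
        cleanup(instances)  # existing cleanup path, left as-is
        raise
```

Since this branch runs before any build request is dispatched, cleaning all instances here stays consistent with the existing semantics, which is why it avoids the API-behaviour questions raised for cases 2 and 3.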
participants (2)
- Dominik Danelski
- Sean Mooney