Hello everyone,

TL;DR: If one of the batch-scheduled instances fails at the manager level, do we want all the others to fail and be cleaned up, or only the one causing the issue?

Background:
While implementing a fix that deletes Placement allocations for VMs that were never correctly scheduled (https://review.opendev.org/c/openstack/nova/+/968446), a reviewer correctly pointed out that _cleanup_build_artifacts, which in turn calls my cleanup function, is itself called 3 times (all in schedule_and_build_instances):

1. When handling an over-quota result on recheck. This is the case I am interested in, and it happens before the build requests are dispatched to the computes.
2 and 3. During the schedule loop, where an exception triggers _cleanup_build_artifacts after some instances may already have been sent to a compute via build_and_run_instance.

The issue is that in all cases _cleanup_build_artifacts is given the list of ALL scheduled instances, not just the failing one, so when it is called from case 2 or 3, any of its effects may be applied to instances that have already been dispatched.

So, which version is correct? Do we want all instances in a bulk schedule to fail and be cleaned up when at least one of them fails, or do we want to handle the failing one gracefully and proceed with as many of the rest as we can? Version 1 (fail and clean up everything) seems to be the intended course of events, as this was the first use of the cleanup function, which other authors then borrowed in later patches.

The code as it stands is inconsistent with both versions. We don't cancel and clean up all instances after one of them fails, since by that time we have already created the side effect of sending some of them to computes (cases 2 and 3). We are also not consistent with the 'graceful failure' version, as in all 3 cases we pass all instances to the cleanup function which, among other things, sets the ERROR state on them.
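
To make the two candidate behaviours concrete, here is a minimal toy sketch in Python. It is not nova code: Instance, check, cleanup, dispatch and the two strategy functions are hypothetical stand-ins invented for illustration, and cleanup only mimics the "set ERROR on whatever it is given" aspect of _cleanup_build_artifacts.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical stand-in for a nova instance; not nova's real object.
    name: str
    state: str = "BUILDING"
    dispatched: bool = False


def check(instance, failing):
    # Hypothetical per-instance check whose failure would trigger cleanup.
    return instance.name not in failing


def cleanup(instances):
    # Toy stand-in for _cleanup_build_artifacts: marks every instance
    # it receives as ERROR, whether or not that instance failed.
    for inst in instances:
        inst.state = "ERROR"


def dispatch(instance):
    # Toy stand-in for sending the build request to a compute.
    instance.dispatched = True


def all_or_nothing(instances, failing):
    # Version 1: run every check first; dispatch only if all of them pass.
    if any(not check(inst, failing) for inst in instances):
        cleanup(instances)
        return
    for inst in instances:
        dispatch(inst)


def per_instance(instances, failing):
    # Version 2: dispatch one by one; clean up only the failing instance.
    for inst in instances:
        if check(inst, failing):
            dispatch(inst)
        else:
            cleanup([inst])


def summarize(instances):
    # (name, state, dispatched) triples for easy comparison.
    return [(inst.name, inst.state, inst.dispatched) for inst in instances]
```

With three instances of which the second fails its check, all_or_nothing leaves all three in ERROR with none dispatched, while per_instance dispatches the first and third and marks only the second as ERROR. The current code matches neither: it dispatches some instances and then hands the full list to cleanup.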

I definitely plan to create a patch straightening things out, as this is a blocker for the bug fix I am trying to introduce in the cleanup function, but I need to know in which direction to proceed:

- prevent instances from being built until we have confirmed that all of them passed the checks which could trigger the cleanup, OR
- leave the current dispatch flow as is, but only call cleanup on the failing instances.

With or without my patch, I believe the current code is internally inconsistent. Unfortunately, the cleanup function has no docstring, and I couldn't find any comment on which behaviour is the expected one.

Kind regards
Dominik Danelski