On 2/26/26 22:14, Sean Mooney wrote:
On 26/02/2026 13:06, Dominik Danelski wrote: changes to api behavior generally require a spec to define the semantics and a new micorverions to opt into one. there are limited exceptions to this but we need to effectively maintain the existing cleanup behavior. https://docs.openstack.org/nova/latest/contributor/microversions.html#when-d...
the api is not retrunign a 500 today and we didnt fail silently to do what were were asked.
my inclination is to say this is borderline but changing the clean up semantic is likely not something that should be done as a bugfix. at least i dont think its really in the scope of https://bugs.launchpad.net/nova/+bug/2132020 the originally bug you described there is that allocation for vms that failed to spwan due to quota issues were not freeded. i the instance goes to error and has an empty instance.host it should not have any allocation in placement.
expanding the scope to modify multi create cleanup on partial failure is a very big extention to the scope of that initial bug. _cleanup_build_artifacts is written today to clean up all the instance in teh build request. if we want to make it do a partial cleanup that should be a sperate change first before the bug you wnated to orgianlly fix.
as it is the blocker for the bug fix I try to introduce to the cleanup function, but I need to know in what direction to proceed: prevent instances from being built until we have confirmed that all of them passed the checks which could trigger the cleanup OR leave the current dispatch flow as is, but only call cleanup on failing instances.
"leave the current dispatch flow as is, but only call cleanup on failing instances." would probably be the more correct apporch if we do modify this.
Thank you for the detailed response. Considering all you wrote here, do you think that moving the call to delete_allocation_for_instance() out of the general _cleanup_build_artifacts() and moving it directly into the exception handling for going over quota in schedule_and_build_instances() could be the right call? This wouldn't complicate this function by a lot (only one call to delete_allocation_for_instance + exception handling, 6 lines total), but would already fix the error without having to bump the API version. This would solve the problem of orphaned allocations for the most common case of the 3, the only one caused by user action rather than backend failure (fill_provider_mapping and ARQ binding failures). At the same time a note explaining this logic and the potential improvement (leaving out dispatched instances out of cleanup, as you said) could be left for the time that a bigger rework warranting a spec and maybe an API bump, for example if cleanup was to be expanded even further. Dominik