Hello,
This sounds familiar. If no availability zone was selected when the instance was spawned, the “request spec” (saved in the database) does not have an availability zone set. The scheduler will therefore allow the instance to be scheduled to another availability zone, because the original request did not include a specific one.
If you search for “request spec” on the mailing list, you’ll see that there have been multiple threads about this, with a lot of details that will help you out.
When we “migrated” to using availability zones, we specifically populated this data in the database (note that this is an unsupported change, so be careful).
Best regards
Tobias
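To illustrate the behaviour described in the reply above, here is a minimal Python sketch. It is not Nova's code, only a simplified model of the idea. It assumes the stored request spec is a JSON document whose fields sit under a "nova_object.data" key (please verify this against your own database), and the host names compute1 to compute3 are hypothetical. Only az1 and az2 come from the original report.

```python
import json


def requested_az(spec_json):
    """Return the availability zone stored in a serialized request spec, or None."""
    spec = json.loads(spec_json)
    # Assumption: the fields live under "nova_object.data"; otherwise use the top level.
    data = spec.get("nova_object.data", spec)
    return data.get("availability_zone")


def hosts_allowed(hosts, az):
    """Simplified model of an AZ filter: with no requested AZ, every host passes."""
    if az is None:
        return sorted(hosts)
    return sorted(name for name, host_az in hosts.items() if host_az == az)


# Hypothetical hosts and the availability zone each one belongs to.
HOSTS = {"compute1": "az1", "compute2": "az1", "compute3": "az2"}

# An instance booted without an AZ versus one booted explicitly into az1.
unpinned = json.dumps({"nova_object.data": {"availability_zone": None}})
pinned = json.dumps({"nova_object.data": {"availability_zone": "az1"}})

if __name__ == "__main__":
    print(hosts_allowed(HOSTS, requested_az(unpinned)))
    print(hosts_allowed(HOSTS, requested_az(pinned)))
```

In this model the unpinned instance is allowed onto hosts in both zones, while the pinned one is restricted to az1, which matches a resize that lands in az2 for an instance created without an explicit zone.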
On 20 Mar 2024, at 12:51, Marc Vorwerk <marc+openstack@marc-vorwerk.de> wrote:
Dear OpenStack Community,
I am reaching out for support with an issue that is specifically affecting a single instance during migration or resize operations in our OpenStack environment. I want to emphasize that this problem is isolated and does not reflect a broader issue within our cloud setup.
The issue arises when we attempt to resize the instance to a flavor that differs only in its RAM and CPU specification. Unexpectedly, the instance attempts to switch its availability zone from az1 to az2, which is not the intended behavior.
The instance entered an error state during the resize or migration process, with a fault message indicating 'ImageNotFound', because after the availability zone change the volume can no longer be reached. We use separate Ceph clusters per availability zone.
To debug this issue, we enabled debug logs for the nova-scheduler and found that none of our enabled filter plugins filters out any node. As a quick test, we disabled a couple of compute nodes and could verify that the ComputeFilter of the nova-scheduler still returns the full list of all nodes in the cluster. So it seems to us that, for some reason, the nova-scheduler ignores all enabled filter plugins for this instance. It's worth noting that other instances with the same flavor on the same compute node do not exhibit these issues, which highlights the unique nature of this problem.
Furthermore, we checked all relevant database tables to see whether something unusual is saved for this instance, but it seems to us that the instance has exactly the same attributes as other instances on this node.
We are seeking insights or suggestions from anyone who might have experienced similar issues or has knowledge of potential causes and solutions. What specific logs or configuration details would be helpful for us to provide to facilitate further diagnosis?
We greatly appreciate any guidance or assistance the community can offer.
Best regards, Marc Vorwerk + Maximilian Stinsky