On Wed, 2024-03-20 at 13:06 +0100, Tobias Urdin wrote:
Hello,
This sounds familiar.
If no availability zone was selected when the instance was spawned, the “request spec” (saved in the database) does not contain an availability zone, and the scheduler will allow that instance to be scheduled to another availability zone because the original request did not include a specific one.
Correct. Live and cold migration are fully supported between availability zones, provided the operator exchanged ssh keys across all nodes when installing nova and has not placed a firewall or similar between them. As you said, if an instance did not request an AZ when created, and one was not added by the scheduler or a volume (with cross_az_attach=false), then the request_spec will not have an AZ. Scheduling by design does not consider the AZ the instance is currently in, only the one in the request spec. Cross-AZ migration is a core feature of nova, not a bug, and is expected to work by default in any deployment unless the operator has taken measures to prevent it. AZs in OpenStack are not fault domains and are not comparable to AWS availability zones; an AWS availability zone is closer to a keystone region than it is to a nova AZ.
If you search for “request spec” on the mailing list you’ll see that there have been multiple threads about that with a lot of details that will help you out.
In this cycle we added the ability to view the pinned AZ from the request spec to make understanding this easier. Going forward, if you use the latest microversion, 2.96, instance list and instance show will contain an additional field detailing the requested AZ if one is set in the request spec. https://docs.openstack.org/nova/latest/reference/api-microversion-history.ht...
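For example, with a client that supports that microversion, something along these lines should include the pinned AZ field for a given instance (the instance uuid is of course a placeholder):

  openstack --os-compute-api-version 2.96 server show <instance-uuid>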
When we “migrated” to using availability zones we specifically populated this data in the database (note that it’s an unsupported change so be careful).
Yes it is, but if done correctly it should not directly break anything. It may be unexpected from a user point of view, and they can now use shelve to do a cross-AZ unshelve, so they still have a way to force the AZ to change if they need to. The main danger in doing that is that this is stored in a JSON blob in the db; it's easy to mess up the formatting of that blob and leave nova unable to read it. If you do do this (and I'm not encouraging people to do this), then if your MySQL or PostgreSQL is new enough they now have functions for working with JSON, and those can be safer for updating the blob in the db than was previously possible. Just be sure to take a db backup before making changes like this if you do try.
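Purely as an illustration of the kind of JSON-function update meant here (not a recommendation), against the nova_api database that would look roughly like the statement below; the exact layout of the blob can vary by release, so verify the JSON path against a real row in your deployment first, and take that backup:

  -- illustrative only: pin one instance's request spec to az1
  UPDATE request_specs
  SET spec = JSON_SET(spec, '$."nova_object.data".availability_zone', 'az1')
  WHERE instance_uuid = '<instance-uuid>';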
Best regards Tobias
On 20 Mar 2024, at 12:51, Marc Vorwerk <marc+openstack@marc-vorwerk.de> wrote:
Dear OpenStack Community,
I am reaching out for support with an issue that is specifically affecting a single instance during migration or resize operations in our OpenStack environment. I want to emphasize that this problem is isolated and does not reflect a broader issue within our cloud setup.
The issue arises when attempting a resize of the instance's flavor, which only differs in RAM+CPU specification. Unexpectedly, the instance attempts to switch its availability zone from az1 to az2, which is not the intended behavior.
The instance entered an error state during the resize or migration process, with a fault message indicating 'ImageNotFound', because after the availability zone change the volume can't be reached. We use separate ceph clusters per AZ.
If this was a cinder volume, then that indicates you have not correctly configured your cluster. By default nova expects that all cinder backends are accessible by all hosts; incidentally, nova also expects the same to be true of all neutron networks by default. Where that is not the case for cinder volumes you need to set [cinder]cross_az_attach=false https://docs.openstack.org/nova/latest/configuration/config.html#cinder.cros...
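That is just the standard nova.conf option from the link above, i.e. something like:

  [cinder]
  # only attach volumes from the same AZ as the instance
  cross_az_attach = false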
As noted above, this is a feature, not a bug. The ability for an unpinned instance to change AZ at scheduling time is intended behaviour that is often not expected by people coming from AWS but is expected by long-time OpenStack users and operators. cross_az_attach defaults to true because there is no expectation in general that nova availability zones align in any way with cinder availability zones. If you choose to make them align then you can use that option to enforce affinity, but that is not expected to be the case in general. There is no AZ-affinity config option for neutron, as again neutron networks are expected to span all hosts. If you use the l3 routed network feature in neutron you can create an affinity between l3 segments and hosts via physnets, however that has no relationship to AZs. AZs in nova, cinder and neutron are not modeling the same thing, and while they can align they are not required to, as I said above.

For images_type=rbd there is no native scheduling support to prevent you moving between backends. We have discussed ways to do that in the past but never implemented it. If you want to prevent instances changing ceph cluster via the scheduler today when using images_type=rbd you have 2 options:

1.) You can model each ceph cluster as a separate nova cell. By default we do not allow instances to change cell, so if you align your ceph clusters to cell boundaries then instances will never be scheduled to a host connected to a different ceph cluster.

2.) The other option is to manually configure the scheduler to enforce this. There are several ways to do this via a scheduler filter and host-aggregate metadata to map a flavor/image/tenant to a host aggregate. Alternatively you can use the required-traits functionality of placement via the isolating aggregates feature https://docs.openstack.org/nova/latest/reference/isolate-aggregates.html Effectively you can advertise CUSTOM_CEPH_CLUSTER_1 or CUSTOM_CEPH_CLUSTER_2 on the relevant hosts via provider.yaml https://docs.openstack.org/nova/latest/admin/managing-resource-providers.htm... then create a host aggregate per set of hosts to enforce the required custom trait and modify your flavors/images to request the relevant trait (rough sketches of both options are at the end of this reply).

Unfortunately both approaches are hard to implement for existing deployments; this is really something best planned for and executed when you are commissioning a cloud for the first time. What I have wanted to do for a long time, but have not had the time to propose or implement, is have nova model the ceph cluster and storage backend in use in placement so we can automatically schedule on it. I more or less know what would be required to do that, but while this is an occasional pain point for operators it is a long-understood limitation and not one that has been prioritised. If this is something that people are interested in seeing addressed for images_type=rbd specifically, then feedback from operators that they care about this would be appreciated, but I cannot commit to addressing it in the short to medium term. For now my recommendation is: if you are deploying a new cloud with images_type=rbd and you plan to have multiple ceph clusters that are not accessible by all hosts, then you should create one nova cell per ceph cluster. Cells are relatively cheap to create in nova; you can share the same database server/rabbitmq instance between cells if you are not using cells for scaling, and you can change that after the fact if you later find you need to scale.
You can also colocate multiple conductors on the same host for different cells, provided your installation tool can accommodate that. We do that in devstack and it is perfectly fine to do in production. Cells are primarily a scaling/sharding mechanism in nova, but they can be helpful for this use case too. If you do have one cell per ceph cluster you can also create one AZ per cell if you want to allow end users to choose the cluster, but that is optional; AZs can span cells and cells can contain multiple different AZs, the two concepts are entirely unrelated in nova. Cells are an architectural choice for scaling nova to 1000s of compute nodes; AZs are just a label on a host aggregate with no other meaning. Neither is a fault domain, but both are often incorrectly assumed to be. Cells should not be required for this use case to work out of the box, but no one in the community has ever had time to work on the correct long-term solution of modeling the storage backend in placement. That is sad, as that was one of the original primary use cases that placement was created to solve.
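To make the two options mentioned above a bit more concrete, here are rough sketches only; the trait name CUSTOM_CEPH_CLUSTER_1, the aggregate/flavor/cell names, hosts and the rabbit/DB URLs are placeholders to adapt, and the linked isolate-aggregates and managing-resource-providers docs have the exact file locations and options.

Option 2.), isolating aggregates driven by a custom trait advertised via provider.yaml:

  # nova.conf on the scheduler (per the isolate-aggregates doc)
  [scheduler]
  enable_isolated_aggregate_filtering = true

  # provider.yaml on the compute hosts attached to ceph cluster 1
  meta:
    schema_version: '1.0'
  providers:
    - identification:
        uuid: $COMPUTE_NODE
      traits:
        additional:
          - CUSTOM_CEPH_CLUSTER_1

  # host aggregate that requires the trait, and a flavor that requests it
  openstack aggregate create --property trait:CUSTOM_CEPH_CLUSTER_1=required ceph-cluster-1
  openstack aggregate add host ceph-cluster-1 <compute-host>
  openstack flavor set --property trait:CUSTOM_CEPH_CLUSTER_1=required <flavor>

Option 1.), one cell per ceph cluster, is just the normal cell_v2 workflow:

  # create a cell for the hosts attached to the second ceph cluster
  nova-manage cell_v2 create_cell --name cell-ceph2 \
      --transport-url rabbit://user:pass@rabbit-host:5672/cell-ceph2 \
      --database_connection mysql+pymysql://user:pass@db-host/nova_cell_ceph2
  # then map the relevant computes into it
  nova-manage cell_v2 discover_hosts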
To debug this issue we enabled debug logs for the nova-scheduler and found that the scheduler does not filter out any node with any of our enabled filter plugins. As a quick test we disabled a couple of compute nodes and could verify that the ComputeFilter of the nova-scheduler still returns the full list of all nodes in the cluster. So it seems to us that, for some reason, the nova-scheduler just ignores all enabled filter plugins for the mentioned instance. It's worth noting that other instances with the same flavor on the same compute node do not exhibit these issues, highlighting the unique nature of this problem.
Furthermore, we checked all relevant database tables to see if something strange was saved for this instance, but it seems to us that the instance has exactly the same attributes as other instances on this node.
We are seeking insights or suggestions from anyone who might have experienced similar issues or has knowledge of potential causes and solutions. What specific logs or configuration details would be helpful for us to provide to facilitate further diagnosis?
We greatly appreciate any guidance or assistance the community can offer.
Best regards, Marc Vorwerk + Maximilian Stinsky