On Wed, 2024-03-20 at 16:01 +0100, Marc Vorwerk wrote:
Greetings,
Thanks for the answers.
We are aware of the limitation of Ceph clusters per availability zone and configured the cloud at bootstrap to accommodate it. In our setup we created one Ceph cluster per availability zone and have Nova's and Cinder's AZs aligned. We also defined a default availability zone in the Nova configuration file. Then we have a host aggregate for each availability zone which is "tagged" with the availability_zone metadata pointing to the correct zone. Nova is also not allowed to cross-attach volumes between AZs, via cross_az_attach = False in our setup.
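Concretely, our setup is along these lines (aggregate and host names are simplified for illustration; the default AZ is set via default_availability_zone in our case):

    # nova.conf excerpt
    [DEFAULT]
    default_availability_zone = az1

    [cinder]
    cross_az_attach = False

    # one host aggregate per AZ; --zone sets the availability_zone metadata
    openstack aggregate create --zone az1 agg-az1
    openstack aggregate add host agg-az1 compute-01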
The problem still remains; it's a single instance which experiences this behaviour. The instance has all necessary metadata for the AZ that it is currently in (OS-EXT-AZ:availability_zone az1). We also checked the request_specs table of nova_api and there the instance is located in the correct AZ: '.["nova_object.data"]["availability_zone"]' -> "az1" for that specific instance_uuid. I would like to focus on the fact that even when we disable a compute node in the cluster, the ComputeFilter of the scheduler does not filter out any host for that instance. Here is an example log of the scheduler with its filtering for this instance:

Mar 15, 2024 @ 11:22:17.000 Starting with 146 host(s)
Mar 15, 2024 @ 11:22:17.000 Filter PciPassthroughFilter returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000 Filter ServerGroupAffinityFilter returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000 Filter ServerGroupAntiAffinityFilter returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000 Filter ImagePropertiesFilter returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000 Filter ComputeCapabilitiesFilter returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000 Filter ComputeFilter returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000 Filter AvailabilityZoneFilter returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000 Filter AggregateMultiTenancyIsolation returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000 Filter AggregateInstanceExtraSpecsFilter returned 146 host(s)
Not a single filter removed any host for that instance, even though compute nodes were disabled for the test.
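For reference, the request spec check mentioned above was along these lines (instance UUID redacted; adjust table/column names to your own schema):

    mysql --silent --raw --skip-column-names -e \
      "SELECT spec FROM nova_api.request_specs WHERE instance_uuid = '<instance_uuid>'" \
      | jq '.["nova_object.data"]["availability_zone"]'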
Do you only have 146 compute nodes? Placement can filter out disabled compute nodes and do AZ enforcement before you get to the filters in the scheduler. Do you have https://docs.openstack.org/nova/yoga/configuration/config.html#scheduler.que... (query_placement_for_availability_zone) set? With regards to the compute status, in Yoga the compute status prefilter, which checks for the disabled state, is unconditionally enabled:

    @trace_request_filter
    def compute_status_filter(ctxt, request_spec):
        """Pre-filter compute node resource providers using COMPUTE_STATUS_DISABLED

        The ComputeFilter filters out hosts for compute services that are
        disabled. Compute node resource providers managed by a disabled
        compute service should have the COMPUTE_STATUS_DISABLED trait set
        and be excluded by this mandatory pre-filter.
        """
        trait_name = os_traits.COMPUTE_STATUS_DISABLED
        request_spec.root_forbidden.add(trait_name)
        LOG.debug('compute_status_filter request filter added forbidden '
                  'trait %s', trait_name)
        return True

https://github.com/openstack/nova/blob/unmaintained/yoga/nova/scheduler/requ...

The ComputeFilter in Yoga and later is only used to eliminate hosts whose compute service is marked as down, not for filtering hosts based on whether their status is disabled. So if you are disabling the host, it is expected that the ComputeFilter does not remove it, as placement will do that before we get to any filter. You could use the force-down API to see the ComputeFilter remove it, or you could stop the nova-compute process on the host and wait for the health check to time out.

If query_placement_for_availability_zone is set to true, which is the default in Yoga, you do not need the AvailabilityZoneFilter either, as that is entirely handled by placement. The AvailabilityZoneFilter is only required in Yoga if query_placement_for_availability_zone=false.
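If you want to confirm that placement is doing that filtering, something along these lines should show it (host name and resource provider UUID are placeholders; the resource provider commands need the osc-placement plugin):

    # after disabling the service, the compute node's resource provider
    # should gain the COMPUTE_STATUS_DISABLED trait in placement
    openstack compute service set --disable compute-01 nova-compute
    openstack resource provider trait list <rp_uuid> | grep COMPUTE_STATUS_DISABLED

    # to see the ComputeFilter itself drop the host, force the service down
    # instead (needs --os-compute-api-version 2.11 or later)
    openstack compute service set --down compute-01 nova-compute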
So we are extremely confused about what is going on with that instance, which no other instance is experiencing. I would also like to emphasize that the cloud has been running with this exact setup (a single Ceph cluster per AZ) since Pike, we are currently on Yoga, and we never saw this behaviour before.
The compute status placement request filter was introduced in Train and was never something that you could disable: https://github.com/openstack/nova/commit/168d34c8d1161dc4d62493e194819297e07... We deprecated the AZ filter and made the placement version the default in Xena: https://github.com/openstack/nova/commit/7c7a2a142d74a7deeda2a79baf21b689fe3... So you should have seen both changes prior to moving to Yoga, unless you moved directly there from Pike or had disabled it in Xena. Both features had upgrade release notes advertising the change. I'll also note that the AvailabilityZoneFilter has been removed in a later release (Zed or Antelope). From reviewing the logs above, assuming you are using the default config options, this is working as expected, as placement filtered the hosts before the request got to the scheduler filters.
Best Regards Marc Vorwerk + Maximilian Stinsky
On Wednesday, March 20, 2024 14:20 CET, smooney@redhat.com wrote:
On Wed, 2024-03-20 at 13:06 +0100, Tobias Urdin wrote:
Hello,
This sounds familiar.
If no availability zone was selected when the instance was spawned, the "request spec" (saved in the database) does not contain an availability zone, and the scheduler will allow that instance to be scheduled to another availability zone because the original request did not include a specific availability zone.
Correct. Live and cold migration are fully supported between availability zones, provided the operator, when installing Nova, exchanged SSH keys across all nodes and has not placed a firewall or similar between them.
As you said, if an instance did not request an AZ when created, and one was not added by the scheduler or a volume (with cross_az_attach=false), then the request_spec will not have an AZ. Scheduling by design does not consider the AZ that the instance is currently in, only the one in the request spec.
Cross-AZ migration is a core feature of Nova, not a bug, and is expected to work by default in any deployment unless the operator has taken measures to prevent it. AZs in OpenStack are not fault domains and are not comparable to AWS availability zones; an AWS availability zone is closer to a Keystone region than it is to a Nova AZ.
If you search for "request spec" on the mailing list you'll see that there have been multiple threads about this, with a lot of details that will help you out.
In this cycle we added the ability to view the pinned AZ from the request spec to make understanding this easier. Going forward, if you use the latest microversion (2.96), instance list and instance show will contain an additional field detailing the requested AZ if one is set in the request spec: https://docs.openstack.org/nova/latest/reference/api-microversion-history.ht...
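For example, something like this should show it (assuming the field added in 2.96 is named pinned_availability_zone; check the microversion history above if the name differs):

    openstack --os-compute-api-version 2.96 server show <instance_uuid> -c pinned_availability_zone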
When we “migrated” to using availability zones we specifically populated this data in the database (note that it’s an unsupported change so be careful).
Yes it is, but if done correctly it should not directly break anything. It may be unexpected from a user point of view, and users can now use shelve to do a cross-AZ unshelve, so they still have a way to force the AZ to change if they need to. The main danger in doing this is that the request spec is stored in a JSON blob in the DB; it is easy to mess up the formatting of that blob and leave Nova unable to read it. If you do do this (and I'm not encouraging people to do it), then if your MySQL or PostgreSQL is new enough they now have functions for working with JSON, and those can be safer for updating the blob in the DB than was previously possible.
Just be sure to take a DB backup before making changes like this if you do try.
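Purely as an illustration of the JSON-function approach (a sketch only, not a recommendation; verify the table, column, and JSON path against your own schema and data before touching anything):

    -- MySQL 5.7+ sketch; PostgreSQL has jsonb_set() for the same purpose.
    -- Take a backup and check the path with a SELECT before updating.
    UPDATE nova_api.request_specs
       SET spec = JSON_SET(spec, '$."nova_object.data".availability_zone', 'az1')
     WHERE instance_uuid = '<instance_uuid>';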
Best regards Tobias
On 20 Mar 2024, at 12:51, Marc Vorwerk <marc+openstack@marc-vorwerk.de> wrote:
Dear OpenStack Community,
I am reaching out for support with an issue that is specifically affecting a single instance during migration or resize operations in our OpenStack environment. I want to emphasize that this problem is isolated and does not reflect a broader issue within our cloud setup.
The issue arises when attempting a resize of the instance's flavor, which only differs in RAM+CPU specification. Unexpectedly, the instance attempts to switch its availability zone from az1 to az2, which is not the intended behavior.
The instance entered an error state during the resize or migration process, with a fault message indicating 'ImageNotFound', because after the availability zone change the volume can't be reached. We use separate Ceph clusters per AZ.
If this was a Cinder volume, then that indicates that you have not correctly configured your cluster. By default Nova expects that all Cinder backends are accessible by all hosts; incidentally, Nova also expects the same to be true of all Neutron networks by default. Where that is not the case for Cinder volumes, you need to set [cinder]cross_az_attach=false: https://docs.openstack.org/nova/latest/configuration/config.html#cinder.cros...
As noted above, this is a feature, not a bug. The ability for an unpinned instance to change AZ at scheduling time is an intended behaviour that is often not expected by people coming from AWS, but is expected to work by long-time OpenStack users and operators. That option defaults to true as there is no expectation in general that Nova availability zones align in any way with Cinder availability zones.
If you choose to make them align then you can use that option to enforce affinity, but that is not expected to be the case in general.
There are no AZ affinity config options for Neutron, as again Neutron networks are expected to span all hosts. If you use the L3 routed networks feature in Neutron you can create an affinity between L3 segments and hosts via physnets, however that has no relationship to AZs.
AZs in Nova, Cinder, and Neutron do not model the same thing and, while they can align, they are not required to, as I said above.
For images_type=rbd there is no native scheduling support to prevent you from moving between backends. We have discussed ways to do that in the past but never implemented them. If you want to prevent hosts changing cluster with the scheduler today when using images_type=rbd, you have 2 options:
1.) You can model each Ceph cluster as a separate Nova cell. By default we do not allow instances to change cell, so if you align your Ceph clusters to cell boundaries then instances will never be scheduled to a host connected to a different Ceph cluster.
2.) The other option is to manually configure the scheduler to enforce this. There are several ways to do that: via a scheduler filter and host aggregate metadata to map a flavor/image/tenant to a host aggregate, or alternatively via the required-traits functionality of placement using the isolating aggregates feature: https://docs.openstack.org/nova/latest/reference/isolate-aggregates.html Effectively, you can advertise CUSTOM_CEPH_CLUSTER_1 or CUSTOM_CEPH_CLUSTER_2 on the relevant hosts via provider.yaml https://docs.openstack.org/nova/latest/admin/managing-resource-providers.htm... then create a host aggregate per set of hosts to enforce the required custom trait, and modify your flavors/images to request the relevant trait. A rough sketch follows below.
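Very roughly, option 2 with isolating aggregates could look like this (trait, aggregate, host, and flavor names are made up; check the two documents above for the exact provider.yaml schema and the [scheduler]enable_isolated_aggregate_filtering option):

    # provider.yaml on the hosts attached to the first ceph cluster
    meta:
      schema_version: '1.0'
    providers:
      - identification:
          uuid: $COMPUTE_NODE
        traits:
          additional:
            - CUSTOM_CEPH_CLUSTER_1

    # then group those hosts in an aggregate that requires the trait and
    # make the flavor request it
    openstack aggregate create ceph-cluster-1
    openstack aggregate add host ceph-cluster-1 compute-01
    openstack aggregate set --property trait:CUSTOM_CEPH_CLUSTER_1=required ceph-cluster-1
    openstack flavor set --property trait:CUSTOM_CEPH_CLUSTER_1=required <flavor>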
Unfortunately, both approaches are hard to implement for existing deployments; this is really something best planned for and executed when you are commissioning a cloud for the first time.
What I have wanted to do for a long time, but have not had the time to propose or implement, is have Nova model the Ceph cluster and storage backend in use in placement so we can automatically schedule on it. I more or less know what would be required to do that, but while this is an occasional pain point for operators, it is a long-understood limitation and not one that has been prioritised.
If this is something that people are interested in seeing addressed for images_type=rbd specifically, then feedback from operators that they care about it would be appreciated, but I cannot commit to addressing it in the short to medium term. For now my recommendation is: if you are deploying a new cloud, have images_type=rbd, and plan to have multiple Ceph clusters that are not accessible by all hosts, then you should create one Nova cell per Ceph cluster. Cells are relatively cheap to create in Nova; you can share the same database server/RabbitMQ instance between cells if you are not using cells for scaling, and you can change that after the fact if you later find you need to scale. You can also colocate multiple conductors on the same host for different cells, provided your installation tool can accommodate that; we do that in devstack and it is perfectly fine to do in production. Cells are primarily a scaling/sharding mechanism in Nova but can be helpful for this use case too. If you do have one cell per Ceph cluster, you can also create one AZ per cell if you want to allow end users to choose the cluster, but that is optional.
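For reference, creating an additional cell is roughly the following (connection strings are placeholders; see the nova-manage docs for details):

    nova-manage cell_v2 create_cell --name cell-ceph2 \
      --transport-url rabbit://user:pass@rabbit-host:5672/ \
      --database_connection mysql+pymysql://user:pass@db-host/nova_cell2
    # map new compute hosts into the cell once they are registered
    nova-manage cell_v2 discover_hosts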
AZs can span cells and cells can contain multiple different AZs; the two concepts are entirely unrelated in Nova. Cells are an architectural choice for scaling Nova to thousands of compute nodes; AZs are just a label on a host aggregate with no other meaning. Neither are fault domains, but both are often incorrectly assumed to be.
Cells should not be required for this use case to work out of the box, but no one in the community has ever had time to work on the correct long-term solution of modelling storage backends in placement. That is sad, as that was one of the original primary use cases that placement was created to solve.
To debug this issue we enabled debug logs for the nova scheduler and found that the scheduler does not filter out any node with any of our enabled filter plugins. As a quick test we disabled a couple of compute nodes and could also verify that the ComputeFilter of the nova-scheduler still just returns the full list of all nodes in the cluster. So it seems to us that for some reason the nova-scheduler just ignores all enabled filter plugins for the mentioned instance. It's worth noting that other instances with the same flavor on the same compute node do not exhibit these issues, highlighting the unique nature of this problem.
Furthermore, we checked all relevant database tables to see if something strange was saved for this instance for some reason, but it seems to us that the instance has exactly the same attributes as other instances on this node.
We are seeking insights or suggestions from anyone who might have experienced similar issues or has knowledge of potential causes and solutions. What specific logs or configuration details would be helpful for us to provide to facilitate further diagnosis?
We greatly appreciate any guidance or assistance the community can offer.
Best regards, Marc Vorwerk + Maximilian Stinsky