Greetings,

Thanks for the answers.

We are aware of the limitations of running one Ceph cluster per availability zone and configured the cloud at bootstrap to accommodate them.
In our setup we created one Ceph cluster per availability zone and have nova's and cinder's AZs aligned. We also defined a default availability zone in the nova configuration file.
We then have a host aggregate for each availability zone which is "tagged" with the availability_zone metadata key pointing to the correct zone.
Nova is also not allowed to cross-attach volumes between AZs, via cross_az_attach = False in our setup.
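For reference, the relevant pieces look roughly like the following in our setup; the aggregate and host names below are illustrative placeholders:

    # nova.conf (illustrative values)
    [DEFAULT]
    # AZ reported for hosts that are not in any AZ-tagged aggregate
    default_availability_zone = az1
    # AZ used for new instances that do not request one
    default_schedule_zone = az1

    # one host aggregate per AZ, tagged with the availability_zone metadata
    openstack aggregate create --zone az1 agg-az1
    openstack aggregate add host agg-az1 compute-az1-01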

The problem still remains; it is a single instance which experiences this behaviour. The instance has all the necessary metadata for the AZ that it is currently in (OS-EXT-AZ:availability_zone az1).
We also checked the request_specs table of nova_api, and there the instance is located in the correct AZ:
'.["nova_object.data"]["availability_zone"]' -> "az1" for that specific instance_uuid.
I would like to focus on the fact that even when we disable compute nodes in the cluster, the ComputeFilter of the scheduler does not filter out any host for that instance.
Here is an example log of the scheduler with its filtering for this instance:
Mar 15, 2024 @ 11:22:17.000    Starting with 146 host(s)
Mar 15, 2024 @ 11:22:17.000    Filter PciPassthroughFilter returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000    Filter ServerGroupAffinityFilter returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000    Filter ServerGroupAntiAffinityFilter returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000    Filter ImagePropertiesFilter returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000    Filter ComputeCapabilitiesFilter returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000    Filter ComputeFilter returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000    Filter AvailabilityZoneFilter returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000    Filter AggregateMultiTenancyIsolation returned 146 host(s)
Mar 15, 2024 @ 11:22:17.000    Filter AggregateInstanceExtraSpecsFilter returned 146 host(s)

Not a single filter removed anything for that instance, even though compute nodes were disabled for the test.
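(For reference, disabling a node for such a test looks roughly like this; the host name is a placeholder. The ComputeFilter should drop any host disabled this way.)

    openstack compute service set --disable --disable-reason "scheduler test" \
        compute-az1-01 nova-compute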

So we are extremely confused about what is going on with that instance, since not a single other instance is experiencing this.
I would also like to emphasize that the cloud has been running with this exact setup (a single Ceph cluster per AZ) since Pike, we are currently on Yoga, and we have never seen this behaviour before.

Best Regards
Marc Vorwerk + Maximilian Stinsky



On Wednesday, March 20, 2024 14:20 CET, smooney@redhat.com wrote:
 
On Wed, 2024-03-20 at 13:06 +0100, Tobias Urdin wrote:
> Hello,
>
> This sounds familiar.
>
> If no availability zone was selected when the instance was spawned, the “request spec” (saved in the database) does
> not contain an availability zone, and the scheduler will allow that instance to be scheduled to another availability
> zone because the original request did not include a specific availability zone.

Correct. Live and cold migration are fully supported between availability zones, provided the operator, when installing
nova, has exchanged ssh keys across all nodes and has not placed a firewall or similar between them.

As you said, if an instance did not request an AZ when created, and one was not added by the scheduler or a volume (with
cross_az_attach=false), then the request_spec will not have an AZ. Scheduling by design does not consider the AZ that the
instance is currently on, only the one in the request spec.
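For example, a create request that explicitly asks for an AZ, roughly like the following (flavor, image and network names
are placeholders), is what ends up recording the AZ in the request spec:

    openstack server create --flavor m1.small --image my-image --network my-net \
        --availability-zone az1 pinned-server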

Cross-AZ migration is a core feature of nova, not a bug, and is expected to work by default in any deployment unless the
operator has taken measures to prevent it. AZs in OpenStack are not fault domains and are not comparable to AWS
availability zones; an AWS availability zone is closer to a keystone region than it is to a nova AZ.
>
> If you search for “request spec” on the mailing list you’ll see that there have been multiple threads about that with
> a lot of details that will help you out.
In this cycle we added the ability to view the pinned AZ from the request spec to make understanding this easier.
Going forward, if you use the latest microversion (2.96), instance list and instance show will contain an additional
field detailing the requested AZ if one is set in the request spec.
https://docs.openstack.org/nova/latest/reference/api-microversion-history.html#maximum-in-2024-1-caracal
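Something along these lines should show it once you have a 2024.1 client and API (the field name below is from memory,
check the microversion history doc above):

    openstack --os-compute-api-version 2.96 server show <instance-uuid> \
        -c pinned_availability_zone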

>
> When we “migrated” to using availability zones we specifically populated this data in the database (note that it’s an
> unsupported change so be careful).
Yes it is, but if done correctly it should not directly break anything. It may be unexpected from a user's point of view,
and users can now use shelve to do a cross-AZ unshelve, so they still have a way to force the AZ to change if they need
to. The main danger in doing that is that this is stored in a JSON blob in the DB; it is easy to mess up the formatting
of that blob and leave nova unable to read it. If you do do this (and I am not encouraging people to do it), then if your
MySQL or PostgreSQL is new enough, they now have functions for working with JSON, and those can be safer for updating the
blob in the DB than was previously possible.

Just be sure to take a DB backup before making changes like this if you do try.
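A rough sketch of what that could look like on MySQL/MariaDB with JSON functions (again: unsupported, take a backup
first, and the UUID is a placeholder):

    UPDATE nova_api.request_specs
    SET spec = JSON_SET(spec, '$."nova_object.data".availability_zone', 'az1')
    WHERE instance_uuid = '<instance-uuid>';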
>
> Best regards
> Tobias
>
> > On 20 Mar 2024, at 12:51, Marc Vorwerk <marc+openstack@marc-vorwerk.de> wrote:
> >
> > Dear OpenStack Community,
> >
> > I am reaching out for support with an issue that is specifically affecting a single instance during migration or
> > resize operations in our OpenStack environment. I want to emphasize that this problem is isolated and does not
> > reflect a broader issue within our cloud setup.
> >
> > The issue arises when attempting a resize of the instance's flavor, which only differs in RAM+CPU specification.
> > Unexpectedly, the instance attempts to switch its availability zone from az1 to az2, which is not the intended
> > behavior.
As noted above, this is a feature, not a bug. The ability for an unpinned instance to change AZ at scheduling time is
intended behaviour that is often not expected by people coming from AWS, but is expected to work by long-time OpenStack
users and operators.
> >
> > The instance entered an error state during the resize or migration process, with a fault message indicating
> > 'ImageNotFound', because after the availability zone change the volume can't be reached. We use separate Ceph
> > clusters per AZ.
If this was a cinder volume, then that indicates you have not correctly configured your cluster.
By default nova expects that all cinder backends are accessible by all hosts; incidentally, nova also expects the same
to be true for all neutron networks by default.
Where that is not the case for cinder volumes, you need to set [cinder]cross_az_attach=false
https://docs.openstack.org/nova/latest/configuration/config.html#cinder.cross_az_attach
That defaults to true, as there is no expectation in general that nova availability zones align in any way with cinder
availability zones.
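In nova.conf that is just the following, and the alignment of compute and volume AZs can be eyeballed from the CLI:

    [cinder]
    cross_az_attach = False

    # compare the compute and volume AZ lists
    openstack availability zone list --compute
    openstack availability zone list --volume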

If you choose to make them align, then you can use that option to enforce affinity, but that is not expected to be the
case in general.

There is no AZ affinity config option for neutron; again, neutron networks are expected to span all hosts.
If you use the L3 routed networks feature in neutron, you can create an affinity between L3 segments and hosts via
physnets, however that has no relationship to AZs.

AZs in nova, cinder and neutron do not model the same thing and, while they can align, they are not required to, as I
said above.

For images_type=rbd there is no native scheduling support to prevent you moving between backends.
We have discussed ways to do that in the past but never implemented it.
If you want the scheduler to keep instances from moving to a host on a different Ceph cluster today when using
images_type=rbd, you have 2 options:

1.) You can model each Ceph cluster as a separate nova cell. By default we do not allow instances to change cell, so if
you align your Ceph clusters to cell boundaries then instances will never be scheduled to a host connected to a
different Ceph cluster.

2.) The other option is to manually configure the scheduler to enforce this.
There are several ways to do this via a scheduler filter and host aggregate metadata to map a flavor/image/tenant to a
host aggregate; alternatively you can use the required-traits functionality of placement via the isolating aggregates
feature.
https://docs.openstack.org/nova/latest/reference/isolate-aggregates.html
Effectively you can advertise CUSTOM_CEPH_CLUSTER_1 or CUSTOM_CEPH_CLUSTER_2 on the relevant hosts via provider.yaml
https://docs.openstack.org/nova/latest/admin/managing-resource-providers.html
then create a host aggregate per set of hosts to enforce the required custom trait
and modify your flavors/images to request the relevant trait.
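A rough, untested sketch of that second option, with placeholder trait, aggregate, host and flavor names:

    # e.g. /etc/nova/provider_config/ceph.yaml on the hosts attached to Ceph cluster 1
    meta:
      schema_version: '1.0'
    providers:
      - identification:
          uuid: '$COMPUTE_NODE'
        traits:
          additional:
            - CUSTOM_CEPH_CLUSTER_1

    # requires [scheduler]enable_isolated_aggregate_filtering = true in nova.conf
    openstack aggregate create ceph-cluster-1
    openstack aggregate add host ceph-cluster-1 compute-az1-01
    openstack aggregate set --property trait:CUSTOM_CEPH_CLUSTER_1=required ceph-cluster-1

    # flavors (or images) meant for that cluster request the same trait
    openstack flavor set --property trait:CUSTOM_CEPH_CLUSTER_1=required m1.small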

Unfortunately both approaches are hard to implement for existing deployments; this is really something best planned for
and executed when you are commissioning a cloud for the first time.

What I have wanted to do for a long time, but have not had the time to propose or implement, is to have nova model the
Ceph cluster and storage backend in use in placement so we can automatically schedule on it.
I more or less know what would be required to do that, but while this is an occasional pain point for operators,
it is a long-understood limitation and not one that has been prioritised to address.

If this is something that people are interested in seeing addressed for images_type=rbd specifically, then feedback
from operators that they care about it would be appreciated, but I cannot commit to addressing it in the short to
medium term.
For now my recommendation is: if you are deploying a new cloud, have images_type=rbd, and plan to have multiple Ceph
clusters that are not accessible by all hosts, then you should create one nova cell per Ceph cluster.
Cells are relatively cheap to create in nova; you can share the same database server/rabbitmq instance between cells if
you are not using cells for scaling, and you can change that after the fact if you later find you need to scale.
You can also colocate multiple conductors on the same host for different cells, provided your installation tool can
accommodate that. We do that in devstack and it is perfectly fine to do in production. Cells are primarily a
scaling/sharding mechanism in nova but can be helpful for this use case too. If you do have one cell per Ceph cluster,
you can also create one AZ per cell if you want to allow end users to choose the cluster, but that is optional.
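A minimal sketch of adding such a cell (connection strings and names are placeholders; each cell gets its own cell
database and, optionally, its own message queue):

    nova-manage cell_v2 create_cell --name ceph-cluster-2 \
        --database_connection 'mysql+pymysql://nova:<password>@db-host/nova_cell2' \
        --transport-url 'rabbit://openstack:<password>@mq-host:5672/'
    # then map the compute hosts registered in that cell's database
    nova-manage cell_v2 discover_hosts --verbose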

AZs can span cells and cells can contain multiple different AZs; the two concepts are entirely unrelated in nova.
Cells are an architectural choice for scaling nova to 1000s of compute nodes;
AZs are just a label on a host aggregate with no other meaning.
Neither are fault domains, but both are often incorrectly assumed to be.

Cells should not be required for this use case to work out of the box automatically, but no one in the community has
ever had time to work on the correct long-term solution of modelling storage backends in placement.
That is sad, as that was one of the original primary use cases that placement was created to solve.
> >
> > To debug this issue we enabled debug logs for the nova scheduler and found that the scheduler does not filter out
> > any node with any of our enabled filter plugins. As a quick test we disabled a couple of compute nodes and could
> > also verify that the ComputeFilter of the nova-scheduler still just returns the full list of all nodes in the
> > cluster. So it seems to us that for some reason the nova-scheduler just ignores all enabled filter plugins for
> > the mentioned instance.
> > It's worth noting that other instances with the same flavor on the same compute node do not exhibit these issues,
> > highlighting the unique nature of this problem.
> >
> > Furthermore we checked all relevant database tables to see if for some reason something strange is saved for this
> > instance but it seems to us that the instance has exactly the same attributes as other instances on this node.
> >
> > We are seeking insights or suggestions from anyone who might have experienced similar issues or has knowledge of
> > potential causes and solutions. What specific logs or configuration details would be helpful for us to provide to
> > facilitate further diagnosis?
> >
> > We greatly appreciate any guidance or assistance the community can offer.
> >
> > Best regards,
> > Marc Vorwerk + Maximilian Stinsky
>