[kolla-ansible][nova]Problem with distribution of instance on servers

Tony Liu tonyliu0592 at hotmail.com
Wed Feb 16 16:37:01 UTC 2022

I totally understand the intention and the risk.
What I'd expect is some way to 1) expose such failures for a monitoring system to detect,
e.g. via the local nova-compute API or the global nova-api (we also collect and analyze logs,
which triggers an alarm when a failure happens, but it would be easier to get this info from
an API), so that an operator can jump in and fix it, and 2) reset the failure flag to recover.

From: Sean Mooney <smooney at redhat.com>
Sent: February 16, 2022 04:45 AM
To: Tony Liu; Laurent Dumont
Cc: Franck VEDEL; openstack-discuss
Subject: Re: [kolla-ansible][nova]Problem with distribution of instance on servers

On Wed, 2022-02-16 at 02:35 +0000, Tony Liu wrote:
> Build failures can be caused by different things: networking, storage, hypervisor, etc.
> For example, a failure caused by the Neutron service doesn't mean the hypervisor is
> unhealthy, but because of that weigher, even after the Neutron service recovers, the
> hypervisor is still excluded from holding instances. That doesn't make sense to me.
> I wouldn't enable this weigher until it's smart enough to know whether the failure was
> caused by the hypervisor itself rather than anything else.
This is enabled by default on all deployments and has been for many years at this point.
We strongly recommend that it is used.

You can elect to disable it, but if you do, you can end up with VMs constantly being scheduled to the same set of broken hosts.
This becomes more apparent as the deployment gets more full.

While you could reduce the weight of this weigher, its high multiplier was chosen so that it could override the votes of the other weighers.

We likely could improve the weigher, perhaps by having it age out the failed builds to account for transient failures,
or by providing a nova-manage command to allow operators to reset the value for a host, or something like that. But
in a healthy cloud you should not get failed builds that land on a host rather than in cell0.

You can get failed builds where there is no host available, but those will land in cell0 and not affect the host failure count.
You can also get failed builds due to quota etc., but that is validated in the API before we try to build the instance, so if you
are getting failed builds it should be an indication that you have at least a transient problem with your deployment that should be fixed.
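(Editor's note: for reference, the weigher discussed above is controlled in nova.conf. A minimal sketch, showing the upstream default value rather than a recommendation to change it:)

```ini
[filter_scheduler]
# BuildFailureWeigher: hosts with recent failed builds are weighed down
# heavily so the scheduler avoids them. The very large default multiplier
# is intentional, so this weigher can outvote the other weighers.
build_failure_weight_multiplier = 1000000.0
```

A successful build on a host resets its failed-build count, which is how a recovered host re-enters normal scheduling.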
> Tony
> ________________________________________
> From: Laurent Dumont <laurentfdumont at gmail.com>
> Sent: February 15, 2022 05:00 PM
> To: Tony Liu
> Cc: Franck VEDEL; openstack-discuss
> Subject: Re: [kolla-ansible][nova]Problem with distribution of instance on servers
> In a healthy setup, should build_failure_weight_multiplier be triggered?
> From the docs, tweaking this might mean you try to schedule and build instances on computes that are not healthy.
> On Tue, Feb 15, 2022 at 6:38 PM Tony Liu <tonyliu0592 at hotmail.com> wrote:
> Enable debug logging on nova-scheduler, you will see how the winner is picked.
> I had the same issue before, caused by the build-failure weigher being enabled by default.
> Setting build_failure_weight_multiplier to 0 resolved the issue for me. Instances are
> balanced by the weighers (compute and memory) as expected.
> shuffle_best_same_weighed_hosts and host_subset_size are not necessary, unless
> required in certain cases.
> Tony
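(Editor's note: Tony's workaround above, expressed as a nova.conf fragment. Zeroing the multiplier disables the build-failure weigher entirely, so weigh the trade-off Sean describes before doing this:)

```ini
[filter_scheduler]
# 0.0 makes the BuildFailureWeigher a no-op: hosts with past failed
# builds are no longer penalized in scheduling decisions.
build_failure_weight_multiplier = 0.0
```

With kolla-ansible, overrides like this typically go in /etc/kolla/config/nova.conf, followed by a reconfigure of the nova services.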
> ________________________________________
> From: Laurent Dumont <laurentfdumont at gmail.com>
> Sent: February 15, 2022 12:54 PM
> To: Franck VEDEL
> Cc: openstack-discuss
> Subject: Re: [kolla-ansible][nova]Problem with distribution of instance on servers
> There are two settings we've tweaked in the past in Nova.
> shuffle_best_same_weighed_hosts  --> Allow more spreading in the case of computes with the exact same specs/weights.
> host_subset_size --> Helps with concurrent requests to get different hosts
> Before that, we saw the same behavior, with OpenStack stacking VMs on single computes. It still respects anti-affinity, but I don't see a good reason not to spread by default. Changing these two was enough to let our spread get a little better.
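(Editor's note: the two settings Laurent mentions, sketched as a nova.conf fragment; the host_subset_size value here is an illustrative assumption to tune per deployment:)

```ini
[filter_scheduler]
# Randomly shuffle among hosts that tie on weight, instead of always
# picking the first, so identical computes get spread.
shuffle_best_same_weighed_hosts = true
# Pick randomly from the top N weighed hosts; helps concurrent requests
# land on different computes. The default is 1.
host_subset_size = 3
```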
> On Tue, Feb 15, 2022 at 11:19 AM Franck VEDEL <franck.vedel at univ-grenoble-alpes.fr> wrote:
> Hello,
> I seem to have a problem that I hadn't seen before.
> I have 3 servers for my OpenStack cloud, built with kolla-ansible; I'm on the Victoria version.
> I had simply put the 3 servers in the [compute] part of the multinode file. At first it worked, but for some time now all the VMs have been placed on server 1.
> The 3 servers are operational and identical. Here are 3 screenshots to show it. (On the images, the instances on servers 2 and 3 are present because it worked correctly before, but no more instances are created on these servers now.)
> I tried to understand how instances are distributed on the servers, but in my case, I don't understand why none are assigned to the 2nd and 3rd servers.
> How do I find the problem? It should be nova-scheduler. Do I have to do anything special? Should I check whether a parameter has a bad value?
> Thanks in advance if you can help me.
> Franck VEDEL
