[Openstack-operators] Update on Nova scheduler poor performance with Ironic

Mathieu Gagné mgagne at calavera.ca
Wed Aug 31 17:46:47 UTC 2016


On Wed, Aug 31, 2016 at 1:33 AM, Joshua Harlow <harlowja at fastmail.com> wrote:
>>
>> Enabling this option will make it so Nova scheduler loads instance
>> info asynchronously at start up. Depending on the number of
>> hypervisors and instances, it can take several minutes. (we are
>> talking about 10-15 minutes with 600+ Ironic nodes, or ~1s per node in
>> our case)
>
>
> This feels like a classic thing that could just be made better by a
> scatter/gather (in threads or other?) to the database or other service. 1s
> per node seems ummm, sorta bad and/or non-optimal (I wonder if this is low
> hanging fruit to improve this). I can travel around the world 7.5 times in
> that amount of time (if I was a light beam, haha).

This behavior was only triggered under the following conditions:
- Nova Kilo
- scheduler_tracks_instance_changes=False

So someone installing the latest Nova version won't have this issue.
Furthermore, if you enable scheduler_tracks_instance_changes,
instances will be loaded asynchronously in chunks (10 compute nodes at
a time) when nova-scheduler starts. But Jim found that enabling this
config causes OOM errors.
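To illustrate what "asynchronously in chunks" means, here is a rough
sketch of the idea (simplified; this is not the actual HostManager
code, and get_instances_for_host is a made-up helper):

import threading

CHUNK_SIZE = 10  # the scheduler loads instance info 10 compute hosts at a time

def _load_instance_info(instance_cache, compute_hosts, get_instances_for_host):
    # Fill the scheduler's instance cache in small chunks so startup
    # does not block on one huge database query.
    for i in range(0, len(compute_hosts), CHUNK_SIZE):
        for host in compute_hosts[i:i + CHUNK_SIZE]:
            instance_cache[host] = get_instances_for_host(host)

def start_async_instance_load(instance_cache, compute_hosts,
                              get_instances_for_host):
    # Run the load in the background so nova-scheduler can start
    # answering requests right away.
    t = threading.Thread(target=_load_instance_info,
                         args=(instance_cache, compute_hosts,
                               get_instances_for_host))
    t.daemon = True
    t.start()
    return t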

So I investigated and found a very interesting bug that shows up if
you run Nova in the Ironic context, or in any setup where a single
nova-compute process manages multiple (or a LOT of) hypervisors. As
explained previously, Nova loads the list of instances per compute
node to help with placement decisions:
https://github.com/openstack/nova/blob/kilo-eol/nova/scheduler/host_manager.py#L590

Again, in the Ironic context, a single nova-compute host manages ALL
instances. This means the specific line found in _add_instance_info
will load ALL instances managed by that single nova-compute host.
What's even funnier is that _add_instance_info is called from
get_all_host_states for every compute node (hypervisor), NOT per
nova-compute host. This means if you have 2000 hypervisors (Ironic
nodes), this function will load 2000 instances per hypervisor found in
get_all_host_states, ending with an overall process loading 2000^2
rows from the database. Now I know why Jim Roll complained about OOM
errors. objects.InstanceList.get_by_host_and_node should be used
instead, NOT objects.InstanceList.get_by_host. Will report this bug
soon.
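To make the blow-up concrete, here is a toy model of the behavior
(illustration only, not Nova code; plain dicts stand in for database
rows):

def get_by_host(instances, host):
    # Rough equivalent of objects.InstanceList.get_by_host: every instance
    # owned by the nova-compute host. With Ironic, that is ALL instances.
    return [i for i in instances if i["host"] == host]

def get_by_host_and_node(instances, host, node):
    # Rough equivalent of objects.InstanceList.get_by_host_and_node: only
    # the instance(s) on that specific Ironic node.
    return [i for i in instances if i["host"] == host and i["node"] == node]

NODES = 2000
instances = [{"host": "single-nova-compute", "node": "node-%d" % n}
             for n in range(NODES)]

# What get_all_host_states ends up doing today: one get_by_host per node.
rows_loaded = sum(len(get_by_host(instances, "single-nova-compute"))
                  for _ in range(NODES))
print(rows_loaded)  # 4,000,000 rows (2000^2)

# With the per-node query, the same loop only touches 2000 rows.
rows_loaded = sum(
    len(get_by_host_and_node(instances, "single-nova-compute", "node-%d" % n))
    for n in range(NODES))
print(rows_loaded)  # 2000 rows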


>>
>> There is a lot of side-effects to using it though. For example:
>> - you can only run ONE nova-scheduler process since cache state won't
>> be shared between processes and you don't want instances to be
>> scheduled twice to the same node/hypervisor.
>
>
> Out of curiosity, do you have only one scheduler process active and passive
> scheduler process(es) idle waiting to become active if the other scheduler
> dies? (pretty simply done via something like
> https://kazoo.readthedocs.io/en/latest/api/recipe/election.html) Or do you
> have some manual/other process that kicks off a new scheduler if the 'main'
> one dies?

We use the HA feature of our virtualization infrastructure to handle
failover. This is a compromise we are willing to accept for now. I
agree that not everybody has access to this kind of feature in their
infra.
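For operators who do want the automatic active/passive election
Joshua mentions (via kazoo), a minimal sketch could look like this
(assuming a reachable ZooKeeper ensemble and a start_scheduler()
wrapper of your own; both are made up for illustration):

import socket
from kazoo.client import KazooClient

def start_scheduler():
    # Placeholder: exec or supervise the real nova-scheduler process here.
    pass

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

# Only the elected leader runs start_scheduler(); the others block here
# and take over if the leader dies.
election = zk.Election("/nova-scheduler/election", socket.gethostname())
election.run(start_scheduler)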


>> 2) Run a single nova-compute service
>>
>> I strongly suggest you DO NOT run multiple nova-compute services. If
>> you do, you will have duplicated hypervisors loaded by the scheduler
>> and you could end up with conflicting scheduling. You will also have
>> twice as many hypervisors to load in the scheduler.
>
>
> This seems scary (whenever I hear run a single of anything in a *cloud*
> platform, that makes me shiver). It'd be nice if we at least recommended
> people run https://kazoo.readthedocs.io/en/latest/api/recipe/election.html
> or have some active/passive automatic election process to handle that single
> thing dying (which they usually do, at odd times of the night). Honestly I'd
> (personally) really like to get to the bottom of how we as a group of
> developers ever got to the place where software was released (and/or even
> recommended to be used) in a *cloud* platform that ever required only one of
> anything to be ran (that's crazy bonkers, and yes there is history here, but
> damn, it just feels rotten as all hell, for lack of better words).

Same as above. If the nova-compute process stops, customers won't lose
access to their baremetal, but they won't be able to manage it
(create, start, stop). In our context, that's not something they do
often. In fact, we more often than not deliver the baremetal for them
in their projects/tenants and they pretty much never touch the API
anyway.

Also, there is the hash ring feature coming in the latest Nova
version. Meanwhile, we are happy with the compromise.
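To give an idea of what the hash ring buys you, here is a simplified
illustration (not the real Nova/Ironic implementation): each Ironic
node is mapped to one of several nova-compute services, so no single
service has to manage every node.

import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing(object):
    def __init__(self, hosts, replicas=64):
        # Place several "virtual" points per host on the ring so the
        # nodes spread out evenly across hosts.
        self._ring = sorted((_hash("%s-%d" % (host, r)), host)
                            for host in hosts for r in range(replicas))
        self._keys = [k for k, _ in self._ring]

    def get_host(self, node):
        # A node is managed by the first host clockwise from its hash.
        idx = bisect.bisect(self._keys, _hash(node)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["compute-1", "compute-2", "compute-3"])
print(ring.get_host("ironic-node-0042"))  # which nova-compute owns this node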


>> 3) Increase service_down_time
>>
>> If you have a lot of nodes, you might have to increase this value
>> which is set to 60 seconds by default. This value is used by the
>> ComputeFilter filter to exclude nodes it hasn't heard from. If it
>> takes more than 60 seconds to build the list of nodes, you might guess
>> what will happen: the scheduler will reject all of them since node
>> info is already outdated when it finally hits the filtering steps. I
>> strongly suggest you tweak this setting, regardless of the use of
>> CachingScheduler.
>
>
> Same kind of feeling I had above also applies, something feels broken if
> such things have to be found by operators (I'm pretty sure yahoo when I was
> there saw something similar) and not by the developers making the software.
> If I could (and I know I really can't due to the community we work in) I'd
> very much have an equivalent of a retrospective around how these kinds of
> solutions got built and how they ended up getting released to the wider
> public with such flaws....

The bug got fixed by Jim Roll as pointed out earlier. So I think this
particular recommendation might not apply if you are using the latest
Nova version.
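For context, the service_down_time check behind that recommendation
boils down to comparing each compute service's last heartbeat against
the timeout. Roughly (a simplified sketch, not the actual
nova/servicegroup code):

import datetime

SERVICE_DOWN_TIME = 60  # seconds, the default mentioned above

def service_is_up(last_heartbeat, now=None):
    # The ComputeFilter drops any compute service whose last heartbeat is
    # older than service_down_time, so if building the host list takes
    # longer than that, everything looks "down" by filtering time.
    now = now or datetime.datetime.utcnow()
    return (now - last_heartbeat).total_seconds() <= SERVICE_DOWN_TIME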

Bugs happen ¯\_(ツ)_/¯ and it just happens that someone caught it when
using Ironic in Liberty. We would have caught it too if we had paid
more attention to performance, done scaling tests and profiled the
code a bit more before complaining publicly.

But the other bug I found and mentioned above still exists.
Fortunately, it won't show up in the Ironic context anymore since Jim
made it so the Ironic host manager never loads the list of instances
per node; it's something we don't care about with baremetal. But if
you are running Kilo, you are out of luck and will be hitting all this
madness.


>> [1]
>> https://github.com/openstack/nova/blob/kilo-eol/nova/scheduler/host_manager.py#L589-L592
>> [2]
>> https://github.com/openstack/nova/blob/kilo-eol/nova/scheduler/host_manager.py#L65-L68
>> [3]
>> http://docs.openstack.org/developer/ironic/deploy/install-guide.html#configure-compute-to-use-the-bare-metal-service
>> [4]
>> https://github.com/openstack/nova/blob/282c257aff6b53a1b6bb4b4b034a670c450d19d8/nova/conf/scheduler.py#L166-L185
>> [5] https://bugs.launchpad.net/nova/+bug/1479124
>> [6] https://www.youtube.com/watch?v=BcHyiOdme2s
>> [7] https://gist.github.com/mgagne/1fbeca4c0b60af73f019bc2e21eb4a80

--
Mathieu


