[Openstack-operators] Update on Nova scheduler poor performance with Ironic

Marc Heckmann marc.heckmann at ubisoft.com
Wed Aug 31 20:46:35 UTC 2016


Hi,

On Wed, 2016-08-31 at 13:46 -0400, Mathieu Gagné wrote:
> On Wed, Aug 31, 2016 at 1:33 AM, Joshua Harlow
> <harlowja at fastmail.com> wrote:
> > > Enabling this option will make it so the Nova scheduler loads
> > > instance info asynchronously at startup. Depending on the number
> > > of hypervisors and instances, it can take several minutes. (We
> > > are talking about 10-15 minutes with 600+ Ironic nodes, or ~1s
> > > per node in our case.)
> > 
> > This feels like a classic thing that could just be made better by a
> > scatter/gather (in threads or other?) to the database or other
> > service. 1s per node seems ummm, sorta bad and/or non-optimal (I
> > wonder if this is low hanging fruit to improve this). I can travel
> > around the world 7.5 times in that amount of time (if I was a light
> > beam, haha).
> This behavior was only triggered under the following conditions:
> - Nova Kilo
> - scheduler_tracks_instance_changes=False
> 
> So someone installing the latest Nova version won't have this issue.
> Furthermore, if you enable scheduler_tracks_instance_changes,
> instances will be loaded asynchronously in chunks when nova-scheduler
> starts (10 compute nodes at a time). But Jim found that enabling this
> config causes OOM errors.
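
For what it's worth, the chunked startup loading described above could
be sketched roughly like this. This is a minimal, hypothetical sketch;
the function and parameter names are illustrative, not actual Nova
internals:

```python
# Hypothetical sketch of loading per-host instance info in chunks at
# scheduler startup (10 compute hosts at a time, as described above).
# Names here are illustrative, NOT actual Nova code.

def load_instance_info(compute_hosts, fetch_instances, chunk_size=10):
    """Build the scheduler's instance cache, chunk_size hosts at a time."""
    cache = {}
    for i in range(0, len(compute_hosts), chunk_size):
        # Only query a small batch of hosts per iteration, so the
        # scheduler never holds one giant result set in memory at once.
        for host in compute_hosts[i:i + chunk_size]:
            cache[host] = fetch_instances(host)
    return cache
```

The point of the chunking is to bound the size of any single query, not
to speed up the total load time.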

Somewhat of a thread hijack, but it's funny that this comes up now.
We've been getting OOMs on some of our Liberty controllers in the past
couple of weeks, in part because of Nova scheduler memory usage
(10GiB+ right at startup).

We just now disabled "scheduler_tracks_instance_changes" and I can
confirm that memory usage has become reasonable again.

I admit that we're having a hard time figuring out exactly which
scheduler filters rely on that option, though.

> 
> So I investigated and found a very interesting bug that presents
> itself if you run Nova in the Ironic context, or anywhere a single
> nova-compute process manages multiple or a LOT of hypervisors. As
> explained previously, Nova loads the list of instances per compute
> node to help with placement decisions:
> https://github.com/openstack/nova/blob/kilo-eol/nova/scheduler/host_manager.py#L590
>
> Again, in the Ironic context, a single nova-compute host manages ALL
> instances. This means this specific line found in _add_instance_info
> will load ALL instances managed by that single nova-compute host.
> What's even funnier is that _add_instance_info is called from
> get_all_host_states for every compute node (hypervisor), NOT per
> nova-compute host. This means if you have 2000 hypervisors (Ironic
> nodes), this function will load 2000 instances per hypervisor found
> in get_all_host_states, ending with an overall process loading 2000^2
> rows from the database. Now I know why Jim Roll complained about OOM
> errors. objects.InstanceList.get_by_host_and_node should be used
> instead, NOT objects.InstanceList.get_by_host. Will report this bug
> soon.
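
To make the 2000^2 behavior concrete, here is a minimal, hypothetical
sketch (not the actual Nova code) of why querying by host instead of by
(host, node) blows up in the Ironic context:

```python
# Illustrative sketch of the bug described above; NOT actual Nova code.
# With Ironic, one nova-compute "host" manages every node, so a
# by-host query returns every instance, once per hypervisor iterated.

def get_by_host(instances, host):
    # Returns ALL instances managed by this nova-compute host.
    return [i for i in instances if i["host"] == host]

def get_by_host_and_node(instances, host, node):
    # Returns only the instances placed on this specific hypervisor.
    return [i for i in instances
            if i["host"] == host and i["node"] == node]

# 2000 Ironic nodes, each with one instance, all managed by a single
# nova-compute host called "ironic-compute" (hypothetical name).
instances = [{"host": "ironic-compute", "node": "node-%d" % n}
             for n in range(2000)]
nodes = ["node-%d" % n for n in range(2000)]

# get_all_host_states iterates over every node; the buggy variant
# loads every instance on the host for each of them.
rows_buggy = sum(len(get_by_host(instances, "ironic-compute"))
                 for _ in nodes)
rows_fixed = sum(len(get_by_host_and_node(instances, "ironic-compute", n))
                 for n in nodes)

print(rows_buggy)  # 4000000 rows (2000^2)
print(rows_fixed)  # 2000 rows
```

With libvirt, host and node are effectively one-to-one, which is why
the by-host query goes unnoticed outside the Ironic case.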
> 
> 
> > > There is a lot of side-effects to using it though. For example:
> > > - you can only run ONE nova-scheduler process since cache state
> > > won't
> > > be shared between processes and you don't want instances to be
> > > scheduled twice to the same node/hypervisor.
> > 
> > Out of curiosity, do you have only one scheduler process active,
> > with passive scheduler process(es) idle, waiting to become active
> > if the other scheduler dies? (pretty simply done via something like
> > https://kazoo.readthedocs.io/en/latest/api/recipe/election.html) Or
> > do you have some manual/other process that kicks off a new
> > scheduler if the 'main' one dies?
> We use the HA feature of our virtualization infrastructure to handle
> failover. This is a compromise we are willing to accept for now. I
> agree that not everybody has access to this kind of feature in their
> infra.
> 
> 
> > > 2) Run a single nova-compute service
> > > 
> > > I strongly suggest you DO NOT run multiple nova-compute services.
> > > If you do, you will have duplicated hypervisors loaded by the
> > > scheduler and you could end up with conflicting scheduling. You
> > > will also have twice as many hypervisors to load in the
> > > scheduler.
> > 
> > This seems scary (whenever I hear "run a single of anything" in a
> > *cloud* platform, that makes me shiver). It'd be nice if we at
> > least recommended people run
> > https://kazoo.readthedocs.io/en/latest/api/recipe/election.html
> > or have some active/passive automatic election process to handle
> > that single thing dying (which they usually do, at odd times of the
> > night). Honestly I'd (personally) really like to get to the bottom
> > of how we as a group of developers ever got to the place where
> > software was released (and/or even recommended to be used) in a
> > *cloud* platform that ever required only one of anything to be run
> > (that's crazy bonkers, and yes there is history here, but damn, it
> > just feels rotten as all hell, for lack of better words).
> Same as above. If the nova-compute process stops, customers won't
> lose access to their baremetal servers, but they won't be able to
> manage them (create, start, stop). In our context, that's not
> something they do often. In fact, we more often than not deliver the
> baremetal for them in their projects/tenants and they pretty much
> never touch the API anyway.
> 
> Also there is the hash ring feature coming in the latest Nova
> version. Meanwhile, we are happy with the compromise.
> 
> 
> > > 3) Increase service_down_time
> > > 
> > > If you have a lot of nodes, you might have to increase this
> > > value, which is set to 60 seconds by default. This value is used
> > > by the ComputeFilter filter to exclude nodes it hasn't heard
> > > from. If it takes more than 60 seconds to load the list of nodes,
> > > you can guess what will happen: the scheduler will reject all of
> > > them, since node info is already outdated by the time it finally
> > > hits the filtering steps. I strongly suggest you tweak this
> > > setting, regardless of whether you use the CachingScheduler.
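
As a hedged example, the tweak above could look something like this in
nova.conf on the scheduler host; the value here is purely illustrative,
and you should pick something larger than your worst-case node listing
time:

```ini
[DEFAULT]
# Default is 60 seconds. The ComputeFilter rejects any node whose
# service heartbeat is older than this, so with thousands of Ironic
# nodes a longer window avoids rejecting everything while the node
# list is being built. The value below is only an example.
service_down_time = 300
```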
> > 
> > Same kind of feeling I had above also applies; something feels
> > broken if such things have to be found by operators (I'm pretty
> > sure Yahoo, when I was there, saw something similar) and not by the
> > developers making the software. If I could (and I know I really
> > can't due to the community we work in) I'd very much like to have
> > an equivalent of a retrospective around how these kinds of
> > solutions got built and how they ended up getting released to the
> > wider public with such flaws....
> The bug got fixed by Jim Roll as pointed out earlier. So I think this
> particular recommendation might not apply if you are using the latest
> Nova version.
> 
> Bugs happen ¯\_(ツ)_/¯ and it just happens that someone caught it when
> using Ironic in Liberty. We would have caught it too if we had paid
> more attention to performance, done scaling tests, and profiled the
> code a bit more before complaining publicly.
> 
> But the other bug I found and mentioned above still exists.
> Fortunately, it won't show up in the Ironic context anymore, since
> Jim made it so the Ironic host manager never loads the list of
> instances per node; it's something we don't care about with
> baremetal. But if you are running Kilo, you are out of luck and will
> be hitting all this madness.
> 
> 
> > > [1] https://github.com/openstack/nova/blob/kilo-eol/nova/scheduler/host_manager.py#L589-L592
> > > [2] https://github.com/openstack/nova/blob/kilo-eol/nova/scheduler/host_manager.py#L65-L68
> > > [3] http://docs.openstack.org/developer/ironic/deploy/install-guide.html#configure-compute-to-use-the-bare-metal-service
> > > [4] https://github.com/openstack/nova/blob/282c257aff6b53a1b6bb4b4b034a670c450d19d8/nova/conf/scheduler.py#L166-L185
> > > [5] https://bugs.launchpad.net/nova/+bug/1479124
> > > [6] https://www.youtube.com/watch?v=BcHyiOdme2s
> > > [7] https://gist.github.com/mgagne/1fbeca4c0b60af73f019bc2e21eb4a80
> --
> Mathieu
> 
> _______________________________________________
> OpenStack-operators mailing list
> OpenStack-operators at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators


More information about the OpenStack-operators mailing list