[openstack-dev] [nova] Better tests for nova scheduler (esp. race conditions)?

Balázs Gibizer balazs.gibizer at ericsson.com
Thu Jan 28 09:23:44 UTC 2016


> -----Original Message-----
> From: Cheng, Yingxin [mailto:yingxin.cheng at intel.com]
> Sent: January 28, 2016 04:21
> To: OpenStack Development Mailing List (not for usage questions)
> Subject: Re: [openstack-dev] [nova] Better tests for nova scheduler(esp.
> race conditions)?
> 
> Thank you Nikola! I'm very interested in this.
> 
> 
> According to my current understanding, a complete functional test for the
> nova scheduler should include nova-api, the scheduler service, the part of
> the conductor service that forwards scheduler decisions to compute services,
> and the part of the compute service covering claims, claim aborts, and
> compute-node resource consumption inside the resource tracker.
> 
> The inputs of this series of tests are the initial resource view, the
> existing resource consumption from fake instances, and the incoming schedule
> requests with their flavors.
> 
> The outputs are the statistics of elapsed time in every scheduling phase, the
> statistics of the requests' lifecycles, and the sanity of the final resource
> view with the booted fake instances.
> 
> Extra features should also be taken into consideration, including, but not
> limited to: image properties, host aggregates, availability zones, compute
> capabilities, server groups, compute service status, forced hosts, and metrics.
> 
> Please correct me if anything is wrong; I'd also like to know the existing
> decisions and ideas from the mid-cycle sprint.
> 
> 
> I'll start by investigating the existing functional test infrastructure;
> this could go much more quickly if anyone (maybe Sean Dague) can provide an
> introduction to the existing features. I've also seen others showing interest
> in this area -- Chris Dent (cdent). It would be great to work with other
> experienced contributors in the community.

I think https://github.com/openstack/nova/blob/master/nova/tests/functional/test_server_group.py 
is a good place to start. It uses a real API, a real conductor, and a real scheduler service. The compute 
services use fakelibvirt. You can start multiple compute services and boot instances with various 
scheduling scenarios.
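The race the thread keeps coming back to can also be sketched as a toy model. The sketch below is not nova code; `FakeComputeNode`, the vCPU numbers, and the scheduling loop are all invented for illustration. It only mirrors the shape of the two-phase claim described later in the thread: the scheduler picks a host from a possibly stale view, and the destination host re-validates the claim under its own lock.

```python
# Toy model (not nova code) of scheduling plus a two-phase claim:
# phase 1 picks a host from a cached view, phase 2 re-checks the
# claim under the destination host's lock before consuming resources.
import threading

class FakeComputeNode:
    def __init__(self, name, vcpus):
        self.name = name
        self.free_vcpus = vcpus
        self._lock = threading.Lock()

    def claim(self, vcpus):
        # Phase 2: re-validate on the destination host, analogous to
        # the Claims logic inside nova's resource tracker.
        with self._lock:
            if self.free_vcpus >= vcpus:
                self.free_vcpus -= vcpus
                return True
            return False  # would trigger a retry/reschedule in real nova

def schedule(nodes, vcpus):
    # Phase 1: pick the host that looks least loaded in the (possibly
    # stale) scheduler view, then try to claim on it; fall through to
    # the next host if the claim fails.
    for node in sorted(nodes, key=lambda n: -n.free_vcpus):
        if node.claim(vcpus):
            return node.name
    return None  # NoValidHost

nodes = [FakeComputeNode('host1', 4), FakeComputeNode('host2', 2)]
results = []
threads = [threading.Thread(target=lambda: results.append(schedule(nodes, 2)))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# 6 vCPUs fit exactly three 2-vCPU requests; the locked claim step
# keeps any host from going negative even under concurrent requests.
```

Running this with four concurrent 2-vCPU requests, three land (two on host1, one on host2) and one gets no valid host; without the locked re-check in `claim()`, concurrent requests could both see stale free capacity and overcommit, which is the allocation-ratio violation discussed in the quoted thread.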

Cheers,
Gibi

> 
> 
> 
> Regards,
> -Yingxin
> 
> 
> > -----Original Message-----
> > From: Nikola Đipanov [mailto:ndipanov at redhat.com]
> > Sent: Wednesday, January 27, 2016 9:58 PM
> > To: OpenStack Development Mailing List (not for usage questions)
> > Cc: Cheng, Yingxin
> > Subject: Re: [openstack-dev] [nova] Better tests for nova
> > scheduler(esp. race conditions)?
> >
> > Top posting since better scheduler testing just got brought up during
> > the midcycle meetup, so it might be useful to re-kindle this thread.
> >
> > Sean (Dague) brought up that there is some infrastructure already that
> > could help us do what you propose below, but work may be needed to
> > make it viable for proper resource accounting tests.
> >
> > Yingxin - in case you are still interested in doing some of this
> > stuff, we can discuss here or on IRC.
> >
> > Thanks,
> > Nikola
> >
> > On 12/15/2015 03:33 AM, Cheng, Yingxin wrote:
> > >
> > >> -----Original Message-----
> > >> From: Nikola Đipanov [mailto:ndipanov at redhat.com]
> > >> Sent: Monday, December 14, 2015 11:11 PM
> > >> To: OpenStack Development Mailing List (not for usage questions)
> > >> Subject: Re: [openstack-dev] [nova] Better tests for nova
> > >> scheduler(esp. race conditions)?
> > >>
> > >> On 12/14/2015 08:20 AM, Cheng, Yingxin wrote:
> > >>> Hi All,
> > >>>
> > >>>
> > >>>
> > >>> While looking at bugs related to race conditions in the
> > >>> scheduler [1-3], it feels like the nova scheduler lacks sanity
> > >>> checks of its schedule decisions in different situations. We cannot
> > >>> even make sure that some fixes successfully mitigate race
> > >>> conditions to an acceptable degree. For example, there is no easy
> > >>> way to test whether the server-group race condition still exists
> > >>> after a fix for bug [1], to make sure that after scheduling
> > >>> there will be no violations of allocation ratios as reported by
> > >>> bug [2], or to test that the retry rate is acceptable in the
> > >>> various corner cases proposed by bug [3]. And there will be much
> > >>> more on this list.
> > >>>
> > >>>
> > >>>
> > >>> So I'm asking whether there is a plan to add those tests in the
> > >>> future, or whether a design exists to simplify writing and
> > >>> executing those kinds of tests. I'm thinking of using fake
> > >>> databases and fake interfaces to isolate the entire scheduler
> > >>> service, so that we can easily build up a disposable environment
> > >>> with all kinds of fake resources and fake compute nodes to test
> > >>> scheduler behavior. It would even be a good way to test whether
> > >>> the scheduler can scale to 10k nodes without setting up 10k real
> > >>> compute nodes.
> > >>>
> > >>
> > >> This would be a useful effort - however, do not assume that it is
> > >> going to be an easy task. Even in the paragraph above, you fail to
> > >> take into account that in order to test scheduling you also need
> > >> to run all the compute services, since claims work like a kind of
> > >> two-phase commit where a scheduling decision gets checked on the
> > >> destination compute host (through the Claims logic), which involves
> > >> locking in each compute process.
> > >>
> > >
> > > Yes, the final goal is to test the entire scheduling process,
> > > including the two-phase commit. As the scheduler is still in the
> > > process of being decoupled, some parts such as the resource tracker
> > > and the retry mechanism are highly coupled with nova, so IMO it is
> > > not a good idea to include them at this stage. I'll therefore try
> > > to isolate the filter scheduler as the first step, and hope to be
> > > supported by the community.
> > >
> > >
> > >>>
> > >>>
> > >>> I'm also interested in the blueprint [4] to reduce scheduler race
> > >>> conditions at the green-thread level. I think it is a good starting
> > >>> point for solving the huge racing problem of the nova scheduler,
> > >>> and I really wish I could help with that.
> > >>>
> > >>
> > >> I proposed said blueprint but am very unlikely to have any time to
> > >> work on it this cycle, so feel free to take a stab at it. I'd be
> > >> more than happy to prioritize any reviews related to the above BP.
> > >>
> > >> Thanks for your interest in this
> > >>
> > >> N.
> > >>
> > >
> > > Many thanks, Nikola! I'm still looking at the claim logic and trying
> > > to find a way to merge it with the scheduler host state; I will
> > > upload patches as soon as I figure it out.
> > >
> > >
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> [1] https://bugs.launchpad.net/nova/+bug/1423648
> > >>>
> > >>> [2] https://bugs.launchpad.net/nova/+bug/1370207
> > >>>
> > >>> [3] https://bugs.launchpad.net/nova/+bug/1341420
> > >>>
> > >>> [4]
> > >>> https://blueprints.launchpad.net/nova/+spec/host-state-level-locking
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> Regards,
> > >>>
> > >>> -Yingxin
> > >>>
> > >
> > >
> > >
> > > Regards,
> > > -Yingxin
> > >
> > >
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
