[openstack-dev] [nova] Better tests for nova scheduler(esp. race conditions)?
Nikola Đipanov
ndipanov at redhat.com
Tue Dec 15 10:03:13 UTC 2015
On 12/15/2015 03:33 AM, Cheng, Yingxin wrote:
>
>> -----Original Message-----
>> From: Nikola Đipanov [mailto:ndipanov at redhat.com]
>> Sent: Monday, December 14, 2015 11:11 PM
>> To: OpenStack Development Mailing List (not for usage questions)
>> Subject: Re: [openstack-dev] [nova] Better tests for nova scheduler(esp. race
>> conditions)?
>>
>> On 12/14/2015 08:20 AM, Cheng, Yingxin wrote:
>>> Hi All,
>>>
>>>
>>>
>>> When I was looking at bugs related to scheduler race conditions
>>> [1-3], it seemed to me that the nova scheduler lacks sanity checks on
>>> its scheduling decisions in different situations. We cannot even make
>>> sure that some fixes successfully mitigate race conditions to an
>>> acceptable scale. For example, there is no easy way to test whether
>>> server-group race conditions still exist after a fix for bug[1], or to
>>> make sure that after scheduling there will be no violations of the
>>> allocation ratios reported by bug[2], or to test that the retry rate is
>>> acceptable in the various corner cases proposed by bug[3]. And there
>>> will be many more items in this list.
>>>
>>>
>>>
>>> So I'm asking whether there is a plan to add those tests in the
>>> future, or whether a design exists to simplify writing and executing
>>> those kinds of tests. I'm thinking of using fake databases and fake
>>> interfaces to isolate the entire scheduler service, so that we can
>>> easily build up a disposable environment with all kinds of fake
>>> resources and fake compute nodes to test scheduler behaviors. It is
>>> even a good way to test whether the scheduler is able to scale to 10k
>>> nodes without setting up 10k real compute nodes.
>>>
>>
>> This would be a useful effort - however do not assume that this is going to be an
>> easy task. Even in the paragraph above, you fail to take into account that in
>> order to test the scheduling you also need to run all compute services since
>> claims work like a kind of 2 phase commit where a scheduling decision gets
>> checked on the destination compute host (through Claims logic), which involves
>> locking in each compute process.
>>
>
> Yes, the final goal is to test the entire scheduling process including 2PC.
> As the scheduler is still in the process of being decoupled, some parts such as RT
> and the retry mechanism are highly coupled with nova, thus IMO it is not a good idea to
> include them at this stage. So I'll try to isolate the filter-scheduler as the first step,
> and I hope this will be supported by the community.
>
>
>>>
>>>
>>> I'm also interested in the bp[4] to reduce scheduler race conditions
>>> at the green-thread level. I think it is a good starting point for solving
>>> the huge racing problem of the nova scheduler, and I really wish I could help with that.
>>>
>>
>> I proposed said blueprint but am very unlikely to have any time to work on it this
>> cycle, so feel free to take a stab at it. I'd be more than happy to prioritize any
>> reviews related to the above BP.
>>
>> Thanks for your interest in this
>>
>> N.
>>
>
> Many thanks Nikola! I'm still looking at the claim logic and trying to find a way to merge
> it with the scheduler host state, and will upload patches as soon as I figure it out.
>
Great!
Note that that step is not necessary - and indeed it may not be the best
place to start. We already have code duplication between the claims and
(what has only recently been renamed) consume_from_request, so removing
it is a nice-to-have but really not directly related to fixing the races.
Also, after Sylvain's work here https://review.openstack.org/#/c/191251/
it will be trickier to do, as the scheduler side now uses the RequestSpec
object instead of Instance, which is not sent over to compute nodes.
I'd personally leave that for last.
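
To make the two-phase behaviour discussed earlier in this thread concrete
(the scheduler picks a host from its own view, then the destination host
re-checks the decision under a lock), here is a minimal sketch. It is purely
illustrative and does not use any nova code: FakeComputeHost, schedule and
run_demo are hypothetical names, RAM is the only resource modelled, and a
plain threading.Lock stands in for the per-compute locking in the claims.

```python
import threading


class FakeComputeHost:
    """Hypothetical stand-in for a compute node; not a nova class."""

    def __init__(self, name, total_ram_mb):
        self.name = name
        self.total_ram_mb = total_ram_mb
        self.used_ram_mb = 0
        # Stands in for the lock taken on the compute side during a claim.
        self._lock = threading.Lock()

    def claim(self, ram_mb):
        """Second phase: re-check the scheduling decision under the lock."""
        with self._lock:
            if self.used_ram_mb + ram_mb > self.total_ram_mb:
                return False  # rejected; a real deployment would retry
            self.used_ram_mb += ram_mb
            return True


def schedule(host_view, ram_mb):
    """First phase: pick the host with the most free RAM from a snapshot.

    host_view maps host name -> free RAM and may be stale.
    """
    candidates = [name for name, free in host_view.items() if free >= ram_mb]
    if not candidates:
        return None
    return max(candidates, key=lambda name: host_view[name])


def run_demo(num_requests=4, ram_mb=1024):
    """Race several requests that all schedule from the same stale view."""
    hosts = {"host-a": FakeComputeHost("host-a", 2048)}
    stale_view = {n: h.total_ram_mb - h.used_ram_mb for n, h in hosts.items()}
    results = []
    results_lock = threading.Lock()

    def boot():
        chosen = schedule(stale_view, ram_mb)
        ok = chosen is not None and hosts[chosen].claim(ram_mb)
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=boot) for _ in range(num_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    host = hosts["host-a"]
    return host.used_ram_mb, results.count(True), results.count(False)


if __name__ == "__main__":
    used, accepted, rejected = run_demo()
    print("used_ram_mb=%d accepted=%d rejected=%d" % (used, accepted, rejected))
```

In this toy model every request passes the scheduling step because they all
read the same stale snapshot, but the claim only admits as many as fit, so
the host is never over-committed and the rest show up as retries. A test
harness along the lines proposed above could assert on exactly those two
numbers (over-commit and retry count) with fake hosts.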
N.
>
>>>
>>>
>>>
>>>
>>> [1] https://bugs.launchpad.net/nova/+bug/1423648
>>>
>>> [2] https://bugs.launchpad.net/nova/+bug/1370207
>>>
>>> [3] https://bugs.launchpad.net/nova/+bug/1341420
>>>
>>> [4]
>>> https://blueprints.launchpad.net/nova/+spec/host-state-level-locking
>>>
>>>
>>>
>>>
>>>
>>> Regards,
>>>
>>> -Yingxin
>>>
>
>
>
> Regards,
> -Yingxin
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>