[openstack-dev] [nova] Better tests for nova scheduler(esp. race conditions)?

John Garbutt john at johngarbutt.com
Tue Dec 15 15:39:18 UTC 2015


On 15 December 2015 at 10:03, Nikola Đipanov <ndipanov at redhat.com> wrote:
> On 12/15/2015 03:33 AM, Cheng, Yingxin wrote:
>>
>>> -----Original Message-----
>>> From: Nikola Đipanov [mailto:ndipanov at redhat.com]
>>> Sent: Monday, December 14, 2015 11:11 PM
>>> To: OpenStack Development Mailing List (not for usage questions)
>>> Subject: Re: [openstack-dev] [nova] Better tests for nova scheduler(esp. race
>>> conditions)?
>>>
>>> On 12/14/2015 08:20 AM, Cheng, Yingxin wrote:
>>>> Hi All,
>>>>
>>>>
>>>>
>>>> When I was looking at bugs related to race conditions in the
>>>> scheduler [1-3], it felt like the nova scheduler lacks sanity checks
>>>> of its scheduling decisions under different situations. We cannot
>>>> even be sure that some fixes successfully mitigate race conditions
>>>> to an acceptable degree. For example, there is no easy way to test
>>>> whether the server-group race condition still exists after a fix for
>>>> bug [1], to make sure that after scheduling there will be no
>>>> violations of the allocation ratios reported by bug [2], or to test
>>>> that the retry rate is acceptable in the various corner cases
>>>> described by bug [3]. And there is much more on this list.
>>>>
>>>>
>>>>
>>>> So I'm asking whether there is a plan to add those tests in the
>>>> future, or whether a design exists to simplify writing and executing
>>>> these kinds of tests. I'm thinking of using fake databases and fake
>>>> interfaces to isolate the entire scheduler service, so that we can
>>>> easily build up a disposable environment with all kinds of fake
>>>> resources and fake compute nodes to test scheduler behaviors. It
>>>> would even be a good way to test whether the scheduler can scale to
>>>> 10k nodes without setting up 10k real compute nodes.
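
A harness along those lines could look roughly like this. This is a
minimal, self-contained sketch using toy stand-ins (FakeComputeNode and a
greedy most-free-RAM picker are illustrative names, not real nova classes)
just to show the "disposable cloud of fake nodes" idea:

```python
class FakeComputeNode:
    """Toy stand-in for a compute node: tracks only RAM."""
    def __init__(self, name, ram_mb, ratio=1.5):
        self.name = name
        self.total = int(ram_mb * ratio)  # apply a RAM allocation ratio
        self.used = 0

    def fits(self, req_mb):
        return self.used + req_mb <= self.total

def schedule(nodes, req_mb):
    """Pick the fake node with the most free RAM that fits the request."""
    candidates = [n for n in nodes if n.fits(req_mb)]
    if not candidates:
        return None  # the NoValidHost case, in real nova terms
    best = max(candidates, key=lambda n: n.total - n.used)
    best.used += req_mb  # consume, roughly as consume_from_request would
    return best

# A large disposable "cloud" with no real compute services at all.
nodes = [FakeComputeNode("node%d" % i, 8192) for i in range(10000)]
placed = sum(1 for _ in range(5000) if schedule(nodes, 512))
# Sanity check: no node ever exceeds its (ratio-adjusted) capacity.
assert all(n.used <= n.total for n in nodes)
```

Scaling the node count here is just a constructor loop, which is what
makes the 10k-node behavior testable without 10k machines.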
>>>>
>>>
>>> This would be a useful effort - however, do not assume that it is
>>> going to be an easy task. Even in the paragraph above, you fail to
>>> take into account that in order to test scheduling you also need to
>>> run all the compute services, since claims work like a kind of
>>> two-phase commit where a scheduling decision gets checked on the
>>> destination compute host (through the Claims logic), which involves
>>> locking in each compute process.
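
The two-phase flow described above can be modeled in isolation. This is
not nova code, just a toy sketch of "scheduler decides from a possibly
stale view, compute re-checks under a lock, a failed claim triggers a
retry" (FakeCompute, ClaimFailure, and build are illustrative names):

```python
import threading

class ClaimFailure(Exception):
    pass

class FakeCompute:
    """Phase 2: the destination host re-checks the decision under a lock."""
    def __init__(self, free_mb):
        self.free_mb = free_mb
        self._lock = threading.Lock()  # per-compute lock, as in real claims

    def claim(self, req_mb):
        with self._lock:
            if req_mb > self.free_mb:
                raise ClaimFailure()  # the scheduler's view was stale
            self.free_mb -= req_mb

def build(computes, req_mb, max_retries=3):
    """Phase 1 (pick a host) plus reschedule on a failed claim."""
    for _ in range(max_retries):
        host = max(computes, key=lambda c: c.free_mb)  # scheduler phase
        try:
            host.claim(req_mb)  # compute phase: the authoritative check
            return host
        except ClaimFailure:
            continue  # retry elsewhere, as nova's retry mechanism does
    return None

hosts = [FakeCompute(1024), FakeCompute(512)]
assert build(hosts, 900) is hosts[0]   # first request claims on hosts[0]
assert build(hosts, 900) is None       # now 124 and 512 MB free: no host fits
```

The point of the sketch is that any scheduler-only test harness has to
fake this second phase too, or it will miss exactly the races being
discussed.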
>>>
>>
>> Yes, the final goal is to test the entire scheduling process, including
>> the two-phase commit. As the scheduler is still in the process of being
>> decoupled, some parts such as the RT and the retry mechanism are highly
>> coupled with nova, so IMO it is not a good idea to include them at this
>> stage. I'll therefore try to isolate the filter scheduler as a first
>> step, and hope to be supported by the community.
>>
>>
>>>>
>>>>
>>>> I'm also interested in the bp [4] to reduce scheduler race
>>>> conditions at the green-thread level. I think it is a good starting
>>>> point for solving the huge racing problem of the nova scheduler, and
>>>> I really wish I could help with that.
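
The idea behind the host-state-level-locking blueprint, as I read it, is
to make the check-and-consume on a shared host state atomic so two
concurrent scheduling greenthreads cannot both consume from the same
stale view. A toy sketch of that (nova runs under eventlet greenthreads;
stdlib threads are used here only to make the race observable, and
HostState.consume is an illustrative name, not the real method):

```python
import threading

class HostState:
    """Shared scheduler view of one host, with its own lock."""
    def __init__(self, free_mb):
        self.free_mb = free_mb
        self.lock = threading.Lock()

    def consume(self, req_mb):
        # Check and update atomically; without the lock, two concurrent
        # callers could both see enough room and both consume it,
        # over-committing the host in the scheduler's view.
        with self.lock:
            if req_mb > self.free_mb:
                return False
            self.free_mb -= req_mb
            return True

host = HostState(free_mb=4096)
results = []
threads = [threading.Thread(target=lambda: results.append(host.consume(1024)))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Exactly 4 of the 8 concurrent requests can succeed; the view never
# goes negative.
assert results.count(True) == 4 and host.free_mb == 0
```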
>>>>
>>>
>>> I proposed said blueprint but am very unlikely to have any time to work on it this
>>> cycle, so feel free to take a stab at it. I'd be more than happy to prioritize any
>>> reviews related to the above BP.
>>>
>>> Thanks for your interest in this
>>>
>>> N.
>>>
>>
>> Many thanks Nikola! I'm still looking at the claim logic and trying to
>> find a way to merge it with the scheduler host state; I will upload
>> patches as soon as I figure it out.
>>
>
> Great!
>
> Note that that step is not necessary - and indeed it may not be the best
> place to start. We already have code duplication between the claims and
> (what has only recently been renamed) consume_from_request, so removing
> it is a nice-to-have but not really directly related to fixing the races.
>
> Also, after Sylvain's work here https://review.openstack.org/#/c/191251/
> it will be trickier to do, as the scheduler side now uses the RequestSpec
> object instead of Instance, which is not sent over to compute nodes.
>
> I'd personally leave that for last.

I would recommend you attend the scheduler sub-team meetings, if at
all possible, or track what is discussed there:
http://eavesdrop.openstack.org/#Nova_Scheduler_Team_Meeting

There is a rough outline of the current direction of the scheduler work:
http://docs.openstack.org/developer/nova/scheduler_evolution.html
As ever, that's a little out of date right now, and doesn't capture all
the discussions around moving claims into the scheduler.

Thanks,
johnthetubaguy

> M.
>
>>
>>>>
>>>>
>>>>
>>>>
>>>> [1] https://bugs.launchpad.net/nova/+bug/1423648
>>>>
>>>> [2] https://bugs.launchpad.net/nova/+bug/1370207
>>>>
>>>> [3] https://bugs.launchpad.net/nova/+bug/1341420
>>>>
>>>> [4]
>>>> https://blueprints.launchpad.net/nova/+spec/host-state-level-locking
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> -Yingxin
>>>>
>>
>>
>>
>> Regards,
>> -Yingxin
>>
>> __________________________________________________________________________
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
>
