[openstack-dev] [nova] blueprint about multiple workers supported in nova-scheduler

Rui Chen chenrui.momo at gmail.com
Fri Mar 6 01:56:37 UTC 2015


Thank you very much for the in-depth discussion about this topic, @Nikola and
@Sylvain.

I agree that we should pay down the technical debt first, and then make the
scheduler better.

Best Regards.

2015-03-05 21:12 GMT+08:00 Sylvain Bauza <sbauza at redhat.com>:

>
> On 05/03/2015 13:00, Nikola Đipanov wrote:
>
>  On 03/04/2015 09:23 AM, Sylvain Bauza wrote:
>>
>>> On 04/03/2015 04:51, Rui Chen wrote:
>>>
>>>> Hi all,
>>>>
>>>> I want to make it easy to launch a bunch of scheduler processes on a
>>>> host. Multiple scheduler workers will make use of the host's multiple
>>>> processors and enhance the performance of nova-scheduler.
>>>>
>>>> I have registered a blueprint and committed a patch to implement it.
>>>> https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support
>>>>
>>>> This patch has been applied in our performance environment and has
>>>> passed some test cases, such as concurrently booting multiple instances;
>>>> so far we have not found any consistency issues.
>>>>
>>>> IMO, nova-scheduler should be easy to scale horizontally, and multiple
>>>> workers should be supported as an out-of-the-box feature.
>>>>
>>>> Please feel free to discuss this feature, thanks.
>>>>
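For anyone who has not looked at the patch, here is a minimal sketch of the
kind of change being discussed: letting nova-scheduler fork several worker
processes, the way nova-conductor and the API services already do. The
"scheduler_workers" option name and the exact wiring are assumptions for
illustration only, not necessarily what the patch itself does.

    # Hypothetical sketch only: fork N scheduler workers from one command.
    import sys

    from oslo_config import cfg

    from nova import config
    from nova import objects
    from nova import service

    CONF = cfg.CONF
    CONF.register_opts([
        cfg.IntOpt('scheduler_workers',
                   help='Number of nova-scheduler worker processes; '
                        'leave unset to keep a single process.'),
    ])


    def main():
        config.parse_args(sys.argv)
        objects.register_all()
        server = service.Service.create(binary='nova-scheduler',
                                        topic='scheduler')
        # workers > 1 makes the process launcher fork that many scheduler
        # workers, each running its own driver, filters and host state.
        service.serve(server, workers=CONF.scheduler_workers)
        service.wait()

Each forked worker keeps its own in-memory view of the hosts, which is
exactly what the rest of this thread is about.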
>>>
>>> As I said when reviewing your patch, I think the problem is not just
>>> making sure that the scheduler is thread-safe; it's more about how the
>>> Scheduler accounts for resources and provides a retry if the consumed
>>> resources exceed what's available.
>>>
>>> Here, the main problem is that two workers can each consume resources
>>> against the same HostState object. In that case, the HostState object
>>> is decremented by the amount of resources taken (leaving aside what
>>> that means for a resource that is not an integer...) for both workers,
>>> but nowhere in that code path does it check whether the resource usage
>>> has been exceeded. As I said, it's not just about adding a semaphore,
>>> it's more about rethinking how the Scheduler manages its resources.
>>>
>>>
>>> That's why I'm -1 on your patch until [1] gets merged. Once this BP is
>>> implemented, we will have a set of classes for managing and consuming
>>> heterogeneous types of resources, so it would be quite easy to provide
>>> a check against them in the consume_from_instance() method.
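
To make that concrete, below is a purely illustrative sketch of the kind of
guard Sylvain describes as missing: today consume_from_instance() only
subtracts from the HostState, it never verifies that the host still has
room. The attribute names loosely follow HostState, but this is a simplified
model (overcommit ratios are ignored), not the real method.

    # Illustrative only: a HostState-like object whose consume method
    # refuses to consume past what its in-memory view says is available.
    class NoValidHost(Exception):
        pass


    class HostState(object):
        def __init__(self, free_ram_mb, free_disk_mb, vcpus_total,
                     vcpus_used=0):
            self.free_ram_mb = free_ram_mb
            self.free_disk_mb = free_disk_mb
            self.vcpus_total = vcpus_total
            self.vcpus_used = vcpus_used

        def consume_from_instance(self, instance):
            ram_mb = instance['memory_mb']
            disk_mb = (instance['root_gb'] + instance['ephemeral_gb']) * 1024
            vcpus = instance['vcpus']

            # The guard that does not exist today: check before consuming.
            if (ram_mb > self.free_ram_mb or
                    disk_mb > self.free_disk_mb or
                    self.vcpus_used + vcpus > self.vcpus_total):
                raise NoValidHost()

            self.free_ram_mb -= ram_mb
            self.free_disk_mb -= disk_mb
            self.vcpus_used += vcpus

Note that even with such a check, each scheduler process holds its own copy
of the HostState, so the guard only protects that one process's view; that
is the race Nikola goes into below.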
>>>
>>  I feel that the above explanation does not give the full picture, in
>> addition to being factually incorrect in several places. I have come to
>> realize that the current behaviour of the scheduler is subtle enough
>> that just reading the code is not enough to understand all the edge
>> cases that can come up. The evidence is that it trips up even people
>> who have spent significant time working on the code.
>>
>> It is also important to consider the design choices in terms of
>> tradeoffs that they were trying to make.
>>
>> So here are some facts about the way Nova does scheduling of instances
>> to compute hosts, considering the amount of resources requested by the
>> flavor (we will try to put the facts into a bigger picture later):
>>
>> * The scheduler receives a request to choose hosts for one or more instances.
>> * Upon every request (_not_ for every instance as there may be several
>> instances in a request) the scheduler learns the state of the resources
>> on all compute nodes from the central DB. This state may be inaccurate
>> (meaning out of date).
>> * Compute resources are updated by each compute host periodically. This
>> is done by updating the host's row in the DB.
>> * The wall-clock time difference between the scheduler deciding to
>> schedule an instance, and the resource consumption being reflected in
>> the data the scheduler learns from the DB can be arbitrarily long (due
>> to load on the compute nodes and latency of message arrival).
>> * To cope with the above, there is a concept of retrying a request
>> that fails on a certain compute node because the scheduling decision
>> was made with data that was stale at the moment of the build; by
>> default we will retry 3 times before giving up.
>> * When running multiple instances, decisions are made in a loop, and an
>> internal in-memory view of the resources gets updated (the widely
>> misunderstood consume_from_instance method is used for this), so as to
>> keep subsequent decisions as accurate as possible. As was described
>> above, this is all thrown away once the request is finished (a
>> simplified sketch of this loop follows right after this list).
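
As a concrete illustration of the loop in that last bullet, here is a
heavily simplified sketch; passes_filters, weigh_host and
get_all_host_states() are placeholders standing in for the filter/weigher
pipeline and the HostManager, not the real method signatures.

    # Simplified sketch of scheduling a multi-instance request: one stale
    # snapshot of all hosts, then per-instance decisions that only update
    # this process's private in-memory view.
    def select_destinations(request_spec, num_instances, host_manager,
                            passes_filters, weigh_host):
        hosts = list(host_manager.get_all_host_states())  # one DB read
        selected = []
        for _ in range(num_instances):
            candidates = [h for h in hosts
                          if passes_filters(h, request_spec)]
            if not candidates:
                raise RuntimeError('No valid host was found')
            best = max(candidates,
                       key=lambda h: weigh_host(h, request_spec))
            # Only this process's HostState is updated, so the next
            # iteration accounts for the instance just placed; other
            # scheduler workers never see this, and it is thrown away
            # once the request is finished.
            best.consume_from_instance(request_spec)
            selected.append(best)
        return selected

The retry mentioned a few bullets up is bounded by the
scheduler_max_attempts option, which defaults to 3.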
>>
>> Now that we understand the above, we can start to consider what changes
>> when we introduce several concurrent scheduler processes.
>>
>> Several cases come to mind:
>> * Concurrent requests will no longer be serialized on reading the state
>> of all hosts (due to how eventlet interacts with the mysql driver).
>> * In the presence of a single request for a large number of instances,
>> there is going to be a drift in the accuracy of the decisions made by
>> other schedulers, as they will not have accounted for any of those
>> instances until the instances actually get claimed on their respective
>> hosts (illustrated just below).
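
A toy illustration of the drift in that second bullet, with made-up numbers:
two scheduler workers read the same host row, each keeps a private copy of
the free RAM, and both conclude that a 2 GB instance fits even though the
real host can only take one.

    # Made-up numbers, for illustration only.
    db_row_free_ram_mb = 2048      # what both workers read from the DB
    requested_ram_mb = 2048        # flavor of the instance being booted

    worker_a_free = db_row_free_ram_mb  # worker A's in-memory copy
    worker_b_free = db_row_free_ram_mb  # worker B's in-memory copy

    assert requested_ram_mb <= worker_a_free   # A places the instance
    assert requested_ram_mb <= worker_b_free   # B, unaware of A, does too

    # Both builds reach the compute node; the second claim fails there and
    # falls back to the retry mechanism described earlier.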
>>
>> All of the above limitations will likely not pose a problem under normal
>> load and usage, but issues can start appearing when nodes are close to
>> full or when there is heavy load. Also, this changes drastically based
>> on how we actually choose to utilize hosts (see a very interesting
>> Ironic bug [1]).
>>
>> Whether any of the above matters to users depends heavily on their
>> use case, though. This is why I feel we should be providing more
>> information.
>>
>> Finally - I think it is important to accept that the scheduler service
>> will always have to operate under the assumption of stale data, and
>> build for that. Based on that, I'd be happy to see real work go into
>> making multiple schedulers work well enough for most common use cases
>> while providing a way forward for people who need tighter bounds on the
>> feedback loop.
>>
>> N.
>>
>
> Agreed 100% with everything in your email above. Thanks Nikola for taking
> the time to explain how the Scheduler works; that's (btw.) something I hope
> to be presenting at the Vancouver Summit if my proposal is accepted.
>
> That said, I hope my reviewers will understand that I would want to see
> the Scheduler split out into a separate repo first, before working on
> fixing the race conditions you mentioned above. Yes, I know, it's
> difficult to accept some limitations of the Nova scheduler while many
> customers would want them fixed, but here we have so many technical debt
> issues that I think we should really work on the split itself (like we
> did for Kilo and will hopefully continue for Liberty) and then discuss
> the new design after that.
>
> -Sylvain
>
>
>
>  [1] https://bugs.launchpad.net/nova/+bug/1341420
>>