[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

Sylvain Bauza sbauza at redhat.com
Wed Feb 24 09:37:07 UTC 2016

On 24/02/2016 01:10, Jay Pipes wrote:
> On 02/22/2016 04:23 AM, Sylvain Bauza wrote:
>> I won't argue against performance here. You made a very nice PoC for
>> testing scaling DB writes within a single python process and I trust
>> your findings. While I would be naturally preferring some shared-nothing
>> approach that can horizontally scale, one could mention that we can do
>> the same with Galera clusters.
> a) My benchmarks aren't single process comparisons. They are 
> multi-process benchmarks.

Sorry, I was unclear. I meant that I trust your benchmarks of the
read/write performance of the models you proposed. 'Single process'
was meant to say that you provide a python script for that, but sure,
it can be run as multiple processes.
Again, I have no doubts about your findings. TBH, I literally had no
time to resurrect my testbed for that, but I'll play with your scripts
as soon as I have time, it's one of my priorities.

> b) The approach I've taken is indeed shared-nothing. The scheduler 
> processes do not share any data whatsoever.
> c) Galera isn't horizontally scalable. Never was, never will be. That 
> isn't its strong-suit. Galera is best for having a 
> synchronously-replicated database cluster that is incredibly easy to 
> manage and administer but it isn't a performance panacea. Its focus 
> is on availability not performance :)

Fair and valid point. I mentioned Galera only as an example: operators
can try to horizontally scale the DBMS if they need to, as opposed to
the existing model where they scale the number of conductors while the
compute nodes keep ownership of their resources.

>> That said, most of the operators run a controller/compute situation
>> where all the services but the compute node are hosted on 1:N hosts.
>> Implementing the resource-providers-scheduler BP (and only that one)
>> will dramatically increase the number of writes we do on the scheduler
>> process (ie. on the "controller" - quoting because there is no notion of
>> a "controller" in Nova, it's just a deployment choice).
> Yup, no doubt about it. It won't increase the *total* number of writes 
> the system makes, just the concentration of those writes into the 
> scheduler processes. You are trading increased writes in the scheduler 
> for the challenges inherent in keeping a large distributed cache 
> system valid and fresh (which itself introduces a different kind of 
> writes).

No, it will increase the number of writes, because you're planning to
write every time a request comes in, right?
If so, it scales per request, compared to the existing model where
computes write their data every 60 secs, so it scales per compute.

Since the number of requests is at least one order of magnitude higher
than the number of computes, my gut feeling is that it will write to
the DB far more often.
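
To make that comparison concrete, here is a rough back-of-the-envelope
sketch. The 1600 compute nodes are borrowed from the simulation Jay
mentions further down; the request rate (ten times the number of
computes per minute) and the single write per request are assumptions
made up for illustration, not measurements.

```python
def periodic_writes_per_minute(num_computes, interval_secs=60):
    """Existing model: each compute writes its state once per interval."""
    return num_computes * 60 / interval_secs


def per_request_writes_per_minute(requests_per_minute, writes_per_request=1):
    """Claim-on-request model: each incoming request triggers DB writes.

    writes_per_request=1 is an assumed value, not taken from any spec.
    """
    return requests_per_minute * writes_per_request


if __name__ == "__main__":
    computes = 1600                # from the simulation quoted in this thread
    requests = 10 * computes       # ASSUMPTION: one order of magnitude more
    legacy = periodic_writes_per_minute(computes)
    claims = per_request_writes_per_minute(requests)
    print("periodic model:    %d writes/min" % legacy)
    print("per-request model: %d writes/min" % claims)
    print("ratio: %.1fx" % (claims / legacy))
```

With those assumed numbers the write rate follows the request rate
rather than the node count, which is the only point being made here.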

>> That's a big game changer for operators who are currently capping their
>> capacity by adding more conductors. It would require them to do some DB
>> modifications to be able to scale their capacity. I'm not against that,
>> I just say it's a big thing that we need to consider and properly
>> communicate if agreed.
> Agreed completely. I will say, however, that on a 1600 compute node 
> simulation (~60K variably-sized instances), an untuned stock MySQL 5.6 
> database with 128MB InnoDB buffer pool size barely breaks a sweat on 
> my local machine.

Nice, that makes me very comfortable with your approach from a 
performance PoV :-)
Great work btw.

>>>> It can be alleviated by changing to a stand-alone high
>>>> performance database.
>>> It doesn't need to be high-performance at all. In my benchmarks, a
>>> small-sized stock MySQL database server is able to fulfill thousands
>>> of placement queries and claim transactions per minute using
>>> completely isolated non-shared, non-caching scheduler processes.
>>>> And the cache refreshing is designed to be
>>>> replaced by direct SQL queries according to resource-provider
>>>> scheduler spec [2].
>>> Yes, this is correct.
>>>> The performance bottleneck of shared-state scheduler
>>>> may come from the overwhelming update messages, it can also be
>>>> alleviated by changing to stand-alone distributed message queue and by
>>>> using the “MessagePipe” to merge messages.
>>> In terms of the number of messages used in each design, I see the
>>> following relationship:
>>> resource-providers < legacy < shared-state-scheduler
>>> would you agree with that?
>> True. But that's manageable by adding more conductors, right ? IMHO,
>> Nova performance is bound by the number of conductors you run and I like
>> that - because that's easy to increase the capacity.
>> Also, the payload could be far smaller from the existing : instead of
>> sending a full update for a single compute_node entry, it would only
>> send the diff (+ some full syncs periodically). We would then mitigate
>> the messages increase by making sure we're sending less per message.
> No message sent is better than sending any message, regardless of 
> whether that message contains an incremental update or a full object.

Well, it's a tautology :-)
I just want to explain that an "eventually-consistent" scheduler is
manageable with the same scaling factor that we currently have
(ie. the number of conductors).
>>> The resource-providers proposal actually uses no update messages at
>>> all (except in the abnormal case of a compute node failing to start
>>> the resources that had previously been claimed by the scheduler). All
>>> updates are done in a single database transaction when the claim is 
>>> made.
>> See, I don't think that a compute node unable to start a request is an
>> 'abnormal case'. There are many reasons why a request can't be honored
>> by the compute node :
>>   - for example, the scheduler doesn't own all the compute resources and
>> thus can miss some information : for example, say that you want to pin a
>> specific pCPU and this pCPU is already assigned. The scheduler doesn't
>> know *which* pCPUs are free, it only knows *how much* are free
>> That atomic transaction (pick a free pCPU and assign it to the instance)
>> is made on the compute manager not at the exact same time we're
>> decreasing resource usage for pCPUs (because it would be done in the
>> scheduler process).
> See my response to Chris Friesen about the above.
>>   - some "reserved" RAM or disk could be underestimated and
>> consequently, spawning a VM could be either taking far more time than
>> planned (which would mean that it would be a suboptimal placement) or it
>> would fail which would issue a reschedule.
> Again, the above is an abnormal case.
> <snip>
>> It's a distributed problem. Neither the shared-state scheduler can have
>> an accurate view (because it will only guarantee that it will
>> *eventually* be consistent), nor the scheduler process can have an
>> accurate view (because the scheduler doesn't own the resource usage that
>> is made by compute nodes)
> If the scheduler claimed the resources on the compute node, it *would* 
> own those resources, though, and therefore it *would* have a 99.9% 
> accurate view of the inventory of the system.
> It's not a distributed problem if you don't make it a distributed 
> problem.
> There's nothing wrong with the compute node having the final "say" 
> about if some claimed-by-the-scheduler resources truly cannot be 
> provided by the compute node, but again, that is a) an abnormal 
> operating condition and b) simply would trigger a notification to the 
> scheduler to free said resources.

So, I get your point. I just need to think more about whether that's
an abnormal and thus exceptional case, or whether it's part of the
design, because that's where the truth is.
I don't know yet, I just mean that this can't be easily dismissed by
saying "meh, it's insignificant".
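
To illustrate the flow we are discussing, here is a minimal sketch of
a scheduler-side claim done in one DB transaction, with the compute
node still having the final say on pinning. Everything in it (the
inventory table, the function names, the use of sqlite3) is
hypothetical and only meant to show the shape of the problem, it is
not Nova code nor the resource-providers schema.

```python
import sqlite3


def make_inventory(host, free_pcpus):
    """Create a toy inventory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory (host TEXT PRIMARY KEY, free_pcpus INTEGER)")
    conn.execute("INSERT INTO inventory VALUES (?, ?)", (host, free_pcpus))
    conn.commit()
    return conn


def claim(conn, host, pcpus):
    """Scheduler side: decrement usage only if enough pCPUs are free."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE inventory SET free_pcpus = free_pcpus - ? "
            "WHERE host = ? AND free_pcpus >= ?", (pcpus, host, pcpus))
        return cur.rowcount == 1


def release(conn, host, pcpus):
    """Give the resources back when the compute node rejects the claim."""
    with conn:
        conn.execute(
            "UPDATE inventory SET free_pcpus = free_pcpus + ? WHERE host = ?",
            (pcpus, host))


def free_count(conn, host):
    row = conn.execute(
        "SELECT free_pcpus FROM inventory WHERE host = ?", (host,)).fetchone()
    return row[0]


def spawn_pinned(conn, host, wanted_pcpu, free_pcpu_ids):
    """The scheduler only knows how many pCPUs are free; the compute node
    knows which ones (free_pcpu_ids) and may reject the request."""
    if not claim(conn, host, 1):
        return "no capacity"
    if wanted_pcpu not in free_pcpu_ids:
        release(conn, host, 1)
        return "rejected by compute"
    free_pcpu_ids.discard(wanted_pcpu)
    return "pinned"
```

How often the "rejected by compute" branch is taken in a real
deployment is exactly the open question above.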

>>>> 4. Design goal difference:
>>>> The fundamental design goal of the two new schedulers is different.
>>>> Copy my views from [2], I think it is the choice between “the loose
>>>> distributed consistency with retries” and “the strict centralized
>>>> consistency with locks”.
>>> There are a couple other things that I believe we should be
>>> documenting, considering and measuring with regards to scheduler 
>>> designs:
>>> a) Debuggability
>>> The ability of a system to be debugged and for requests to that system
>>> to be diagnosed is a critical component to the benefit of a particular
>>> system design. I'm hoping that by removing a lot of the moving parts
>>> from the legacy filter scheduler design (removing the caching,
>>> removing the Python-side filtering and weighing, removing the interval
>>> between which placement decisions can conflict, removing the cost and
>>> frequency of retry operations) that the resource-provider scheduler
>>> design will become simpler for operators to use.
>> Keep in mind that scheduler filters are a very handy pluggable system
>> for operators wanting to implement their own placement logic. If you
>> want to deprecate filters (and I'm not opinionated on that), just make
>> sure that you keep that extension capability.
> Understood, and I will do my best to provide some level of extensibility.

I appreciate it. One thing that is unclear to me: our extensible model
runs per host, not as a full clause when getting the list of compute
nodes. Knowing what will be part of the initial filtering clause and
what will be kept as filters would be greatly appreciated.

>>> b) Simplicity
>>> Goes to the above point about debuggability, but I've always tried to
>>> follow the mantra that the best software design is not when you've
>>> added the last piece to it, but rather when you've removed the last
>>> piece from it and still have a functioning and performant system.
>>> Having a scheduler that can tackle the process of tracking resources,
>>> deciding on placement, and claiming those resources instead of playing
>>> an intricate dance of keeping state caches valid will, IMHO, lead to a
>>> better scheduler.
>> I'm also very concerned by the upgrade path for both proposals like I
>> said before. I haven't yet seen how either of those 2 proposals are
>> changing the existing FilterScheduler and what is subject to change.
>> Also, please keep in mind that we support rolling upgrades, so we need
>> to support old compute nodes.
> Nothing about any of the resource-providers specs inhibits rolling 
> upgrades. In fact, two of the specs (resource-providers-allocations 
> and compute-node-inventory) are *only* about how to do online data 
> migration from the legacy DB schema to the resource-providers schema. :)

Sorry, my concern about rolling upgrades was not aimed at your
proposal, but rather at the 'eventually-consistent' BP.
My first sentence, about what will change in the FilterScheduler, is
related to my previous comment: I'm unclear about what will change in
the scheduler, and what won't.

From my understanding, your proposal is to refine the existing dummy
ComputeNodeList.get_all() call we make in the HostManager, and instead
provide an object call which would eventually return only the compute
nodes that have enough disk, RAM and CPU.
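
As I understand the difference, it could be sketched like this. The
table, the columns and both helper functions below are invented for
the example (they are neither the Nova schema nor the Nova objects
API); only the contrast between "get everything, then filter per host
in Python" and "filter in the query itself" matters.

```python
import sqlite3


def make_db():
    """Toy compute_nodes table with made-up columns and values."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compute_nodes (host TEXT PRIMARY KEY, "
        "free_ram_mb INTEGER, free_disk_gb INTEGER, free_vcpus INTEGER)")
    conn.executemany(
        "INSERT INTO compute_nodes VALUES (?, ?, ?, ?)",
        [("node1", 2048, 100, 4),
         ("node2", 512, 500, 8),
         ("node3", 8192, 20, 2),
         ("node4", 4096, 200, 1)])
    conn.commit()
    return conn


def get_all_then_filter(conn, ram_mb, disk_gb, vcpus):
    """Current style: fetch every node, then apply per-host checks."""
    rows = conn.execute(
        "SELECT host, free_ram_mb, free_disk_gb, free_vcpus "
        "FROM compute_nodes ORDER BY host").fetchall()
    return [host for host, ram, disk, cpus in rows
            if ram >= ram_mb and disk >= disk_gb and cpus >= vcpus]


def get_by_capacity(conn, ram_mb, disk_gb, vcpus):
    """Proposed style: the capacity checks become part of the query."""
    rows = conn.execute(
        "SELECT host FROM compute_nodes "
        "WHERE free_ram_mb >= ? AND free_disk_gb >= ? AND free_vcpus >= ? "
        "ORDER BY host", (ram_mb, disk_gb, vcpus)).fetchall()
    return [row[0] for row in rows]
```

Both calls return the same hosts; what changes is where the RAM, disk
and CPU checks are evaluated.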

In that case, I suppose you plan to deprecate RAMFilter, DiskFilter and
CoreFilter. Fair enough, but are you also planning to deprecate
AggregateRAMFilter, AggregateCoreFilter and other filters like the NUMA
ones?


> Best,
> -jay
