[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

Jay Pipes jaypipes at gmail.com
Wed Feb 24 00:10:46 UTC 2016

On 02/22/2016 04:23 AM, Sylvain Bauza wrote:
> I won't argue against performance here. You made a very nice PoC for
> testing scaling DB writes within a single python process and I trust
> your findings. While I would be naturally preferring some shared-nothing
> approach that can horizontally scale, one could mention that we can do
> the same with Galera clusters.

a) My benchmarks aren't single process comparisons. They are 
multi-process benchmarks.

b) The approach I've taken is indeed shared-nothing. The scheduler 
processes do not share any data whatsoever.

c) Galera isn't horizontally scalable. Never was, never will be. That 
isn't its strong-suit. Galera is best for having a 
synchronously-replicated database cluster that is incredibly easy to 
manage and administer but it isn't a performance panacea. It's focus is 
on availability not performance :)

> That said, most of the operators run a controller/compute situation
> where all the services but the compute node are hosted on 1:N hosts.
> Implementing the resource-providers-scheduler BP (and only that one)
> will dramatically increase the number of writes we do on the scheduler
> process (ie. on the "controller" - quoting because there is no notion of
> a "controller" in Nova, it's just a deployment choice).

Yup, no doubt about it. It won't increase the *total* number of writes 
the system makes, just the concentration of those writes into the 
scheduler processes. You are trading increased writes in the scheduler 
for the challenges inherent in keeping a large distributed cache system 
valid and fresh (which itself introduces a different kind of writes).

> That's a big game changer for operators who are currently capping their
> capacity by adding more conductors. It would require them to do some DB
> modifications to be able to scale their capacity. I'm not against that,
> I just say it's a big thing that we need to consider and properly
> communicate if agreed.

Agreed completely. I will say, however, that on a 1600 compute node 
simulation (~60K variably-sized instances), an untuned stock MySQL 5.6 
database with 128MB InnoDB buffer pool size barely breaks a sweat on my 
local machine.

>> > It can be alleviated by changing to a stand-alone high
>>>  performance database.
>> It doesn't need to be high-performance at all. In my benchmarks, a
>> small-sized stock MySQL database server is able to fulfill thousands
>> of placement queries and claim transactions per minute using
>> completely isolated non-shared, non-caching scheduler processes.
>> > And the cache refreshing is designed to be
>>> replaced by to direct SQL queries according to resource-provider
>>> scheduler spec [2].
>> Yes, this is correct.
>> > The performance bottlehead of shared-state scheduler
>>> may come from the overwhelming update messages, it can also be
>>> alleviated by changing to stand-alone distributed message queue and by
>>> using the “MessagePipe” to merge messages.
>> In terms of the number of messages used in each design, I see the
>> following relationship:
>> resource-providers < legacy < shared-state-scheduler
>> would you agree with that?
> True. But that's manageable by adding more conductors, right ? IMHO,
> Nova performance is bound by the number of conductors you run and I like
> that - because that's easy to increase the capacity.
> Also, the payload could be far smaller from the existing : instead of
> sending a full update for a single compute_node entry, it would only
> send the diff (+ some full syncs periodically). We would then mitigate
> the messages increase by making sure we're sending less per message.

No message sent is better than sending any message, regardless of 
whether that message contains an incremental update or a full object.

>> The resource-providers proposal actually uses no update messages at
>> all (except in the abnormal case of a compute node failing to start
>> the resources that had previously been claimed by the scheduler). All
>> updates are done in a single database transaction when the claim is made.
> See, I don't think that a compute node unable to start a request is an
> 'abnormal case'. There are many reasons why a request can't be honored
> by the compute node :
>   - for example, the scheduler doesn't own all the compute resources and
> thus can miss some information : for example, say that you want to pin a
> specific pCPU and this pCPU is already assigned. The scheduler doesn't
> know *which* pCPUs are free, it only knows *how much* are free
> That atomic transaction (pick a free pCPU and assign it to the instance)
> is made on the compute manager not at the exact same time we're
> decreasing resource usage for pCPUs (because it would be done in the
> scheduler process).

See my response to Chris Friesen about the above.

>   - some "reserved" RAM or disk could be underestimated and
> consequently, spawning a VM could be either taking fare more time than
> planned (which would mean that it would be a suboptimal placement) or it
> would fail which would issue a reschedule.

Again, the above is an abnormal case.


> It's a distributed problem. Neither the shared-state scheduler can have
> an accurate view (because it will only guarantee that it will
> *eventually* be consistent), nor the scheduler process can have an
> accurate view (because the scheduler doesn't own the resource usage that
> is made by compute nodes)

If the scheduler claimed the resources on the compute node, it *would* 
own those resources, though, and therefore it *would* have a 99.9% 
accurate view of the inventory of the system.

It's not a distributed problem if you don't make it a distributed problem.

There's nothing wrong with the compute node having the final "say" about 
if some claimed-by-the-scheduler resources truly cannot be provided by 
the compute node, but again, that is a) an abnormal operating condition 
and b) simply would trigger a notification to the scheduler to free said 

>>> 4. Design goal difference:
>>> The fundamental design goal of the two new schedulers is different. Copy
>>> my views from [2], I think it is the choice between “the loose
>>> distributed consistency with retries” and “the strict centralized
>>> consistency with locks”.
>> There are a couple other things that I believe we should be
>> documenting, considering and measuring with regards to scheduler designs:
>> a) Debuggability
>> The ability of a system to be debugged and for requests to that system
>> to be diagnosed is a critical component to the benefit of a particular
>> system design. I'm hoping that by removing a lot of the moving parts
>> from the legacy filter scheduler design (removing the caching,
>> removing the Python-side filtering and weighing, removing the interval
>> between which placement decisions can conflict, removing the cost and
>> frequency of retry operations) that the resource-provider scheduler
>> design will become simpler for operators to use.
> Keep in mind that scheduler filters are a very handy pluggable system
> for operators wanting to implement their own placement logic. If you
> want to deprecate filters (and I'm not opiniated on that), just make
> sure that you keep that extension capability.

Understood, and I will do my best to provide some level of extensibility.

>> b) Simplicity
>> Goes to the above point about debuggability, but I've always tried to
>> follow the mantra that the best software design is not when you've
>> added the last piece to it, but rather when you've removed the last
>> piece from it and still have a functioning and performant system.
>> Having a scheduler that can tackle the process of tracking resources,
>> deciding on placement, and claiming those resources instead of playing
>> an intricate dance of keeping state caches valid will, IMHO, lead to a
>> better scheduler.
> I'm also very concerned by the upgrade path for both proposals like I
> said before. I haven't yet seen how either of those 2 proposals are
> changing the existing FilterScheduler and what is subject to change.
> Also, please keep in mind that we support rolling upgrades, so we need
> to support old compute nodes.

Nothing about any of the resource-providers specs inhibits rolling 
upgrades. In fact, two of the specs (resource-providers-allocations and 
compute-node-inventory) are *only* about how to do online data migration 
from the legacy DB schema to the resource-providers schema. :)


More information about the OpenStack-dev mailing list