[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"
yingxin.cheng at intel.com
Wed Feb 24 09:03:45 UTC 2016
Very sorry for the delay, it feels hard for me to reply to all the concerns raised, most of you have years more experiences. I've tried hard to present that there is a direction to solve the issues of existing filter scheduler in multiple areas including performance, decision accuracy, multiple scheduler support and race condition. I'll also support any other solutions if they can solve the same issue elegantly.
I feel that scheduler team will agree with the design that can fulfill thousands of placement queries with thousands of nodes. But as a scheduler that will be splitted out to be used in wider areas, it's not simple to predict the that requirement, so I'm not agree with the statement "It doesn't need to be high-performance at all". There is no system that can be existed without performance bottleneck, including resource-provider scheduler and shared-state scheduler. I was trying to point out where is the potential bottleneck in each design and how to improve if the worst thing happens, quote:
"The performance bottleneck of resource-provider and legacy scheduler is from the centralized db (REMOVED: and scheduler cache refreshing). It can be alleviated by changing to a stand-alone high performance database. And the cache refreshing is designed to be replaced by to direct SQL queries according to resource-provider scheduler spec. The performance bottleneck of shared-state scheduler may come from the overwhelming update messages, it can also be alleviated by changing to stand-alone distributed message queue and by using the "MessagePipe" to merge messages."
I'm not saying that there is a bottleneck of resource-provider scheduler in fulfilling current design goal. The ability of resource-provider scheduler is already proven by a nice modeling tool implemented by Jay, I trust it. But I care more about the actual limit of each design and how easily they can be extended to increase that limit. That's why I turned to make efforts in scheduler functional test framework(https://review.openstack.org/#/c/281825/). I finally want to test scheduler functionality using greenthreads in the gate, and test the performance and placement of each design using real processes. And I hope the scheduler can open to both centralized and distributed design.
I've updated my understanding of three designs: https://docs.google.com/document/d/1iNUkmnfEH3bJHDSenmwE4A1Sgm3vnqa6oZJWtTumAkw/edit?usp=sharing
The "cache updates" arrows are changed to "resource updates" in resource-provider scheduler, because I think resource updates from virt driver are still needed to be updated to the central database. Hope this time it's right.
As the first step towards shared-state scheduler, the required changes are kept at minimum. There are no db modifications needed, so no rolling upgrade issues in data migration. The new scheduler can decide not to make decisions to old compute nodes, or try to refresh host states from db and use legacy way to make decisions until all the compute nodes are upgraded.
I have to admit that my prototype still lack efforts to deal with overwhelming messages. This design works best using the distributed message queue. Also if we try to initiate multiple scheduler processes/workers in a single host, there are a lot more to be done to reduce update messages between compute nodes and scheduler workers. But I see the potential of distributed resource management/scheduling and would like to make efforts in this direction.
If we are agreed that the final decision accuracy are guaranteed in both directions, we should care more about the final decision throughput of both design. Theoretically it is better because the final consumptions are made distributedly, but there exists difficulties in reaching that limit. However, the centralized design is easier to approach its theoretical performance because of the lightweight implementation inside scheduler and the powerful underlying database.
> -----Original Message-----
> From: Jay Pipes [mailto:jaypipes at gmail.com]
> Sent: Wednesday, February 24, 2016 8:11 AM
> To: Sylvain Bauza <sbauza at redhat.com>; OpenStack Development Mailing List
> (not for usage questions) <openstack-dev at lists.openstack.org>; Cheng, Yingxin
> <yingxin.cheng at intel.com>
> Subject: Re: [openstack-dev] [nova] A prototype implementation towards the
> "shared state scheduler"
> On 02/22/2016 04:23 AM, Sylvain Bauza wrote:
> > I won't argue against performance here. You made a very nice PoC for
> > testing scaling DB writes within a single python process and I trust
> > your findings. While I would be naturally preferring some
> > shared-nothing approach that can horizontally scale, one could mention
> > that we can do the same with Galera clusters.
> a) My benchmarks aren't single process comparisons. They are multi-process
> b) The approach I've taken is indeed shared-nothing. The scheduler processes do
> not share any data whatsoever.
> c) Galera isn't horizontally scalable. Never was, never will be. That isn't its
> strong-suit. Galera is best for having a synchronously-replicated database cluster
> that is incredibly easy to manage and administer but it isn't a performance
> panacea. It's focus is on availability not performance :)
> > That said, most of the operators run a controller/compute situation
> > where all the services but the compute node are hosted on 1:N hosts.
> > Implementing the resource-providers-scheduler BP (and only that one)
> > will dramatically increase the number of writes we do on the scheduler
> > process (ie. on the "controller" - quoting because there is no notion
> > of a "controller" in Nova, it's just a deployment choice).
> Yup, no doubt about it. It won't increase the *total* number of writes the
> system makes, just the concentration of those writes into the scheduler
> processes. You are trading increased writes in the scheduler for the challenges
> inherent in keeping a large distributed cache system valid and fresh (which itself
> introduces a different kind of writes).
> > That's a big game changer for operators who are currently capping
> > their capacity by adding more conductors. It would require them to do
> > some DB modifications to be able to scale their capacity. I'm not
> > against that, I just say it's a big thing that we need to consider and
> > properly communicate if agreed.
> Agreed completely. I will say, however, that on a 1600 compute node simulation
> (~60K variably-sized instances), an untuned stock MySQL 5.6 database with
> 128MB InnoDB buffer pool size barely breaks a sweat on my local machine.
> >> > It can be alleviated by changing to a stand-alone high
> >>> performance database.
> >> It doesn't need to be high-performance at all. In my benchmarks, a
> >> small-sized stock MySQL database server is able to fulfill thousands
> >> of placement queries and claim transactions per minute using
> >> completely isolated non-shared, non-caching scheduler processes.
> >> > And the cache refreshing is designed to be
> >>> replaced by to direct SQL queries according to resource-provider
> >>> scheduler spec .
> >> Yes, this is correct.
> >> > The performance bottlehead of shared-state scheduler
> >>> may come from the overwhelming update messages, it can also be
> >>> alleviated by changing to stand-alone distributed message queue and
> >>> by using the "MessagePipe" to merge messages.
> >> In terms of the number of messages used in each design, I see the
> >> following relationship:
> >> resource-providers < legacy < shared-state-scheduler
> >> would you agree with that?
> > True. But that's manageable by adding more conductors, right ? IMHO,
> > Nova performance is bound by the number of conductors you run and I
> > like that - because that's easy to increase the capacity.
> > Also, the payload could be far smaller from the existing : instead of
> > sending a full update for a single compute_node entry, it would only
> > send the diff (+ some full syncs periodically). We would then mitigate
> > the messages increase by making sure we're sending less per message.
> No message sent is better than sending any message, regardless of whether that
> message contains an incremental update or a full object.
> >> The resource-providers proposal actually uses no update messages at
> >> all (except in the abnormal case of a compute node failing to start
> >> the resources that had previously been claimed by the scheduler). All
> >> updates are done in a single database transaction when the claim is made.
> > See, I don't think that a compute node unable to start a request is an
> > 'abnormal case'. There are many reasons why a request can't be honored
> > by the compute node :
> > - for example, the scheduler doesn't own all the compute resources
> > and thus can miss some information : for example, say that you want to
> > pin a specific pCPU and this pCPU is already assigned. The scheduler
> > doesn't know *which* pCPUs are free, it only knows *how much* are free
> > That atomic transaction (pick a free pCPU and assign it to the
> > instance) is made on the compute manager not at the exact same time
> > we're decreasing resource usage for pCPUs (because it would be done in
> > the scheduler process).
> See my response to Chris Friesen about the above.
> > - some "reserved" RAM or disk could be underestimated and
> > consequently, spawning a VM could be either taking fare more time than
> > planned (which would mean that it would be a suboptimal placement) or
> > it would fail which would issue a reschedule.
> Again, the above is an abnormal case.
> > It's a distributed problem. Neither the shared-state scheduler can
> > have an accurate view (because it will only guarantee that it will
> > *eventually* be consistent), nor the scheduler process can have an
> > accurate view (because the scheduler doesn't own the resource usage
> > that is made by compute nodes)
> If the scheduler claimed the resources on the compute node, it *would* own
> those resources, though, and therefore it *would* have a 99.9% accurate view
> of the inventory of the system.
> It's not a distributed problem if you don't make it a distributed problem.
> There's nothing wrong with the compute node having the final "say" about if
> some claimed-by-the-scheduler resources truly cannot be provided by the
> compute node, but again, that is a) an abnormal operating condition and b)
> simply would trigger a notification to the scheduler to free said resources.
> >>> 4. Design goal difference:
> >>> The fundamental design goal of the two new schedulers is different.
> >>> Copy my views from , I think it is the choice between "the loose
> >>> distributed consistency with retries" and "the strict centralized
> >>> consistency with locks".
> >> There are a couple other things that I believe we should be
> >> documenting, considering and measuring with regards to scheduler designs:
> >> a) Debuggability
> >> The ability of a system to be debugged and for requests to that
> >> system to be diagnosed is a critical component to the benefit of a
> >> particular system design. I'm hoping that by removing a lot of the
> >> moving parts from the legacy filter scheduler design (removing the
> >> caching, removing the Python-side filtering and weighing, removing
> >> the interval between which placement decisions can conflict, removing
> >> the cost and frequency of retry operations) that the
> >> resource-provider scheduler design will become simpler for operators to use.
> > Keep in mind that scheduler filters are a very handy pluggable system
> > for operators wanting to implement their own placement logic. If you
> > want to deprecate filters (and I'm not opiniated on that), just make
> > sure that you keep that extension capability.
> Understood, and I will do my best to provide some level of extensibility.
> >> b) Simplicity
> >> Goes to the above point about debuggability, but I've always tried to
> >> follow the mantra that the best software design is not when you've
> >> added the last piece to it, but rather when you've removed the last
> >> piece from it and still have a functioning and performant system.
> >> Having a scheduler that can tackle the process of tracking resources,
> >> deciding on placement, and claiming those resources instead of
> >> playing an intricate dance of keeping state caches valid will, IMHO,
> >> lead to a better scheduler.
> > I'm also very concerned by the upgrade path for both proposals like I
> > said before. I haven't yet seen how either of those 2 proposals are
> > changing the existing FilterScheduler and what is subject to change.
> > Also, please keep in mind that we support rolling upgrades, so we need
> > to support old compute nodes.
> Nothing about any of the resource-providers specs inhibits rolling upgrades. In
> fact, two of the specs (resource-providers-allocations and
> compute-node-inventory) are *only* about how to do online data migration
> from the legacy DB schema to the resource-providers schema. :)
More information about the OpenStack-dev