[openstack-dev] [nova] A prototype implementation towards the "shared state scheduler"

Clint Byrum clint at fewbar.com
Wed Feb 24 07:36:12 UTC 2016

Excerpts from Jay Pipes's message of 2016-02-23 16:10:46 -0800:
> On 02/22/2016 04:23 AM, Sylvain Bauza wrote:
> > I won't argue against performance here. You made a very nice PoC for
> > testing scaling DB writes within a single python process and I trust
> > your findings. While I would be naturally preferring some shared-nothing
> > approach that can horizontally scale, one could mention that we can do
> > the same with Galera clusters.
> a) My benchmarks aren't single process comparisons. They are 
> multi-process benchmarks.
> b) The approach I've taken is indeed shared-nothing. The scheduler 
> processes do not share any data whatsoever.

I think this is a matter of perspective. What I read from Sylvain's
message was that the approach you've taken shares state in a database,
and shares access to all compute nodes.

I also read into Sylvain's comments that he was referring to a system
where the compute nodes divide up the resources and never share
anything at all.

> c) Galera isn't horizontally scalable. Never was, never will be. That 
> isn't its strong suit. Galera is best for having a 
> synchronously-replicated database cluster that is incredibly easy to 
> manage and administer, but it isn't a performance panacea. Its focus is 
> on availability, not performance :)

I also think this is a matter of perspective. Galera is actually
fantastically horizontally scalable in any situation where you have a
very high ratio of reads to writes with a need for consistent reads.

However, for OpenStack's needs, we are typically pretty low on that ratio.
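
To make the ratio argument concrete, here is a rough sketch of a simplified
model, not a measurement. It assumes every node in a synchronous cluster
applies every write, reads are spread evenly across nodes, and a read costs
a node the same as a write. The function name and the model itself are my
own illustration, not anything Galera provides.

```python
def read_scaling_speedup(nodes: int, write_fraction: float) -> float:
    """Throughput of an N-node synchronous cluster relative to one node.

    Assumed model: each node handles all of the writes plus 1/N of the
    reads, so per-node load is w + (1 - w) / N for write fraction w.
    """
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    per_node_load = write_fraction + (1.0 - write_fraction) / nodes
    return 1.0 / per_node_load


if __name__ == "__main__":
    # Compare a read-heavy workload with write-heavier ones.
    for w in (0.01, 0.1, 0.5):
        speedups = [round(read_scaling_speedup(n, w), 2) for n in (1, 3, 5)]
        print(f"write fraction {w}: speedup at 1/3/5 nodes = {speedups}")
```

Under this model a workload that is almost all reads scales close to
linearly with node count, while one that is mostly writes gains almost
nothing from extra nodes.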

> > That said, most of the operators run a controller/compute situation
> > where all the services but the compute node are hosted on 1:N hosts.
> > Implementing the resource-providers-scheduler BP (and only that one)
> > will dramatically increase the number of writes we do on the scheduler
> > process (ie. on the "controller" - quoting because there is no notion of
> > a "controller" in Nova, it's just a deployment choice).
> Yup, no doubt about it. It won't increase the *total* number of writes 
> the system makes, just the concentration of those writes into the 
> scheduler processes. You are trading increased writes in the scheduler 
> for the challenges inherent in keeping a large distributed cache system 
> valid and fresh (which itself introduces a different kind of writes).

Funnily enough, I think of Galera as a large distributed cache that is
always kept valid and fresh. The challenges of doing this for a _busy_
cache are not unique to Galera.

> > That's a big game changer for operators who are currently capping their
> > capacity by adding more conductors. It would require them to do some DB
> > modifications to be able to scale their capacity. I'm not against that,
> > I just say it's a big thing that we need to consider and properly
> > communicate if agreed.
> Agreed completely. I will say, however, that on a 1600 compute node 
> simulation (~60K variably-sized instances), an untuned stock MySQL 5.6 
> database with 128MB InnoDB buffer pool size barely breaks a sweat on my 
> local machine.

That agrees with what I've seen as well. We're talking about tables of
integers for the most part, so even your least expensive SSDs can keep up
with this load for many thousands of computes.
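
A back-of-envelope sketch of why the data set is small. The node count,
instance count, and buffer pool size are the figures quoted above; the
number of resource classes and the bytes per row are assumptions I picked
for illustration, not values from the prototype's schema.

```python
def estimate_table_bytes(rows: int, bytes_per_row: int) -> int:
    """Crude size estimate: row count times an assumed per-row footprint."""
    return rows * bytes_per_row


COMPUTE_NODES = 1600                    # from the simulation quoted above
INSTANCES = 60_000                      # "~60K" instances, quoted above
BUFFER_POOL_BYTES = 128 * 1024 * 1024   # 128MB InnoDB buffer pool

RESOURCE_CLASSES = 3   # assumption: e.g. vCPU, RAM, disk
BYTES_PER_ROW = 100    # assumption: a few integers plus row/index overhead

# One inventory row per node per resource class,
# one allocation row per instance per resource class.
inventory_rows = COMPUTE_NODES * RESOURCE_CLASSES
allocation_rows = INSTANCES * RESOURCE_CLASSES

total_bytes = estimate_table_bytes(inventory_rows + allocation_rows,
                                   BYTES_PER_ROW)
fits_in_buffer_pool = total_bytes <= BUFFER_POOL_BYTES

if __name__ == "__main__":
    print(f"estimated size: {total_bytes / 1e6:.1f} MB, "
          f"fits in buffer pool: {fits_in_buffer_pool}")
```

With these assumed row sizes the whole working set is a fraction of even
a small buffer pool, so most of the work happens in memory.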

I'd also be interested in whether this has the potential to reduce the
demand on the message bus. I've been investigating this for a while, and I
found that RabbitMQ will happily consume 5 high-end CPU cores on a single box
just to serve the needs of 1000 idle compute nodes. I am sorry that I
haven't read enough of the details in your proposal, but doesn't this
mean there'd be quite a bit less load on the MQ if the only messages
sent are direct RPC dispatches and error reports?
