[openstack-dev] [nova] Proposal for an Experiment

Clint Byrum clint at fewbar.com
Mon Jul 20 17:40:05 UTC 2015

Excerpts from Jesse Cook's message of 2015-07-20 07:48:46 -0700:
> On 7/15/15, 9:18 AM, "Ed Leafe" <ed at leafe.com> wrote:
> >
> >Changing the architecture of a complex system such as Nova is never
> >easy, even when we know that the design isn't working as well as we
> >need it to. And it's even more frustrating because when the change is
> >complete, it's hard to know if the improvement, if any, was worth it.
> >
> >So I had an idea: what if we ran a test of that architecture change
> >out-of-tree? In other words, create a separate deployment, and rip out
> >the parts that don't work well, replacing them with an alternative
> >design. There would be no Gerrit reviews or anything that would slow
> >down the work or add load to the already overloaded reviewers. Then we
> >could see if this modified system is a significant-enough improvement
> >to justify investing the time in implementing it in-tree. And, of
> >course, if the test doesn't show what was hoped for, it is scrapped
> >and we start thinking anew.
> +1
> >
> >The important part in this process is defining up front what level of
> >improvement would be needed to make actually making such a change
> >worth considering, and what sort of tests would demonstrate whether or
> >not this level was met. I'd like to discuss such an experiment
> >next week at the Nova mid-cycle.
> >
> >What I'd like to investigate is replacing the current design of having
> >the compute nodes communicating with the scheduler via message queues.
> >This design is overly complex and has several known scalability
> >issues. My thought is to replace this with a Cassandra [1] backend.
> >Compute nodes would update their state to Cassandra whenever they
> >change, and that data would be read by the scheduler to make its host
> >selection. When the scheduler chooses a host, it would post the claim
> >to Cassandra wrapped in a lightweight transaction, which would ensure
> >that no other scheduler has tried to claim those resources. When the
> >host has built the requested VM, it will delete the claim and update
> >Cassandra with its current state.
> >
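[To make the claim flow above concrete: a lightweight transaction is
essentially a compare-and-set, and its effect on racing schedulers can be
sketched with an in-memory stand-in. A dict guarded by a lock plays the
role Paxos plays in real Cassandra LWTs; every name here is illustrative,
not a Nova or DataStax API.]

```python
import threading

class ClaimStore:
    """In-memory stand-in for a Cassandra table written with a
    lightweight transaction (INSERT ... IF NOT EXISTS)."""

    def __init__(self):
        self._claims = {}
        self._lock = threading.Lock()  # stand-in for Paxos coordination

    def try_claim(self, host, request_id):
        """Atomically claim `host` for `request_id`.
        Returns True if the claim was applied, False if another
        scheduler's claim already exists ([applied] = false in CQL)."""
        with self._lock:
            if host in self._claims:
                return False
            self._claims[host] = request_id
            return True

    def release(self, host):
        """Called once the compute node has built the VM: delete the
        claim, then (elsewhere) publish fresh host state."""
        self._claims.pop(host, None)

store = ClaimStore()
# Two schedulers race to claim the same host; exactly one wins.
first = store.try_claim("compute-01", "req-a")
second = store.try_claim("compute-01", "req-b")
# first -> True, second -> False
```

[This compare-and-set is what lets multiple schedulers run concurrently
without extra raciness: the loser simply retries against another host.]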
> >One main motivation for using Cassandra over the current design is
> >that it will enable us to run multiple schedulers without increasing
> >the raciness of the system. Another is that it will greatly simplify a
> >lot of the internal plumbing we've set up to implement in Nova what we
> >would get out of the box with Cassandra. A third is that if this
> >proves to be a success, it would also be able to be used further down
> >the road to simplify inter-cell communication (but this is getting
> >ahead of ourselves...). I've worked with Cassandra before and it has
> >been rock-solid to run and simple to set up. I've also had preliminary
> >technical reviews with the engineers at DataStax [2], the company
> >behind Cassandra, and they agreed that this was a good fit.
> >
> >At this point I'm sure that most of you are filled with thoughts on
> >how this won't work, or how much trouble it will be to switch, or how
> >much more of a pain it will be, or how you hate non-relational DBs, or
> >any of a zillion other negative thoughts. FWIW, I have them too. But
> >instead of ranting, I would ask that we acknowledge for now that:
> Call me an optimist, I think this can work :)
> I would prefer a solution that avoids state management all together and
> instead depends on each individual making rule-based decisions using their
> limited observations of their perceived environment. Of course, this has
> certain emergent behaviors you have to learn from, but on the upside, no
> more braiding state throughout the system. I don't like the assumption
> that it has to be a global state management problem when it doesn't have
> to be. That being said, I'm not opposed to trying a solution like you
> described using Cassandra or something similar. I generally support
> improvements :)

> >
> >a) it will be disruptive and painful to switch something like this at
> >this point in Nova's development
> >b) it would have to provide *significant* improvement to make such a
> >change worthwhile
> >
> >So what I'm asking from all of you is to help define the second part:
> >what we would want improved, and how to measure those benefits. In
> >other words, what results would you have to see in order to make you
> >reconsider your initial "nah, this'll never work" reaction, and start
> >to think that this will be a worthwhile change to make to Nova.
> I'd like to see n build requests within 1 second each be successfully
> scheduled to a host that has spare capacity with only say a total system
> capacity of n * 1.10 where n >= 10000, each cell having ~100 hosts, the
> number of hosts is >= n * 0.10 and <= n * 0.90, and the number of
> schedulers is >= 2.
> For example:
> Build requests: 10000 in 1 second
> Slots for flavor requested: 11000
> Hosts that can build flavor: 7500
> Number of schedulers: 3
> Number of cells: 75 (each with 100 hosts)
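[The constraints above can be written down as a quick sanity check. The
function below is purely illustrative and just restates the arithmetic of
Jesse's example.]

```python
def check_scenario(n, slots, hosts, schedulers, cells, hosts_per_cell):
    """Validate a benchmark scenario against the stated constraints.
    Returns the list of violated constraints (empty means valid)."""
    problems = []
    if n < 10000:
        problems.append("n must be >= 10000")
    if slots != round(n * 1.10):
        problems.append("total system capacity must be n * 1.10")
    if not (n * 0.10 <= hosts <= n * 0.90):
        problems.append("buildable hosts must be in [n*0.10, n*0.90]")
    if schedulers < 2:
        problems.append("need at least 2 schedulers")
    if cells * hosts_per_cell != hosts:
        problems.append("cells * hosts_per_cell must equal host count")
    return problems

# The example scenario from the message satisfies every constraint.
print(check_scenario(10000, 11000, 7500, 3, 75, 100))  # -> []
```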

This is right on, though one thing missing is where the current code
fails this test. It would be great to have the numbers above available
as a baseline so we can denote progress in any experiment.

Also, I'm a little confused why you'd want cells still, but perhaps the
idea is to get the scale of one cell so high, you don't actually ever
want cells, since at that point you should really be building new regions?

To your earlier point about state being abused in the system, I
totally 100% agree. In the past I've wondered a lot if there can be a
worker model, where compute hosts all try to grab work off queues if
they have available resources. So API requests for boot/delete don't
change any state, they just enqueue a message. Queues would be matched
up to resources and the more filter choices, the more queues. Each
time a compute node completed a task (create vm, destroy vm) it would
re-evaluate all of the queues and subscribe to the ones it could satisfy
right now. Quotas would simply be the first stop for the enqueued create
messages, and a final stop for the enqueued delete messages (once it's
done, release quota). If you haven't noticed, this would agree with Robert
Collins's suggestion that something like Kafka is a technology more suited
to this (or my favorite old-often-forgotten solution to this, Gearman ;).

This would have no global dynamic state, and very little local dynamic
state. API, conductor, and compute nodes simply need to know all of the
choices users are offered, and there is no scheduler at runtime, just
a predictive queue-list-manager that only gets updated when choices are
added or removed. This would relieve a ton of the burden currently put
on the database by scheduling since the only accesses would be simple
read/writes (that includes 'server-list' type operations since that
would read a single index key).

Anyway, that's way off track, but I think this kind of thinking needs
to happen and be taken seriously without turning into a bikeshed or fist
fight. I don't think that will happen naturally until we start measuring
where we are, and listening to operators as to where they'd like to be
in relation to that.

So in the interest of ending a long message with actions rather than
words, let's get some measurement going. Rally? Something else? What can
we do to measure this?
