[openstack-dev] [nova] Distributed Database

Clint Byrum clint at fewbar.com
Tue May 3 05:10:37 UTC 2016

Excerpts from Jay Pipes's message of 2016-05-02 10:43:21 -0700:
> On 05/02/2016 11:51 AM, Mike Bayer wrote:
> > On 05/02/2016 07:38 AM, Matthieu Simonin wrote:
> >> As far as we understand the idea of an ORM is to hide the relational
> >> database with an Object oriented API.
> >
> > I actually disagree with that completely.  The reason ORMs are so
> > maligned is because of this misconception; developer attempts to use an
> > ORM so that they will need not have to have any awareness of their
> > database, how queries are constructed, or even its schema's design;
> > witness tools such as Django ORM and Rails ActiveRecord which promise
> > this.   You then end up with an inefficient and unextensible mess
> > because the developers never considered anything about how the database
> > works or how it is queried, nor do they even have easy ways to monitor
> > or control it while still making use of the tool.   There are many blog
> > posts and articles that discuss this and it is in general known as the
> > "object relational impedance mismatch".
> >
> > SQLAlchemy's success comes from its rejection of this entire philosophy.
> >   The purpose of SQLAlchemy's ORM is not to "hide" anything but rather
> > to apply automation to the many aspects of relational database
> > communication as well as row->object mapping that otherwise express
> > themselves in an application as either a large amount of repetitive
> > boilerplate throughout an application or as an awkward series of ad-hoc
> > abstractions that don't really do the job very well.   SQLAlchemy is
> > designed to expose both the schema design as well as the structure of
> > queries completely.   My talk at [1] goes into this topic in detail
> > including specific API architectures that facilitate this concept.
> >
> > It's for that reason that I've always rejected notions of attempting to
> > apply SQLAlchemy directly on top of a datastore that is explicitly
> > non-relational.   By doing so, you remove a vast portion of the
> > functionality that relational databases provide and there's really no
> > point in using a tool like SQLAlchemy that is very explicit about DDL
> > and SQL on top of that kind of database.
> >
> > To effectively put SQLAlchemy on top of a non-relational datastore, what
> > you really want to do is build an entire SQL engine on top of it.  This
> > is actually feasible; I was doing work for the now-defunct FoundationDB
> > (was bought by Apple) who had a very good implementation of
> > SQL-on-top-of-distributed keystore going, and the Cockroach and TiDB
> > projects you mention are definitely the most appropriate choice to take
> > if a certain variety of distribution underneath SQL is desired.
> Well said, Mike, on all points above.
> <snip>
> > But also, w.r.t. Cells there seems to be some remaining debate over why
> > exactly a distributed approach is even needed.  As others have posted, a
> > single MySQL database, replicated across Galera or not, scales just fine
> > for far more data than Nova ever needs to store.  So it's not clear why
> > the need for a dramatic rewrite of its datastore is called for.
> Cells(v1) in Nova *already has* completely isolated DB/MQ for each cell, 
> and there are bunch of 
> duplicated-but-slightly-different-and-impossible-to-maintain code paths 
> in the scheduler and compute manager. Part of the cellsv2 effort is to 
> remove these duplicated code paths.
> Cells are, as much as anything else, an answer to an MQ scaling problem, 
> less so an answer to a DB scaling problem. Having a single MQ bus for 
> tens of thousands of compute nodes is just not tenable -- at least with 
> the message passing patterns and architecture that we use today...
> Finally, Cells also represent a failure domain for the control plane. If 
> a network partition occurs between a cell and the top-level API layer, 
> no other cell is affected by the disruption.
> Now, what does all this mean with regards to whether to use a single 
> distributed database solution versus a single RDBMS versus many isolated 
> RDBMS instances per cell? Not sure. Arguments can be made for all three 
> approaches clearly. Depending on what folks' priorities are with regards 
> to simplicity, scale, and isolation of failure domains, the "right" 
> choice is tough to determine.
> On the one hand, using a single distributed datastore like Cassandra for 
> everything would make things conceptually easy to reason about and make 
> OpenStack clouds much easier to deploy at scale.
> On the other hand, porting things to Cassandra (or any other NoSQL 
> solution) would require a complete overhaul of the way *certain* data is 
> queried in the Nova subsystems. Examples of Cassandra's poor fit for 
> some types of data are quota and resource usage aggregation queries. 
> While Cassandra does support some aggregation via CQL in recent 
> Cassandra versions, Cassandra simply wasn't built for this kind of data 
> access pattern and doing anything reasonably complicated would require 
> using Pig/Hive/Map-Reduce-y stuff, which kind of defeats the simplicity 
> arguments.
> Where queries against resource usage aggregate data are a great fit for 
> a single RDBMS system, doing these queries across multiple RDBMS systems 
> (i.e. in the child cell databases) is a recipe for nightmares and you'd 
> essentially need to implement a Map-Reduce-y type of system to answer 
> these queries and you will lose transactional semantics when you do that.
> This is why the decision was made to bring this set of data into a 
> top-level DB (either the API database or a new broken-out scheduler DB).
> Finally, doing list operations with pagination, sorting, and free-form 
> filters across dozens or more child cell databases is similarly 
> nightmare-inducing. RDBMS (single or not) aren't a great fit for this 
> type of data access pattern. Instead, something like ElasticSearch is a 
> much better fit.
> So, bottom line, there's no simple answer. At this point, my personal 
> preferred solution that I think balances simplicity with the need to 
> have useful and friendly APIs along with an ability to isolate a failure 
> domain as effectively as possible is the following:

I think you've stated the problem accurately and without bias, so I
thank you for that.

However, I don't think the current cells v2 design does enough to isolate
failure domains to warrant its complexity.

> 1) Use a single top-level DB for the API database (mapping tables, 
> request and build spec tables, instance types (flavors) and keypairs. 
> This DB could be an RDBMS or Cassandra, depending on the deployer's 
> preferences around availability guarantees, existing sysadmin/DBA 
> knowledge and other factors.
> 2) Use a single top-level RDBMS for the scheduler database (resource 
> classes, resource providers, resource pools, aggregates, aggregate 
> mapping tables, capabilities, inventories and allocations tables). The 
> querying patterns for this type of data is specifically an excellent fit 
> for an RDBMS and a single RDBMS should be able to service a *very large* 
> (500K+ resource providers) deployment with minimal caching needed to 
> handle quota, resource claim and resource capacity queries.
> 3) Use a single DB in each cell to store information about the instances 
> located in that cell and the instance tasks being performed against 
> those instances. This DB could be either an RDBMS or Cassandra, 
> depending on the deployer's preferences. The key here is to keep the 
> cell a fully isolated failure domain.

Why are we ok with the API database being global, but not the cell
databases? This is an awful lot of trouble to go through to isolate a
tiny section of data that isn't even what users interact with directly
for the most part.

It's really time that we take a long hard look at the way RabbitMQ is
being used, and think about using something else that others have proven,
rather than inventing our own complicated solution to this problem.

> 4) Have all instance listing operations in the Compute API proxy to a 
> call to Project Searchlight and ElasticSearch to handle the cross-cell 
> queries against the instance and instance update (task) information.

Or just keep the instances in a single database, and use regular old
list queries.

Meanwhile we haven't even started talking about how to make sure
elasticsearch would provide the same consistency our users expect from
the API now (or how to transition them to an API that would allow for
some latency in updates for list queries).

> I think the above provides a pretty good balance of using tools that 
> fits the data access patterns properly and keeping the original idea of 
> isolated failure domains a strong priority for the architecture of Nova.

This wouldn't be the worst setup, for sure. However, I think we can do
this more simply if we take a look one layer below Nova.

The future I want to see experimented with:

* Some brokerless RPC system that doesn't need scaling of a stateful
  service at its core. 0mq, direct http, amqp 1.0 brokerless.. all
  options that wouldn't require too much unfamiliar territory.

* Notifications split from RPC so RabbitMQ can do what RabbitMQ is meant

* Central databases for writes, replication, stateless proxies, and/or
  galera-readonlies for reads, no sharding. Extra attention paid to
  failover handling (maybe in those proxies), to make failure domain
  isolation less important.

* Scheduler and resource tracker based on one of the current experiments
  and able to scale out efficiently.

I hope to write something more complete up on my blog as soon as I finish
migrating my blog off my Drizzle-forked Wordpress install. ;)

More information about the OpenStack-dev mailing list