[openstack-dev] [nova] Distributed Database

Jay Pipes jaypipes at gmail.com
Mon May 2 17:43:21 UTC 2016

On 05/02/2016 11:51 AM, Mike Bayer wrote:
> On 05/02/2016 07:38 AM, Matthieu Simonin wrote:
>> As far as we understand the idea of an ORM is to hide the relational
>> database with an Object oriented API.
> I actually disagree with that completely.  The reason ORMs are so
> maligned is because of this misconception; developer attempts to use an
> ORM so that they will need not have to have any awareness of their
> database, how queries are constructed, or even its schema's design;
> witness tools such as Django ORM and Rails ActiveRecord which promise
> this.   You then end up with an inefficient and unextensible mess
> because the developers never considered anything about how the database
> works or how it is queried, nor do they even have easy ways to monitor
> or control it while still making use of the tool.   There are many blog
> posts and articles that discuss this and it is in general known as the
> "object relational impedance mismatch".
> SQLAlchemy's success comes from its rejection of this entire philosophy.
>   The purpose of SQLAlchemy's ORM is not to "hide" anything but rather
> to apply automation to the many aspects of relational database
> communication as well as row->object mapping that otherwise express
> themselves in an application as either a large amount of repetitive
> boilerplate throughout an application or as an awkward series of ad-hoc
> abstractions that don't really do the job very well.   SQLAlchemy is
> designed to expose both the schema design as well as the structure of
> queries completely.   My talk at [1] goes into this topic in detail
> including specific API architectures that facilitate this concept.
> It's for that reason that I've always rejected notions of attempting to
> apply SQLAlchemy directly on top of a datastore that is explicitly
> non-relational.   By doing so, you remove a vast portion of the
> functionality that relational databases provide and there's really no
> point in using a tool like SQLAlchemy that is very explicit about DDL
> and SQL on top of that kind of database.
> To effectively put SQLAlchemy on top of a non-relational datastore, what
> you really want to do is build an entire SQL engine on top of it.  This
> is actually feasible; I was doing work for the now-defunct FoundationDB
> (was bought by Apple) who had a very good implementation of
> SQL-on-top-of-distributed keystore going, and the Cockroach and TiDB
> projects you mention are definitely the most appropriate choice to take
> if a certain variety of distribution underneath SQL is desired.

Well said, Mike, on all points above.


> But also, w.r.t. Cells there seems to be some remaining debate over why
> exactly a distributed approach is even needed.  As others have posted, a
> single MySQL database, replicated across Galera or not, scales just fine
> for far more data than Nova ever needs to store.  So it's not clear why
> the need for a dramatic rewrite of its datastore is called for.

Cells(v1) in Nova *already has* completely isolated DB/MQ for each cell, 
and there are bunch of 
duplicated-but-slightly-different-and-impossible-to-maintain code paths 
in the scheduler and compute manager. Part of the cellsv2 effort is to 
remove these duplicated code paths.

Cells are, as much as anything else, an answer to an MQ scaling problem, 
less so an answer to a DB scaling problem. Having a single MQ bus for 
tens of thousands of compute nodes is just not tenable -- at least with 
the message passing patterns and architecture that we use today...

Finally, Cells also represent a failure domain for the control plane. If 
a network partition occurs between a cell and the top-level API layer, 
no other cell is affected by the disruption.

Now, what does all this mean with regards to whether to use a single 
distributed database solution versus a single RDBMS versus many isolated 
RDBMS instances per cell? Not sure. Arguments can be made for all three 
approaches clearly. Depending on what folks' priorities are with regards 
to simplicity, scale, and isolation of failure domains, the "right" 
choice is tough to determine.

On the one hand, using a single distributed datastore like Cassandra for 
everything would make things conceptually easy to reason about and make 
OpenStack clouds much easier to deploy at scale.

On the other hand, porting things to Cassandra (or any other NoSQL 
solution) would require a complete overhaul of the way *certain* data is 
queried in the Nova subsystems. Examples of Cassandra's poor fit for 
some types of data are quota and resource usage aggregation queries. 
While Cassandra does support some aggregation via CQL in recent 
Cassandra versions, Cassandra simply wasn't built for this kind of data 
access pattern and doing anything reasonably complicated would require 
using Pig/Hive/Map-Reduce-y stuff, which kind of defeats the simplicity 

Where queries against resource usage aggregate data are a great fit for 
a single RDBMS system, doing these queries across multiple RDBMS systems 
(i.e. in the child cell databases) is a recipe for nightmares and you'd 
essentially need to implement a Map-Reduce-y type of system to answer 
these queries and you will lose transactional semantics when you do that.

This is why the decision was made to bring this set of data into a 
top-level DB (either the API database or a new broken-out scheduler DB).

Finally, doing list operations with pagination, sorting, and free-form 
filters across dozens or more child cell databases is similarly 
nightmare-inducing. RDBMS (single or not) aren't a great fit for this 
type of data access pattern. Instead, something like ElasticSearch is a 
much better fit.

So, bottom line, there's no simple answer. At this point, my personal 
preferred solution that I think balances simplicity with the need to 
have useful and friendly APIs along with an ability to isolate a failure 
domain as effectively as possible is the following:

1) Use a single top-level DB for the API database (mapping tables, 
request and build spec tables, instance types (flavors) and keypairs. 
This DB could be an RDBMS or Cassandra, depending on the deployer's 
preferences around availability guarantees, existing sysadmin/DBA 
knowledge and other factors.

2) Use a single top-level RDBMS for the scheduler database (resource 
classes, resource providers, resource pools, aggregates, aggregate 
mapping tables, capabilities, inventories and allocations tables). The 
querying patterns for this type of data is specifically an excellent fit 
for an RDBMS and a single RDBMS should be able to service a *very large* 
(500K+ resource providers) deployment with minimal caching needed to 
handle quota, resource claim and resource capacity queries.

3) Use a single DB in each cell to store information about the instances 
located in that cell and the instance tasks being performed against 
those instances. This DB could be either an RDBMS or Cassandra, 
depending on the deployer's preferences. The key here is to keep the 
cell a fully isolated failure domain.

4) Have all instance listing operations in the Compute API proxy to a 
call to Project Searchlight and ElasticSearch to handle the cross-cell 
queries against the instance and instance update (task) information.

I think the above provides a pretty good balance of using tools that 
fits the data access patterns properly and keeping the original idea of 
isolated failure domains a strong priority for the architecture of Nova.


More information about the OpenStack-dev mailing list