[openstack-dev] [trove] Adding support for HBase in Trove

Amrith Kumar amrith at tesora.com
Wed Jan 6 23:15:26 UTC 2016


TL;DR Should Trove treat HBase as a special database because one of its use cases is as part of a large multi-node Hadoop cluster, and therefore either not support it at all, or necessarily use Sahara to provision and manage a cluster? There are pros and cons to that approach, and I argue that the cons outweigh the pros. A blueprint/specification and an implementation of basic Trove support for HBase, independent of Sahara, have been submitted for review; see [3], [4] and [5]. The benefits include the ability to provide the standalone mode of operation that is commonly used in development, and the elimination of a dependency on an additional OpenStack project, which simplifies deployment. Comments and feedback are welcome on the implementation, as well as the specification and the approach.

The long version follows below.

The OpenStack Trove mission is to provide scalable and reliable Cloud Database as a Service provisioning functionality for both relational and non-relational database engines, and to continue to improve its fully-featured and extensible open source framework [1].

An important aspect of the Trove value proposition is that it provides a common control plane, a common API, and a common set of abstractions that are used to manage a number of different relational and non-relational database technologies. The common API contains primitives to create database instances and clusters of a number of databases including MySQL (MariaDB, Percona too), PostgreSQL, MongoDB, Cassandra, CouchDB, Couchbase, IBM DB2, Vertica, and Redis.

Cluster support is also available for a number of databases including MongoDB, Percona XtraDB Cluster and Vertica, with more to come soon.

In effect, Trove is a framework for provisioning and managing the lifecycle of a number of different database technologies; it provides only the control plane. Users can do things like provisioning instances and clusters, resizing them, taking backups and creating new instances and clusters from previous backups, and establishing and managing complex topologies including replication and clustering.

Trove does not interfere with the data plane; applications interact directly with the database using the native APIs of each database technology.

Users of OpenStack look to Trove to provide a consistent set of interfaces for managing their database resources in a variety of use cases, ranging from small-scale prototyping, development, and testing all the way through production. Apache HBase is an open-source, distributed, versioned, non-relational database [2], and users of HBase face many of the challenges that Trove addresses for other databases. Therefore adding support for HBase in Trove seems not only reasonable, but also consistent with the goal of the (Trove) project.

A spec proposing the addition of HBase support for Trove was submitted [3] and a first phase of code implementing this HBase support has also been submitted for review [4], [5]. The process that has been followed is consistent with other Trove datastores: add basic support and then progressively augment it in subsequent releases. The code submitted allows you to build an HBase guest image using the elements provided, provision an HBase instance (which will launch on a Nova instance), resize the storage and the instance, take a "backup" of the instance and store that backup on Swift, and at a later time launch a new instance from that "backup".

One can operate HBase with or without HDFS; in fact HBase documents the standalone mode of operation [6] where HBase is completely operational on a single node and data is stored on the local file system. This standalone mode provides a very useful construct for development and testing, and at a later stage an application can be seamlessly migrated to work with an HBase installation of some other "run mode" like "Fully Distributed".

The code submitted in [4] and [5], as described in [3], implements support for two modes of operation, namely "Standalone" and "Pseudo-Distributed". At a later stage, support will be added for "Fully Distributed", consistent with the way in which clustering support was delivered for other datastores like MySQL and MongoDB.
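
To make the difference between the two run modes concrete, here is a rough sketch (not taken from the patches in [4] and [5]) of how a minimal hbase-site.xml could be rendered for each mode. The HBase properties hbase.rootdir and hbase.cluster.distributed are real; the function name, the local path and the HDFS address are assumptions chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical helper, for illustration only; the Trove guest agent in the
# submitted patches may generate its configuration quite differently.
def hbase_site_xml(run_mode, root_dir):
    """Render a minimal hbase-site.xml for the given HBase run mode."""
    if run_mode not in ("standalone", "pseudo-distributed"):
        raise ValueError("unsupported run mode: %s" % run_mode)
    properties = {
        # Where HBase persists its data (local file system or HDFS).
        "hbase.rootdir": root_dir,
        # false: all daemons in one JVM (standalone);
        # true: separate daemons, here all on one host (pseudo-distributed).
        "hbase.cluster.distributed":
            "true" if run_mode == "pseudo-distributed" else "false",
    }
    configuration = ET.Element("configuration")
    for name, value in sorted(properties.items()):
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")

# Assumed example locations; adjust to the actual guest image layout.
standalone = hbase_site_xml("standalone", "file:///var/lib/hbase")
pseudo = hbase_site_xml("pseudo-distributed", "hdfs://localhost:8020/hbase")
print(standalone)
print(pseudo)
```

The point of the sketch is only that the two modes differ in configuration rather than in kind, which is why an application developed against a standalone instance can later move to a distributed one.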

Some have opined that Trove should not directly get into the business of orchestrating Hadoop Clusters or anything to do with HBase, arguing that this is something that Sahara already does, and should remain the sole domain of Sahara.

Since HBase is perfectly operable without HDFS, I believe it is inappropriate to tightly couple HBase with Sahara, whose primary motivation is to provision 'data-intensive application clusters' [7]. Furthermore, as we have found with other datastores, it is my belief that having a common implementation model across multiple deployment topologies is a benefit for Trove. Other considerations, such as similarity to other databases supported by Trove, motivated the choice described in the specification. An architecture where Trove can function entirely independently of Sahara is also a benefit for end users, and a model where Trove has dependencies only on other core OpenStack services considerably simplifies deployment.

Comments and feedback are welcome on the code, as well as the specification and the approach.

References:

[1] https://wiki.openstack.org/wiki/Trove#Mission_Statement
[2] https://hbase.apache.org/
[3] https://review.openstack.org/#/c/256079
[4] https://review.openstack.org/#/c/262048/
[5] https://review.openstack.org/#/c/262815/
[6] http://hbase.apache.org/0.94/book/standalone_dist.html
[7] https://wiki.openstack.org/wiki/Sahara

Thanks,

-amrith

--
Amrith Kumar, CTO                   | amrith at tesora.com
Tesora, Inc                         | @amrithkumar
125 CambridgePark Drive, Suite 400  | http://www.tesora.com
Cambridge, MA. 02140                |








