[openstack-dev] [trove] Adding support for HBase in Trove

Amrith Kumar amrith at tesora.com
Thu Jan 7 21:59:30 UTC 2016


> -----Original Message-----
> From: michael mccune [mailto:msm at redhat.com]
> Sent: Thursday, January 07, 2016 3:12 PM
> To: openstack-dev at lists.openstack.org
> Subject: Re: [openstack-dev] [trove] Adding support for HBase in Trove
> 
> On 01/07/2016 11:59 AM, Amrith Kumar wrote:
> >  From the things that you and Pete (Peter MacKinnon) are saying, I don't
> understand why there is an objection to accepting the currently proposed
> implementation which is clearly for single node deployments? Both
> Standalone and Pseudo-Distributed are by definition, explicitly, necessarily,
> absolutely, positively, definitely single node. I can't be more explicit about
> that. That's all that is being proposed at this time. See more comments
> below.
> 
> i didn't think i explicitly objected to the spec, if it seems that way then i
> apologize. after reading the spec and the comments, it seemed that there
> was some question about engagement with the sahara team. i wanted to
> help bring some light to the issues surrounding deploying hbase and thought
> it would be good to participate in the discussion.

You are correct, Michael. In the Trove team meeting yesterday there was a suggestion that we should engage with the Sahara team, and that is what prompted this email thread. I appreciate your participation as a member of the Sahara team.

> 
> > Further, the current proposal also chooses an implementation strategy that
> makes it much easier to handle fully-distributed in a different way in the
> future. Consider this, Trove could equally well have dealt with HBase using a
> single datastore for all operating modes. In the current implementation, one
> would create a HBase standalone instance using a command that included:
> >
> > 	--datastore hbase-standalone
> >
> > And a pseudo-distributed instance by including
> >
> > 	--datastore hbase-pseudo-distributed.
> >
> 
> and this delineation sounds reasonable to me
> 
> > Trove could equally well function by having a single datastore (hbase) but
> this would make hbase-fully-distributed harder to do in a different way in the
> future. I consciously eschewed that path, for this very specific reason; it
> would limit choice in the future.
> 
> agreed
> 
> > Now, the implementation behind hbase-fully-distributed could be a
> custom Trove guest agent that could (if we decided to go that route) interact
> with Sahara. However, an alternative implementation of hbase-fully-
> distributed could orchestrate everything natively in Trove. There is much
> flexibility in the current proposal, and I submit to you that this is being lost in
> your reading of the specification and the current implementation as
> proposed.
> 
> i don't think your characterization of my reading comprehension is fair.
> as i stated earlier, i wanted to participate in the discussion surrounding
> deploying a technology that sahara currently deploys. fwiw, i agree with what
> you are saying here, but i also think it is axiomatic, the trove team can choose
> whichever path it would like for implementation.
> 
> >> i think this sounds reasonable, as long as we are limiting it to
> >> standalone mode. if the deployments start to take on a larger scope i
> >> agree it would be useful to leverage sahara for provisioning and scaling.
> >
> > Why only standalone? The current proposal explicitly covers only
> standalone and pseudo-distributed which are both valid strictly (add other
> adjectives here to taste) single node topologies and the currently submitted
> specification specifically carves out fully-distributed operation as requiring
> further thought and contemplation.
> 
> i think starting with standalone mode (and not pseudo-distributed) is a more
> conservative approach to this. my reason for suggesting limiting this to
> standalone is that even in pseudo-distributed mode the need for managing
> hdfs and zookeeper is present, i wanted to highlight some of the overlap
> and the issues that will start to creep in surrounding this deployment.
> 

The current code (submitted for review) provides both standalone and pseudo-distributed support. You will observe that both implementations install zookeeper. As you are no doubt aware, one of the recommended ways to force the HBase Master server to always bind to a well-known port rather than an ephemeral port is to set hbase.cluster.distributed to true (see https://review.openstack.org/#/c/262048/5/scripts/files/elements/ubuntu-hbase-standalone/install.d/20-install-hbase line 121). So, as it turns out, the code to deploy hdfs and zookeeper is already part of the proposed implementation.
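
To make the setting concrete, here is a minimal sketch (not the code under review) of how a Hadoop-style configuration file carrying hbase.cluster.distributed could be rendered. The helper name render_hbase_site is hypothetical; only the property name comes from the discussion above.

```python
import xml.etree.ElementTree as ET


def render_hbase_site(properties):
    """Render a dict of settings as Hadoop-style <configuration> XML.

    Hypothetical helper for illustration; not part of Trove or HBase.
    """
    root = ET.Element("configuration")
    for name, value in sorted(properties.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# hbase.cluster.distributed is the setting discussed above; the value
# is written as the lowercase string "true" in the XML file.
xml_text = render_hbase_site({"hbase.cluster.distributed": "true"})
print(xml_text)
```

The real element in the review does this from a shell install script; the Python form here is only meant to show the shape of the resulting file.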


> >> as the hbase installation grows beyond the standalone mode there will
> >> necessarily need to be hdfs and zookeeper support to allow for a
> >> proper production deployment. this also brings up questions of
> >> allowing the end- users to supply configurations for the hdfs and
> >> zookeeper processes, not to mention enabling support for high availability
> hdfs.
> >
> > These are things that Trove already addresses, albeit in a different way
> than Sahara. Users can, as it turns out, specify configuration groups which can
> then be used to launch new instances, and can also be associated with
> groups of instances.
> 
> i am merely identifying issues that trove will need to reproduce, i'm not
> deeply familiar with the configuration options that trove exposes but i am
> guessing that it is currently not generating the configurations specific to hdfs
> and zookeeper.
> 

It is equally important, I think, to realize that Trove doesn't have to produce a whole lot of new code to handle this, as it already has a robust framework that handles a number of databases. With a relatively small code footprint, much more flexible configuration support has been prototyped (it has not been sent up for review yet). The majority of that code is a codec for XML; the rest is almost completely handled by the framework, with the exception of a file specifying the configuration options that are to be supported.
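
As a rough illustration of the two pieces described above (an XML codec plus a list of supported options), here is a sketch under stated assumptions. It is not the prototype itself: the class name XmlPropertyCodec, the validate helper, and the contents of SUPPORTED_OPTIONS are all hypothetical, and the port value is only an example.

```python
import xml.etree.ElementTree as ET


class XmlPropertyCodec:
    """Hypothetical codec: dict <-> Hadoop-style <configuration> XML."""

    def serialize(self, options):
        root = ET.Element("configuration")
        for name, value in options.items():
            prop = ET.SubElement(root, "property")
            ET.SubElement(prop, "name").text = name
            ET.SubElement(prop, "value").text = str(value)
        return ET.tostring(root, encoding="unicode")

    def deserialize(self, text):
        root = ET.fromstring(text)
        return {
            prop.findtext("name"): prop.findtext("value")
            for prop in root.findall("property")
        }


# Hypothetical stand-in for the file listing the supported options.
SUPPORTED_OPTIONS = {"hbase.cluster.distributed", "hbase.master.port"}


def validate(options):
    """Reject any option that is not on the supported list."""
    unknown = set(options) - SUPPORTED_OPTIONS
    if unknown:
        raise ValueError("unsupported options: %s" % sorted(unknown))
    return options


codec = XmlPropertyCodec()
overrides = validate({"hbase.master.port": "16000"})
round_tripped = codec.deserialize(codec.serialize(overrides))
```

The point of the split is that the codec knows nothing about HBase; only the list of supported options is datastore-specific.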

I'd like to reiterate that Trove, by its very design, was intended to support a number of databases and so already has much of the framework in place to add support for a new one. There isn't a lot of new code that must be 'reproduced' to add this support.

> >> i can envision a scenario where trove could use sahara to provision
> >> and manage the clusters for hbase/hdfs/zk. this does pose some
> >> questions as we'd have to determine how the trove guest agent would
> >> be installed on the nodes, if there will need to be custom
> >> configurations used by trove, and if sahara will need to provide a
> >> plugin for bare (meaning no data processing
> >> framework) hbase/hdfs/zk clusters. but, i think these could be solved
> >> by either using custom images or a plugin in sahara that would
> >> install the necessary agents/configurations.
> >
> > Let us not underestimate the effort for an end user to now deploy one
> more project. To a user already using Trove for a myriad of databases,
> requiring Sahara for supporting HBase Standalone sounds (to put it bluntly) a
> burden. Requiring it for Fully-Distributed mode may have some development
> benefits but it remains to be seen whether those benefits are really worth
> the contortions that Trove would have to go through. And in the Trove
> architecture, there is flexibility as described above to have multiple possible
> implementations for fully-distributed, one that would interface with Sahara
> and another that didn't have to.
> 
> i agree about the installation issues when we are talking about standalone
> versus distributed. as for the contortions that trove may have to go through
> to integrate with sahara, i think it would be worth it, but i'm probably biased
> here ;)
> 
> > Let's be clear that for a person who wants a fully configurable Hadoop
> based deployment with more control, Sahara may be the best option. And to
> one who wants even more control, maybe doing it themselves with Nova
> and custom Glance Images is the way to go. Similarly, a Database-as-a-
> Service comes with the understood boundaries imposed by the "as-a-
> Service" deployment. Not all configuration options may be tweakable with a
> DBaaS, that's well known and understood, not just in Trove but also, for
> example, in Amazon RDS, RedShift or any of the other database-as-a-service
> implementations. The same would be true in fully-distributed as well, in the
> proposal that is currently under review. I submit to you that this nuance is
> being lost in your reading.
> 
> i'd like to think that for someone who wants a fully configurable hadoop based
> deployment, sahara is the best option =)
> 
> i think we generally agree here about the deployment of "-aaS" services in
> openstack, and again i disagree with your characterization of my reading
> comprehension...
> 
> >> of course, this does add a layer of complexity as operators who wish
> >> this type of deployment will need to have both trove and sahara, but
> >> imo this would be easier than replicating the work that sahara has
> >> done with these technologies.
> >
> > I think this is where our opinions differ, as the 'replication' isn't all that
> much given the fact that Trove already provides capabilities to cluster
> databases. But, with that said, nothing in the current specification locks us
> into a specific deployment strategy in the future, nor does it preclude
> multiple implementations of fully-distributed, one which could leverage
> Sahara and one which didn't.
> 
> respectfully, i think there is more effort involved with the management of
> the pseudo-distributed mode than standalone, and that is more where my
> comments are oriented towards. mind you, provisioning might be a simple
> matter for trove as it stands now, but i think the potential for issues could get
> deeper with pseudo-distributed.

Here, again, I want to point out that pseudo-distributed will definitely raise more issues than standalone. But Trove is already a multi-database framework, so adding support for one more database doesn't require a whole new implementation.

> 
> i'm glad that you are open to the idea of implementations that may involve
> other projects (namely sahara) in the future. as i said in the beginning, given
> the comments about sahara in the spec and the review i wanted to make
> sure we got a few more eyes on this to bring our experience to the table.

Absolutely, that's the intent of the ML conversation.

> 
> regards,
> mike
> 


