[openstack-dev] [trove] Adding support for HBase in Trove
msm at redhat.com
Thu Jan 7 20:12:11 UTC 2016
On 01/07/2016 11:59 AM, Amrith Kumar wrote:
> From the things that you and Pete (Peter MacKinnon) are saying, I don't understand why there is an objection to accepting the currently proposed implementation which is clearly for single node deployments? Both Standalone and Pseudo-Distributed are by definition, explicitly, necessarily, absolutely, positively, definitely single node. I can't be more explicit about that. That's all that is being proposed at this time. See more comments below.
i didn't think i explicitly objected to the spec, if it seems that way
then i apologize. after reading the spec and the comments, it seemed
that there was some question about engagement with the sahara team. i
wanted to help bring some light to the issues surrounding deploying
hbase and thought it would be good to participate in the discussion.
> Further, the current proposal also chooses an implementation strategy that makes it much easier to handle fully-distributed in a different way in the future. Consider this, Trove could equally well have dealt with HBase using a single datastore for all operating modes. In the current implementation, one would create a HBase standalone instance using a command that included:
> --datastore hbase-standalone
> And a pseudo-distributed instance by including
> --datastore hbase-pseudo-distributed.
and this delineation sounds reasonable to me
> Trove could equally well function by having a single datastore (hbase) but this would make hbase-fully-distributed harder to do in a different way in the future. I consciously eschewed that path, for this very specific reason; it would limit choice in the future.
> Now, the implementation behind hbase-fully-distributed could be a custom Trove guest agent that could (if we decided to go that route) interact with Sahara. However, an alternative implementation of hbase-fully-distributed could orchestrate everything natively in Trove. There is much flexibility in the current proposal, and I submit to you that this is being lost in your reading of the specification and the current implementation as proposed.
i don't think your characterization of my reading comprehension is fair.
as i stated earlier, i wanted to participate in the discussion
surrounding deploying a technology that sahara currently deploys. fwiw,
i agree with what you are saying here, but i also think it is axiomatic,
the trove team can choose whichever path it would like for implementation.
>> i think this sounds reasonable, as long as we are limiting it to standalone
>> mode. if the deployments start to take on a larger scope i agree it would be
>> useful to leverage sahara for provisioning and scaling.
> Why only standalone? The current proposal explicitly covers only standalone and pseudo-distributed which are both valid strictly (add other adjectives here to taste) single node topologies and the currently submitted specification specifically carves out fully-distributed operation as requiring further thought and contemplation.
i think starting with standalone mode (and not pseudo-distributed) is a
more conservative approach to this. my reason for suggesting limiting
this to standalone is that even in pseudo-distributed mode the need for
managing hdfs and zookeeper is present. i wanted to highlight some of
the overlap and the issues that will start to creep in surrounding
>> as the hbase installation grows beyond the standalone mode there will
>> necessarily need to be hdfs and zookeeper support to allow for a proper
>> production deployment. this also brings up questions of allowing the end-
>> users to supply configurations for the hdfs and zookeeper processes, not to
>> mention enabling support for high availability hdfs.
> These are things that Trove already addresses, albeit in a different way than Sahara. Users can, as it turns out, specify configuration groups which can then be used to launch new instances, and can also be associated with groups of instances.
i am merely identifying issues that trove will need to reproduce, i'm
not deeply familiar with the configuration options that trove exposes
but i am guessing that it is currently not generating the configurations
specific to hdfs and zookeeper.
>> i can envision a scenario where trove could use sahara to provision and
>> manage the clusters for hbase/hdfs/zk. this does pose some questions as
>> we'd have to determine how the trove guest agent would be installed on the
>> nodes, if there will need to be custom configurations used by trove, and if
>> sahara will need to provide a plugin for bare (meaning no data processing
>> framework) hbase/hdfs/zk clusters. but, i think these could be solved by
>> either using custom images or a plugin in sahara that would install the
>> necessary agents/configurations.
> Let us not underestimate the effort for an end user to now deploy one more project. To a user already using Trove for a myriad of databases, requiring Sahara for supporting HBase Standalone sounds (to put it bluntly) a burden. Requiring it for Fully-Distributed mode may have some development benefits but it remains to be seen whether those benefits are really worth the contortions that Trove would have to go through. And in the Trove architecture, there is flexibility as described above to have multiple possible implementations for fully-distributed, one that would interface with Sahara and another that didn't have to.
i agree about the installation issues when we are talking about
standalone versus distributed. as for the contortions that trove may
have to go through to integrate with sahara, i think it would be worth
it, but i'm probably biased here ;)
> Let's be clear that for a person who wants a fully configurable Hadoop based deployment with more control, Sahara may be the best option. And to one who wants even more control, maybe doing it themselves with Nova and custom Glance images is the way to go. Similarly, a Database-as-a-Service comes with the understood boundaries imposed by the "as-a-Service" deployment. Not all configuration options may be tweakable with a DBaaS, that's well known and understood, not just in Trove but also, for example, in Amazon RDS, RedShift or any of the other database-as-a-service implementations. The same would be true in fully-distributed as well, in the proposal that is currently under review. I submit to you that this nuance is being lost in your reading.
i'd like to think that for someone who wants a fully configurable hadoop
based deployment, sahara is the best option =)
i think we generally agree here about the deployment of "-aaS" services
in openstack, and again i disagree with your characterization of my
>> of course, this does add a layer of complexity as operators who wish this type
>> of deployment will need to have both trove and sahara, but imo this would
>> be easier than replicating the work that sahara has done with these
> I think this is where our opinions differ, as the 'replication' isn't all that much given the fact that Trove already provides capabilities to cluster databases. But, with that said, nothing in the current specification locks us into a specific deployment strategy in the future, nor does it preclude multiple implementations of fully-distributed, one which could leverage Sahara and one which didn't.
respectfully, i think there is more effort involved with the management
of the pseudo-distributed mode than standalone, and that is where
my comments are oriented. mind you, provisioning might be a
simple matter for trove as it stands now, but i think the potential for
issues could get deeper with pseudo-distributed.
i'm glad that you are open to the idea of implementations that may
involve other projects (namely sahara) in the future. as i said in the
beginning, given the comments about sahara in the spec and the review i
wanted to make sure we got a few more eyes on this to bring our
experience to the table.