[openstack-dev] TC Meeting / Savanna Incubation Follow-Up

Sudarshan Acharya sudarshan.acharya at gmail.com
Thu Sep 12 16:07:42 UTC 2013


On Sep 12, 2013, at 10:30 AM, Michael Basnight wrote:

> On Sep 12, 2013, at 2:39 AM, Thierry Carrez wrote:
> 
>> Sergey Lukjanov wrote:
>> 
>>> [...]
>>> As you can see, resource provisioning is just one of the features, and the implementation details are not critical for the overall architecture. It performs only the first step of the cluster setup. We’ve been considering Heat for a while, but ended up with direct API calls in favor of speed and simplicity. Going forward, Heat integration will be done by implementing the extension mechanism [3] and [4] as part of the Icehouse release.
>>> 
>>> The next part, Hadoop cluster configuration, is already extensible and we have several plugins - Vanilla, Hortonworks Data Platform, and a Cloudera plugin has been started too. This allows us to unify management of different Hadoop distributions under a single control plane. The plugins are responsible for correct Hadoop ecosystem configuration on the already provisioned resources and use different Hadoop management tools like Ambari to set up and configure all cluster services, so there are no actual provisioning configs on the Savanna side in this case. Savanna and its plugins encapsulate the knowledge of Hadoop internals and the default configuration for Hadoop services.
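
To make the plugin mechanism described above a bit more concrete, here is a rough sketch of the kind of interface such a provisioning plugin could expose. The class and method names are purely illustrative and are not taken from the actual Savanna plugin SPI:

    class ProvisioningPluginBase(object):
        """Hypothetical plugin contract: each Hadoop distribution
        (Vanilla, HDP, CDH, ...) implements these hooks against
        instances that Savanna has already provisioned."""

        def get_versions(self):
            """Hadoop versions this plugin knows how to deploy."""
            raise NotImplementedError()

        def configure_cluster(self, cluster):
            """Push Hadoop configuration to the provisioned instances,
            e.g. by driving a management tool such as Ambari."""
            raise NotImplementedError()

        def start_cluster(self, cluster):
            """Start HDFS/MapReduce services once configuration is done."""
            raise NotImplementedError()
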
>> 
>> My main gripe with Savanna is that it combines (in its upcoming release)
>> what sounds to me like two very different services: a Hadoop cluster
>> provisioning service (like what Trove does for databases) and a
>> MapReduce+ data API service (like what Marconi does for queues).
>> 
>> Making it part of the same project (rather than two separate projects,
>> potentially sharing the same program) makes discussions about shifting
>> some of its clustering ability to another library/project more complex
>> than they should be (see below).
>> 
>> Could you explain the benefit of having them within the same service,
>> rather than two services with one consuming the other?
> 
> And for the record, I don't think that Trove is the perfect fit for it today. We are still working on a clustering API. But when we create it, I would love the Savanna team's input, so we can try to make a pluggable API that's usable for people who want MySQL or Cassandra or even Hadoop. I'm less a fan of a clustering library, because in the end, we will both have API calls like POST /clusters, GET /clusters, and there will be API duplication between the projects.


+1. I am looking at the new cluster provisioning API in Trove [1] and the one in Savanna [2], and they look quite different right now. Some collaboration is definitely needed, even on the API spec, not just the backend.

[1] https://wiki.openstack.org/wiki/Trove-Replication-And-Clustering-API#POST_.2Fclusters
[2] https://savanna.readthedocs.org/en/latest/userdoc/rest_api_v1.0.html#start-cluster
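
To make the divergence more concrete (and to illustrate Michael's point that both projects end up exposing something like POST /clusters), here is a rough, purely hypothetical sketch of two cluster-creation payloads. The field names below are invented for illustration and are not taken verbatim from either spec; see [1] and [2] for the actual parameters:

    import json

    # Hypothetical database-style cluster request (Trove-like shape):
    # the caller thinks in terms of datastores, flavors and volumes.
    db_cluster = {
        "cluster": {
            "name": "products",
            "datastore": {"type": "mysql", "version": "5.5"},
            "instances": [
                {"flavorRef": "7", "volume": {"size": 2}},
                {"flavorRef": "7", "volume": {"size": 2}},
                {"flavorRef": "7", "volume": {"size": 2}},
            ],
        }
    }

    # Hypothetical Hadoop-style cluster request (Savanna-like shape):
    # the caller thinks in terms of plugins, Hadoop versions and templates.
    hadoop_cluster = {
        "name": "doc-cluster",
        "plugin_name": "vanilla",
        "hadoop_version": "1.1.2",
        "cluster_template_id": "<template-uuid>",
    }

    # Both ultimately become a POST to .../clusters, which is exactly
    # where the duplication (or the opportunity to share a spec) shows up.
    for body in (db_cluster, hadoop_cluster):
        print(json.dumps(body, indent=2))

Even at this level the resource shapes differ: one nests everything under a "cluster" wrapper and enumerates instances, while the other references a pre-built template. Converging on that kind of thing seems like exactly the collaboration worth doing during incubation.
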


> 
>> 
>>> The next topic is “Cluster API”.
>>> 
>>> The concern that was raised is how to extract general clustering functionality into a common library. The cluster provisioning and management topic is currently relevant for a number of projects within the OpenStack ecosystem: Savanna, Trove, TripleO, Heat, and TaskFlow.
>>> 
>>> Still, each of the projects has its own understanding of what cluster provisioning is. The idea of extracting common functionality sounds reasonable, but the details still need to be worked out.
>>> 
>>> I’ll try to highlight the Savanna team’s current perspective on this question. The notion of “cluster management”, in my view, has several levels:
>>> 1. Resource provisioning and configuration (instances, networks, storage). Heat is the main tool here, possibly with additional support from underlying services. For example, the instance grouping API extension [5] in Nova would be very useful.
>>> 2. Distributed communication/task execution. There is a project in the OpenStack ecosystem with the mission to provide a framework for distributed task execution - TaskFlow [6]. It was started quite recently. In Savanna we are really looking forward to using more and more of its functionality in the I and J cycles as TaskFlow itself becomes more mature.
>>> 3. Higher-level clustering - management of the actual services running on top of the infrastructure. For example, in Savanna configuring HDFS data nodes, or in Trove setting up a MySQL cluster with Percona or Galera. These operations are typically very specific to the project domain. As for Savanna specifically, we rely heavily on knowledge of Hadoop internals to deploy and configure it properly.
>>> 
>>> The overall conclusion seems to be that it makes sense to enhance Heat’s capabilities and invest in TaskFlow development, leaving domain-specific operations to the individual projects.
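
As a side note on point 2 above, for anyone who has not looked at TaskFlow yet: below is a minimal sketch of how cluster setup steps could be expressed as a TaskFlow flow. It uses TaskFlow's task / linear_flow / engines modules, but the task classes are invented for illustration and this is not how Savanna is wired up today:

    from taskflow import engines
    from taskflow import task
    from taskflow.patterns import linear_flow


    class ProvisionResources(task.Task):
        """Level 1: instances, networks, volumes (e.g. via Heat)."""
        def execute(self):
            print("provisioning resources")


    class ConfigureHadoop(task.Task):
        """Level 3: domain-specific setup done by the chosen plugin."""
        def execute(self):
            print("configuring HDFS / MapReduce services")


    # Level 2 is the glue TaskFlow provides: ordering, state tracking
    # and (eventually) distributed execution of these steps.
    flow = linear_flow.Flow("cluster-setup").add(
        ProvisionResources(),
        ConfigureHadoop(),
    )
    engines.run(flow)
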
>> 
>> The thing we'd need to clarify (and the incubation period would be used
>> to achieve that) is how to reuse as much as possible between the various
>> cluster provisioning projects (Trove, the cluster side of Savanna, and
>> possibly future projects). The solution could be to create a library used by
>> Trove and Savanna, to extend Heat, or to make Trove the clustering thing
>> beyond just databases...
>> 
>> One way of making sure smart and non-partisan decisions are taken in
>> that area would be to make Trove and Savanna part of the same program,
>> or make the clustering part of Savanna part of the same program as
>> Trove, while the data API part of Savanna could live separately (hence
>> my question about two different projects vs. one project above).
> 
> Trove is not, nor will be, a data API. I'd like to keep Savanna in its own program, but I could easily see them as being a big data / data processing program, while Trove is a cluster provisioning / scaling / administration / "keep it online" program.
> 
>> 
>>> I would also like to emphasize that in Savanna, Hadoop cluster management is already implemented, including scaling support.
>>> 
>>> With all this, I do believe Savanna fills an important gap in OpenStack by providing data processing capabilities in a cloud environment in general, with integration with the Hadoop ecosystem as the first concrete step.
>> 
>> For incubation we bless the goal of the project and the promise that it
>> will integrate well with the other existing projects. A
>> perfectly-working project can stay in incubation until it achieves
>> proper integration and avoids duplication of functionality with other
>> integrated projects. A perfectly-working project can also happily live
>> outside of OpenStack integrated release if it prefers a more standalone
>> approach.
> 
> A good example. Our instance provisioning was also implemented in Trove, but the goal is to use Heat. So the TC asked us to use Heat for instance provisioning, and we outlined a set of goals to achieve before we went to Integrated status.
> 
>> I think there is value in having Savanna in incubation so that we can
>> explore those avenues of collaboration between projects. It may take
>> more than one cycle of incubation to get it right (in fact, I would not
>> be surprised at all if it took us more than one cycle to properly
>> separate the roles between Trove / TaskFlow / Heat / clusterlib). During
>> this exploration, Savanna devs may also decide that integration is very
>> costly and that their immediate time is better spent adding key
>> features, and drop from the incubation track. But in all cases,
>> incubation sounds like the right first step to get everyone around the
>> same table.
>> 
>> -- 
>> Thierry Carrez (ttx)
>> 
