[openstack-dev] TC Meeting / Savanna Incubation Follow-Up

Thierry Carrez thierry at openstack.org
Thu Sep 12 09:39:22 UTC 2013


Sergey Lukjanov wrote:

> [...]
> As you can see, resource provisioning is just one of the features, and its implementation details are not critical to the overall architecture. It performs only the first step of the cluster setup. We had been considering Heat for a while, but ended up using direct API calls in favor of speed and simplicity. Going forward, Heat integration will be done by implementing the extension mechanism ([3], [4]) as part of the Icehouse release.
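> 
> For illustration, here is a rough sketch of what the direct-call approach amounts to, assuming python-novaclient (simplified; function name, parameters and auth handling are illustrative, not our exact code):
> 
>     from novaclient import client as nova_client
> 
>     def provision_cluster(auth, node_count, flavor_id, image_id):
>         """Boot the cluster instances directly against the Nova API."""
>         nova = nova_client.Client('2', auth['user'], auth['password'],
>                                   auth['tenant'], auth_url=auth['auth_url'])
>         servers = []
>         for i in range(node_count):
>             servers.append(nova.servers.create(name='hadoop-node-%d' % i,
>                                                image=image_id,
>                                                flavor=flavor_id))
>         return servers
> 
> With Heat integration, this loop would instead become a template handed to Heat, which also takes care of dependency ordering and rollback.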
> 
> The next part, Hadoop cluster configuration, is already extensible, and we have several plugins: Vanilla, Hortonworks Data Platform, and a Cloudera plugin that has been started too. This allows us to unify management of different Hadoop distributions under a single control plane. The plugins are responsible for correct Hadoop ecosystem configuration on the already provisioned resources and use different Hadoop management tools, such as Ambari, to set up and configure all cluster services, so there are no actual provisioning configs on the Savanna side in this case. Savanna and its plugins encapsulate the knowledge of Hadoop internals and the default configuration for Hadoop services.
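> 
> Schematically, each plugin implements a small contract against the already provisioned VMs. A simplified sketch (the method names here are illustrative, not the exact SPI):
> 
>     import abc
> 
>     class ProvisioningPluginBase(object):
>         """Contract implemented by every Hadoop distribution plugin."""
>         __metaclass__ = abc.ABCMeta
> 
>         @abc.abstractmethod
>         def configure_cluster(self, cluster):
>             """Push Hadoop configuration to the provisioned instances."""
> 
>         @abc.abstractmethod
>         def start_cluster(self, cluster):
>             """Start HDFS/MapReduce services, e.g. through Ambari."""
> 
>         @abc.abstractmethod
>         def scale_cluster(self, cluster, instances):
>             """Reconfigure the cluster after nodes are added or removed."""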

My main gripe with Savanna is that it combines (in its upcoming release)
what sounds to me like two very different services: a Hadoop cluster
provisioning service (like what Trove does for databases) and a
MapReduce+ data API service (like what Marconi does for queues).

Making it part of the same project (rather than two separate projects,
potentially sharing the same program) makes discussions about shifting
some of its clustering ability to another library/project more complex
than they should be (see below).

Could you explain the benefit of having them within the same service,
rather than as two services with one consuming the other?

> The next topic is “Cluster API”.
> 
> The concern that was raised is how to extract general clustering functionality into a common library. Cluster provisioning and management is currently relevant to a number of projects within the OpenStack ecosystem: Savanna, Trove, TripleO, Heat, and TaskFlow.
> 
> Still, each of the projects has its own understanding of what cluster provisioning is. The idea of extracting common functionality sounds reasonable, but the details still need to be worked out.
> 
> I’ll try to highlight the Savanna team’s current perspective on this question. The notion of “cluster management”, in my view, has several levels:
> 1. Resource provisioning and configuration (instances, networks, storage). Heat is the main tool here, possibly with additional support from underlying services. For example, the instance grouping API extension [5] in Nova would be very useful.
> 2. Distributed communication/task execution. There is a project in the OpenStack ecosystem whose mission is to provide a framework for distributed task execution: TaskFlow [6]. It was started quite recently. In Savanna we are really looking forward to using more and more of its functionality in the I and J cycles as TaskFlow itself matures (see the sketch after this list).
> 3. Higher-level clustering: management of the actual services running on top of the infrastructure. For example, configuring HDFS data nodes in Savanna, or setting up a MySQL cluster with Percona or Galera in Trove. These operations are typically very specific to the project’s domain. As for Savanna specifically, we rely heavily on knowledge of Hadoop internals to deploy and configure it properly.
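> 
> To make level 2 concrete, here is a minimal sketch of the kind of TaskFlow usage we have in mind (the task bodies are placeholders):
> 
>     from taskflow import engines, task
>     from taskflow.patterns import linear_flow
> 
>     class ProvisionNodes(task.Task):
>         def execute(self):
>             pass  # call Nova/Heat to create the instances
> 
>     class ConfigureHadoop(task.Task):
>         def execute(self):
>             pass  # plugin pushes configs and starts services
> 
>     # Tasks run in order; on failure, completed tasks are reverted.
>     flow = linear_flow.Flow('cluster-setup').add(ProvisionNodes(),
>                                                  ConfigureHadoop())
>     engines.run(flow)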
> 
> The overall conclusion seems to be that it makes sense to enhance Heat’s capabilities and invest in TaskFlow development, leaving domain-specific operations to the individual projects.

The thing we'd need to clarify (and the incubation period would be used
to achieve that) is how to reuse as much as possible between the various
cluster provisioning projects (Trove, the cluster side of Savanna, and
possibly future projects). A solution could be to create a library used
by Trove and Savanna, to extend Heat, or to make Trove the clustering
thing beyond just databases...

One way of making sure smart and non-partisan decisions are taken in
that area would be to make Trove and Savanna part of the same program,
or to place the clustering part of Savanna in the same program as
Trove, while the data API part of Savanna could live separately (hence
my question above about two different projects vs. one project).

> I would also like to emphasize that Hadoop cluster management, including scaling support, is already implemented in Savanna.
> 
> With all this, I do believe Savanna fills an important gap in OpenStack by providing Data Processing capabilities in a cloud environment in general, with integration with the Hadoop ecosystem as the first concrete step.

For incubation we bless the goal of the project and the promise that it
will integrate well with the other existing projects. A
perfectly-working project can stay in incubation until it achieves
proper integration and avoids duplication of functionality with other
integrated projects. A perfectly-working project can also happily live
outside of OpenStack integrated release if it prefers a more standalone
approach.

I think there is value in having Savanna in incubation so that we can
explore those avenues of collaboration between projects. It may take
more than one cycle of incubation to get it right (in fact, I would not
be surprised at all if it took us more than one cycle to properly
separate the roles between Trove / Taskflow / heat / clusterlib). During
this exploration, Savanna devs may also decide that integration is very
costly and that their immediate time is better spent adding key
features, and drop from the incubation track. But in all cases,
incubation sounds like the right first step to get everyone around the
same table.

-- 
Thierry Carrez (ttx)


