[openstack-dev] TC Meeting / Savanna Incubation Follow-Up

Sergey Lukjanov slukjanov at mirantis.com
Wed Sep 11 20:16:18 UTC 2013


Hi folks,

Initial discussion of the Savanna incubation request started yesterday. The two major topics discussed were Heat integration and a “clustering library” [1].

To start, let me give a brief overview of Savanna’s key features:
1. Provisioning of the underlying OpenStack resources (compute, volumes, networks) required for a Hadoop cluster.
2. Hadoop cluster deployment and configuration.
3. Integration with different Hadoop distributions through a plugin mechanism, with a single control plane for all of them. In the future this can be used to integrate with other data processing frameworks, for example Twitter Storm.
4. Reliability and performance optimizations to ensure Hadoop cluster performance on top of OpenStack, like enabling Swift to be used as the underlying HDFS and exposing information on Swift data locality to the Hadoop scheduler.
5. A set of Elastic Data Processing features:
  * On-demand execution of Hadoop jobs
  * A pool of different external data sources, like Swift, external Hadoop clusters, NoSQL and traditional databases
  * Pig and Hive integration
6. An OpenStack Dashboard plugin for all of the above.
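To make the provisioning feature (item 1) a bit more concrete, here is a rough sketch of what a cluster-create request to Savanna might look like. Note that the field names and structure below are illustrative assumptions for this email, not the exact Savanna REST API schema:

```python
import json

def build_cluster_request(name, plugin, hadoop_version, node_groups):
    """Compose a cluster-create request body.

    The field names here are illustrative assumptions,
    not the exact Savanna REST API schema.
    """
    return {
        "cluster": {
            "name": name,
            "plugin_name": plugin,              # e.g. "vanilla"
            "hadoop_version": hadoop_version,   # e.g. "1.2.1"
            "node_groups": node_groups,         # instance counts per role
        }
    }

body = build_cluster_request(
    "demo-cluster", "vanilla", "1.2.1",
    [{"name": "master", "count": 1}, {"name": "worker", "count": 3}],
)
print(json.dumps(body, indent=2))
```

Savanna takes a declarative description like this and performs resource provisioning, deployment and configuration on the user’s behalf.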

I highly recommend viewing our screencast about the Savanna 0.2 release (mid-July) [2] to better understand Savanna’s functionality.

As you can see, resource provisioning is just one of the features, and its implementation details are not critical for the overall architecture; it performs only the first step of the cluster setup. We’ve been considering Heat for a while, but ended up with direct API calls in favor of speed and simplicity. Going forward, Heat integration will be done by implementing an extension mechanism ([3] and [4]) as part of the Icehouse release.
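The extension mechanism from [3]/[4] can be pictured roughly as follows. This is a hypothetical sketch (the class and method names are assumptions, not code from Savanna): provisioning goes through a pluggable engine, so the current direct-API engine can later be swapped for a Heat-backed one without touching the rest of the architecture.

```python
import abc

class InfrastructureEngine(abc.ABC):
    """Hypothetical extension point: how cluster resources get created."""

    @abc.abstractmethod
    def create_cluster(self, cluster_spec):
        """Provision instances/volumes/networks for the cluster."""

class DirectEngine(InfrastructureEngine):
    """Current approach: direct calls to the compute/volume/network APIs."""
    def create_cluster(self, cluster_spec):
        return "provisioned %d node(s) via direct API calls" % cluster_spec["count"]

class HeatEngine(InfrastructureEngine):
    """Planned approach: generate a Heat template and create a stack."""
    def create_cluster(self, cluster_spec):
        return "provisioned %d node(s) via a Heat stack" % cluster_spec["count"]

# The engine would be chosen from configuration; everything above it
# (plugins, EDP, the dashboard) stays unchanged.
engine = DirectEngine()
print(engine.create_cluster({"count": 4}))
```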

The next part, Hadoop cluster configuration, is already extensible, and we have several plugins: Vanilla, Hortonworks Data Platform, and a Cloudera plugin that has been started as well. This allows unifying the management of different Hadoop distributions under a single control plane. The plugins are responsible for correct Hadoop ecosystem configuration on already-provisioned resources and use different Hadoop management tools, like Ambari, to set up and configure all cluster services; consequently, there are no actual provisioning configs on the Savanna side in this case. Savanna and its plugins encapsulate the knowledge of Hadoop internals and the default configuration for Hadoop services.
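The plugin mechanism can be sketched roughly like this (a simplification with assumed method names, not the exact Savanna plugin SPI): each plugin knows how to configure and start its own distribution on resources Savanna has already provisioned, while the control plane treats all of them uniformly.

```python
import abc

class ProvisioningPlugin(abc.ABC):
    """Simplified sketch of a distribution plugin interface."""

    @abc.abstractmethod
    def configure_cluster(self, cluster):
        """Lay down distribution-specific configs on provisioned nodes."""

    @abc.abstractmethod
    def start_cluster(self, cluster):
        """Start all Hadoop services."""

class VanillaPlugin(ProvisioningPlugin):
    """Deploys stock Apache Hadoop directly."""
    def configure_cluster(self, cluster):
        return "wrote core-site.xml/hdfs-site.xml on %s" % cluster
    def start_cluster(self, cluster):
        return "started namenode, datanodes, jobtracker on %s" % cluster

class HDPPlugin(ProvisioningPlugin):
    """Delegates configuration to a management tool such as Ambari."""
    def configure_cluster(self, cluster):
        return "registered %s with Ambari" % cluster
    def start_cluster(self, cluster):
        return "asked Ambari to start services on %s" % cluster

# Single control plane: the same calls work for every distribution.
for plugin in (VanillaPlugin(), HDPPlugin()):
    plugin.configure_cluster("demo")
    plugin.start_cluster("demo")
```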



The next topic is “Cluster API”.

The concern that was raised is how to extract general clustering functionality into a common library. Cluster provisioning and management is currently relevant to a number of projects within the OpenStack ecosystem: Savanna, Trove, TripleO, Heat and TaskFlow.

Still, each of the projects has its own understanding of what cluster provisioning is. The idea of extracting common functionality sounds reasonable, but the details still need to be worked out.

I’ll try to highlight the Savanna team’s current perspective on this question. In my view, the notion of “cluster management” has several levels:
1. Resource provisioning and configuration (instances, networks, storage). Heat is the main tool here, possibly with additional support from underlying services. For example, an instance grouping API extension [5] in Nova would be very useful.
2. Distributed communication/task execution. There is a project in the OpenStack ecosystem whose mission is to provide a framework for distributed task execution - TaskFlow [6]. It was started quite recently. In Savanna we are really looking forward to using more and more of its functionality during the I and J cycles as TaskFlow matures.
3. Higher-level clustering - management of the actual services running on top of the infrastructure: for example, configuring HDFS data nodes in Savanna, or setting up a MySQL cluster with Percona or Galera in Trove. These operations are typically very specific to the project’s domain. As for Savanna specifically, we make heavy use of our knowledge of Hadoop internals to deploy and configure it properly.
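To illustrate what level 2 buys us, here is a self-contained toy runner in the spirit of TaskFlow’s linear flows - this is not TaskFlow’s actual API (the real library adds persistence, retries and parallel patterns), just a sketch of the execute/revert idea a cluster setup would rely on:

```python
def run_linear_flow(tasks, context):
    """Run tasks in order; on failure, revert completed tasks in reverse.

    Toy illustration of a TaskFlow-style linear flow, not the
    real TaskFlow API.
    """
    done = []
    try:
        for name, execute, revert in tasks:
            execute(context)
            done.append((name, revert))
    except Exception:
        # Roll back whatever already succeeded, newest first.
        for name, revert in reversed(done):
            revert(context)
        raise
    return context

# Two hypothetical cluster-setup steps, each with an undo action.
flow = [
    ("provision", lambda ctx: ctx.setdefault("nodes", 3),
                  lambda ctx: ctx.pop("nodes", None)),
    ("configure", lambda ctx: ctx.setdefault("configured", True),
                  lambda ctx: ctx.pop("configured", None)),
]
result = run_linear_flow(flow, {})
```

If the “configure” step raised, the runner would revert “provision” before re-raising - exactly the kind of bookkeeping we would rather delegate to TaskFlow than reimplement per project.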

The overall conclusion seems to be that it makes sense to enhance Heat’s capabilities and invest in TaskFlow development, leaving domain-specific operations to the individual projects.

I would also like to emphasize that Hadoop cluster management, including scaling support, is already implemented in Savanna.

With all this, I do believe Savanna fills an important gap in OpenStack by providing data processing capabilities in a cloud environment in general, with Hadoop ecosystem integration as the first concrete step.

The Hadoop ecosystem is huge in its own right, and this integration will add significant value to the OpenStack community and its users [7].


[1] http://eavesdrop.openstack.org/meetings/tc/2013/tc.2013-09-10-20.02.log.html
[2] http://www.youtube.com/watch?v=SrlHM0-q5zI
[3] https://blueprints.launchpad.net/savanna/+spec/infra-provisioning-extensions
[4] https://blueprints.launchpad.net/savanna/+spec/heat-backed-resources-provisioning
[5] https://blueprints.launchpad.net/nova/+spec/instance-group-api-extension
[6] https://launchpad.net/taskflow
[7] http://www.google.com/trends/explore?q=openstack%2Chadoop#q=openstack%2C%20hadoop&cmpt=q

Sincerely yours,
Sergey Lukjanov
Savanna Technical Lead
Mirantis Inc.



