[openstack-dev] [Sahara] Spark plugin: EDP and Spark jobs

Daniele Venzano venza at brownhat.org
Fri Jun 6 08:54:44 UTC 2014


Dear all,

A short while ago the Spark plugin for Sahara was merged, opening up the
possibility of deploying Spark clusters with one click from OpenStack.
Since Spark is quite different from Hadoop, we need to make a number of
decisions on how to implement important features, EDP in particular.
Spark does not have a built-in job-server, and EDP needs a generic,
high-level interface to submit a job, check its basic status and kill
it.
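
To make the three operations concrete, here is a minimal Python sketch of
such an interface. The class and method names (JobEngine, submit, status,
kill) and the status strings are hypothetical, chosen for illustration
only; this is not existing Sahara code. The in-memory backend is a toy
that only shows the call sequence.

```python
import abc


class JobEngine(abc.ABC):
    """Hypothetical minimal interface EDP would need from any backend."""

    @abc.abstractmethod
    def submit(self, job_spec):
        """Start a job and return an opaque job id."""

    @abc.abstractmethod
    def status(self, job_id):
        """Return a coarse status string such as 'RUNNING' or 'KILLED'."""

    @abc.abstractmethod
    def kill(self, job_id):
        """Stop a running job."""


class InMemoryEngine(JobEngine):
    """Toy backend (an assumption) that only records job states."""

    def __init__(self):
        self._jobs = {}

    def submit(self, job_spec):
        job_id = "job-%d" % (len(self._jobs) + 1)
        self._jobs[job_id] = "RUNNING"
        return job_id

    def status(self, job_id):
        return self._jobs[job_id]

    def kill(self, job_id):
        self._jobs[job_id] = "KILLED"
```

Whichever option is chosen below (Oozie, a job-server, or direct
commands) would sit behind an interface of roughly this shape.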

In summary, this is our understanding of the current situation:
1. a quick hack is to use Oozie for application submission (this mimics
what Cloudera did by the end of last year, when preparing to announce
the integration of Spark in CDH)
2. an alternative is to use a Spark job-server, which would replace Oozie
(there is a repo on GitHub from Ooyala that implements such a
job-server)

Here's our view on the points above:
1. clearly, the first approach is an "ugly" hack that creates a
dependency on Oozie. Oozie requires MapReduce, and uses tiny
map-only jobs to submit parts of a larger workflow. Besides the
dependencies, this is a bazooka to kill a fly, as we're not addressing
Spark application workflows right now
2. the Spark job-server idea is cleaner, but the current project from
Ooyala supports an old version of Spark. Spark 1.0.0 (which we have
already tested in Sahara and will commit soon) offers some new
methods to package and submit applications, which can drastically
simplify the "job-server"

As a consequence, the open question is: do we contribute to that
project, create a new one, or contribute directly to Spark?

A few more points:
- assuming we have a working prototype of 2), we need to modify the
Sahara setup so that it deploys, in addition to the usual suspects
(master and slaves), one more service: the Spark job-server

There is also a third possibility: bypassing the job-server problem
and calling Spark commands directly on the master node of the cluster.
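
As a rough sketch of this third option, the Python snippet below builds a
spark-submit command line (the launcher script shipped with Spark 1.0.0)
that could then be run on the master node. The helper name
build_spark_submit, the /opt/spark install path and the example jar and
class are assumptions for illustration; how the command is actually
executed on the master (for example over SSH), and how status and kill
would be handled, are left out.

```python
import shlex


def build_spark_submit(app_jar, main_class, master_url,
                       app_args=(), spark_home="/opt/spark"):
    """Return a shell-safe spark-submit command line as a string.

    spark_home is an assumed install location, not a Sahara default.
    """
    cmd = [
        spark_home + "/bin/spark-submit",
        "--class", main_class,    # entry point of the application
        "--master", master_url,   # e.g. spark://<master-host>:7077
        app_jar,
    ]
    cmd.extend(app_args)          # arguments passed to the application
    # Quote every part so spaces in arguments survive the remote shell.
    return " ".join(shlex.quote(part) for part in cmd)


print(build_spark_submit("/tmp/app.jar", "org.example.Main",
                         "spark://master:7077", ["input dir", "out"]))
```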

One last observation: currently, Spark in standalone mode (which we use
in the plugin) does not support schedulers other than FIFO when
multiple Spark applications/jobs are submitted to the cluster. Hence,
the Spark job-server could be a good place to integrate a better job
scheduler.
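
Purely as an illustration of what "better than FIFO" could mean inside a
job-server, the Python sketch below alternates between users instead of
serving jobs in strict arrival order. The round-robin policy and the
RoundRobinQueue name are assumptions made for this example; no specific
scheduling policy has been decided.

```python
from collections import OrderedDict, deque


class RoundRobinQueue:
    """Hypothetical queue that gives each user a turn, unlike plain FIFO."""

    def __init__(self):
        self._queues = OrderedDict()  # user -> deque of pending jobs

    def push(self, user, job):
        self._queues.setdefault(user, deque()).append(job)

    def pop(self):
        """Return the next (user, job) pair, or None if nothing is pending."""
        if not self._queues:
            return None
        user, jobs = next(iter(self._queues.items()))
        job = jobs.popleft()
        del self._queues[user]
        if jobs:
            # Re-insert at the back so other users get a turn first.
            self._queues[user] = jobs
        return user, job
```

With plain FIFO, a user who submits many jobs at once delays everyone
else; here a later submission from another user is served after at most
one job from each user ahead of it.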

Trevor McKay opened a pad here:
https://etherpad.openstack.org/p/sahara_spark_edp

to gather ideas and feedback. This email is based on the very
preliminary discussion that happened yesterday via IRC, email and the
above-mentioned etherpad, and aims to start a public discussion on how
to proceed.



