[openstack-dev] [Sahara] Spark plugin: EDP and Spark jobs

Trevor McKay tmckay at redhat.com
Fri Jun 6 15:54:36 UTC 2014


Thanks Daniele,

  This is a good summary (also pasted more or less on the etherpad).

> There is also a third possibility: sidestepping the job-server problem
> entirely and calling Spark commands directly on the master node of the
> cluster.

  I am starting to play around with this idea as a simple
proof-of-concept. I think it can be done along with some refactoring in
the Sahara job manager.

  I think the refactoring is workable and can give us something where
the details of "run, status, kill" are hidden behind a common
interface.  If this proves viable, we can pursue more capable Spark
job models next.
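
To make it concrete, the interface I have in mind is roughly the
following (all names here are hypothetical -- this is a sketch for
discussion, not actual Sahara code):

    class JobEngine(object):
        # Common interface hiding the details of "run, status, kill".
        # Each cluster type (Oozie-based Hadoop, Spark, ...) would
        # provide its own implementation, and the job manager would
        # only ever talk to this.

        def run_job(self, job_execution):
            """Launch the job, return an engine-specific job id."""
            raise NotImplementedError()

        def get_job_status(self, job_execution):
            """Map the engine's job state to an EDP status string."""
            raise NotImplementedError()

        def cancel_job(self, job_execution):
            """Kill a running job."""
            raise NotImplementedError()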

We shall see!  Learn by doing.  I should have a CR in a few days.

Best,

Trevor

On Fri, 2014-06-06 at 10:54 +0200, Daniele Venzano wrote:
> Dear all,
> 
> A short while ago the Spark plugin for Sahara was merged, opening up the
> possibility of deploying Spark clusters with one click from OpenStack.
> Since Spark is quite different from Hadoop, we need to make a number
> of decisions on how to implement important features, EDP in
> particular. Spark does not have a built-in job-server, and EDP needs a
> very generic, high-level interface to submit a job, check its basic
> status, and kill it.
> 
> In summary, this is our understanding of the current situation:
> 1. a quick hack is to use Oozie for application submission (this
> mimics what Cloudera did at the end of last year, when preparing to
> announce the integration of Spark in CDH)
> 2. an alternative is to use a Spark job-server, which would replace
> Oozie (there is a repo on GitHub from Ooyala that implements an
> instance of a job-server)
> 
> Here's our view on the points above:
> 1. clearly, the first approach is an "ugly" hack that creates a
> dependency on Oozie. Oozie requires MapReduce, and uses tiny map-only
> jobs to submit the parts of a larger workflow. Besides the dependency,
> this is a bazooka to kill a fly, since we are not addressing Spark
> application workflows right now
> 2. the Spark job-server idea is cleaner, but the current project from
> Ooyala supports an old version of Spark. Spark 1.0.0 (which we have
> already tested in Sahara and will commit soon) offers new ways to
> submit and package applications that can drastically simplify the
> "job-server"
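
For reference, the new submission interface in Spark 1.0.0 is the
spark-submit script; built up programmatically, as Sahara would have
to do, an invocation looks roughly like this (the class, jar, host
names and arguments below are made up for illustration):

    # Roughly what a Spark 1.0.0 submission looks like when
    # constructed programmatically; all names are illustrative.
    cmd = ["/opt/spark/bin/spark-submit",
           "--class", "org.example.WordCount",      # app entry point
           "--master", "spark://master-node:7077",  # standalone master
           "/opt/jobs/wordcount.jar",               # application jar
           "hdfs://master-node/user/input"]         # app arguments
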
> 
> As a consequence, the question is: do we contribute to that project,
> create a new one, or contribute directly to Spark?
> 
> A few more points:
> - assuming we have a working prototype of 2), we need to modify the
> Sahara setup so that it deploys, in addition to the usual suspects
> (master and slaves), one more service: the Spark job-server
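
Purely for illustration, that deployment change could amount to adding
one more process to the master's node group (the job-server process
name here is hypothetical, not an existing Sahara identifier):

    # Hypothetical node group layout with a job-server process added;
    # "master" and "slave" are the Spark plugin's existing processes,
    # "spark-jobserver" is made up for this sketch.
    node_groups = [
        {"name": "master",
         "node_processes": ["master", "spark-jobserver"]},
        {"name": "workers",
         "node_processes": ["slave"], "count": 3},
    ]
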
> 
> There is also a third possibility: sidestepping the job-server problem
> entirely and calling Spark commands directly on the master node of the
> cluster.
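
A minimal sketch of that third approach, assuming key-based ssh access
to the master node (the user name, log path and helper names are all
illustrative):

    # Run spark-submit on the master node over ssh, remember the pid,
    # and use it later for status and kill.  Not actual Sahara code.
    import subprocess

    def run_on_master(master_ip, command):
        """Run a shell command on the master node, return its output."""
        ssh = ["ssh", "hadoop@%s" % master_ip, command]
        return subprocess.check_output(ssh)

    def submit(master_ip, submit_cmd):
        # nohup + "echo $!" gives us a pid we can poll or kill later
        out = run_on_master(
            master_ip,
            "nohup %s > job.log 2>&1 & echo $!" % submit_cmd)
        return int(out.strip())

    def status(master_ip, pid):
        # "kill -0" only checks that the process still exists
        try:
            run_on_master(master_ip, "kill -0 %d" % pid)
            return "RUNNING"
        except subprocess.CalledProcessError:
            # finished, failed or killed; telling these apart would
            # require inspecting the job log
            return "DONE"

    def kill(master_ip, pid):
        run_on_master(master_ip, "kill %d" % pid)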



> One last observation: currently, Spark in standalone mode (which we
> use in the plugin) supports no scheduler other than FIFO when multiple
> Spark applications/jobs are submitted to the cluster. Hence, the Spark
> job-server could be a good place to integrate a better job
> scheduler.
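
As a side note, the usual mitigation in standalone mode is to cap the
cores each application may claim via spark.cores.max, so several FIFO
applications can at least run side by side; a minimal PySpark sketch
(the master URL and the value 4 are illustrative):

    # By default the first standalone-mode application grabs every
    # core; capping cores per application lets jobs run concurrently.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("spark://master-node:7077")
            .set("spark.cores.max", "4"))  # at most 4 cores for this app
    sc = SparkContext(conf=conf)
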
> 
> Trevor McKay opened a pad here:
> https://etherpad.openstack.org/p/sahara_spark_edp
> 
> to gather ideas and feedback. This email is based on the very
> preliminary discussion that happened yesterday via IRC, email, and the
> above-mentioned etherpad, and its objective is to start a public
> discussion on how to proceed.
> 




