Open Stack

Wed Jan 29 10:35:52 UTC 2014

Thank you for bringing this up, Trevor.

EDP is getting more diverse and it's time to change its model.
I totally agree with your proposal, with one minor comment.
Instead of the "savanna." prefix in job_configs, wouldn't it be better to make it
"edp."? I think "savanna." is too broad a word for this.
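
To make the idea concrete, here is a minimal sketch of how prefixed settings could be split out of 'configs' before the workflow is generated. It is only an illustration, not code from the review: the function name split_configs and the key "edp.java.main_class" are made up for the example, and the prefix is a parameter so it works the same way with "savanna.".

```python
EDP_PREFIX = "edp."  # assumed prefix; the original proposal uses "savanna."


def split_configs(configs, prefix=EDP_PREFIX):
    """Separate prefixed settings from the ones rendered into the Oozie workflow.

    Hypothetical helper: returns (edp_settings, workflow_configs), with the
    prefix removed from the keys of edp_settings.
    """
    edp_settings = {}
    workflow_configs = {}
    for key, value in configs.items():
        if key.startswith(prefix):
            edp_settings[key[len(prefix):]] = value
        else:
            workflow_configs[key] = value
    return edp_settings, workflow_configs


job_configs = {
    "configs": {
        "mapred.reduce.tasks": "2",
        "edp.java.main_class": "org.example.Main",  # hypothetical key name
    },
    "params": {},
    "args": [],
}

edp_settings, workflow_configs = split_configs(job_configs["configs"])
```

With something like this the stored structure keeps its three fields, so job-type-specific settings would need no database or API changes.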

And one more bureaucratic thing... I see you have already started implementing it [1],
and it is named and tracked as a new EDP workflow [2]. I think a new blueprint should be
created for this feature to track all code changes as well as docs updates.
By docs I mean the public Savanna docs about EDP, the REST API docs, and samples.

[1] https://review.openstack.org/#/c/69712
[2] https://blueprints.launchpad.net/openstack/?searchtext=edp-oozie-streaming-mapreduce

Regards,
Alexander Ignatov

On 28 Jan 2014, at 20:47, Trevor McKay <tmckay at redhat.com> wrote:

> Hello all,
> 
> In our first pass at EDP, the model for job settings was very consistent
> across all of our job types. The execution-time settings fit into this
> (superset) structure:
> 
> job_configs = {'configs': {}, # config settings for oozie and hadoop
> 	       'params': {},  # substitution values for Pig/Hive
> 	       'args': []}    # script args (Pig and Java actions)
> 
> But we have some things that don't fit (and probably more in the
> future):
> 
> 1) Java jobs have 'main_class' and 'java_opts' settings
>   Currently these are handled as additional fields added to the
> structure above.  These were the first to diverge.
> 
> 2) Streaming MapReduce (anticipated) requires mapper and reducer
> settings (different than the mapred.xxxx.class settings for
> non-streaming MapReduce)
> 
> Problems caused by adding fields
> --------------------------------
> The job_configs structure above is stored in the database. Each time we
> add a field to the structure above at the level of configs, params, and
> args, we force a change to the database tables, a migration script and a
> change to the JSON validation for the REST api.
> 
> We also cause a change for python-savannaclient and potentially other
> clients.
> 
> This kind of change seems bad.
> 
> Proposal: Borrow a page from Oozie and add "savanna." configs
> -------------------------------------------------------------
> I would like to fit divergent job settings into the structure we already
> have.  One way to do this is to leverage the 'configs' dictionary.  This
> dictionary primarily contains settings for hadoop, but there are a
> number of "oozie.xxx" settings that are passed to oozie as configs or
> set by oozie for the benefit of running apps.
> 
> What if we allow "savanna." settings to be added to configs?  If we do
> that, any and all special configuration settings for specific job types
> or subtypes can be handled with no database changes and no api changes.
> 
> Downside
> --------
> Currently, all 'configs' are rendered in the generated oozie workflow.
> The "savanna." settings would be stripped out and processed by Savanna,
> thereby changing that behavior a bit (maybe not a big deal)
> 
> We would also be mixing "savanna." configs with config_hints for jobs,
> so users would potentially see "savanna.xxxx" settings mixed with oozie
> and hadoop settings.  Again, maybe not a big deal, but it might blur the
> lines a little bit.  Personally, I'm okay with this.
> 
> Slightly different
> ------------------
> We could also add a "'savanna-configs': {}" element to job_configs to
> keep the configuration spaces separate.
> 
> But, now we would have 'savanna-configs' (or another name), 'configs',
> 'params', and 'args'.  Really? Just how many different types of values
> can we come up with? :)
> 
> I lean away from this approach.
> 
> Related: breaking up the superset
> ---------------------------------
> 
> It is also the case that not every job type has every value type.
> 
>             Configs   Params    Args
> Hive            Y         Y        N
> Pig             Y         Y        Y
> MapReduce       Y         N        N
> Java            Y         N        Y
> 
> So do we make that explicit in the docs and enforce it in the api with
> errors?
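
One possible shape for enforcing the table above in the API, as a rough sketch only. The names ALLOWED_FIELDS and check_job_configs are hypothetical and are not taken from Savanna's validation code; the mapping simply restates the Y/N table.

```python
# Which job_configs fields each job type accepts, copied from the table above.
ALLOWED_FIELDS = {
    "Hive": {"configs", "params"},
    "Pig": {"configs", "params", "args"},
    "MapReduce": {"configs"},
    "Java": {"configs", "args"},
}


def check_job_configs(job_type, job_configs):
    """Return the sorted names of non-empty fields the job type does not accept.

    Hypothetical helper; an unknown job_type raises KeyError.
    """
    allowed = ALLOWED_FIELDS[job_type]
    return sorted(
        name
        for name, value in job_configs.items()
        if value and name not in allowed
    )
```

An empty result means the request is fine; otherwise the API could answer with an error that names the offending fields.
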
> 
> Thoughts? I'm sure there are some :)
> 
> Best,
> 
> Trevor
> 
> 
> 
> 
> 
> 
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

[openstack-dev] [savanna] How to handle diverging EDP job configuration settings