Open Stack

Tue Jan 28 16:47:07 UTC 2014

Hello all,

In our first pass at EDP, the model for job settings was very consistent
across all of our job types. The execution-time settings fit into this
(superset) structure:

job_configs = {'configs': {},  # config settings for oozie and hadoop
               'params': {},   # substitution values for Pig/Hive
               'args': []}     # script args (Pig and Java actions)

But we have some things that don't fit (and probably more in the
future):

1) Java jobs have 'main_class' and 'java_opts' settings.
   Currently these are handled as additional fields added to the
   structure above.  These were the first to diverge.

2) Streaming MapReduce (anticipated) requires mapper and reducer
   settings (different from the mapred.xxxx.class settings for
   non-streaming MapReduce).

Problems caused by adding fields
--------------------------------
The job_configs structure above is stored in the database. Each time we
add a field to the structure above at the level of configs, params, and
args, we force a change to the database tables, a migration script, and
a change to the JSON validation for the REST API.

We also cause a change for python-savannaclient and potentially other
clients.

That is a lot of churn for one new setting, which seems bad.

Proposal: Borrow a page from Oozie and add "savanna." configs
-------------------------------------------------------------
I would like to fit divergent job settings into the structure we already
have.  One way to do this is to leverage the 'configs' dictionary.  This
dictionary primarily contains settings for hadoop, but there are a
number of "oozie.xxx" settings that are passed to oozie as configs or
set by oozie for the benefit of running apps.

What if we allow "savanna." settings to be added to configs?  If we do
that, any and all special configuration settings for specific job types
or subtypes can be handled with no database changes and no api changes.
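
As a rough sketch of what this could look like for a Java job: the key
names 'savanna.java.main_class' and 'savanna.java.java_opts' below are
hypothetical (only the "savanna." prefix is part of the proposal), and
the values are made up for illustration.

```python
# Hypothetical key names and values; only the "savanna." prefix comes
# from the proposal itself.
job_configs = {
    'configs': {
        'mapred.reduce.tasks': '2',  # ordinary hadoop setting
        'savanna.java.main_class': 'org.example.WordCount',
        'savanna.java.java_opts': '-Xmx512m',
    },
    'params': {},
    'args': ['input_path', 'output_path'],
}

# The top-level shape is the same as before, so nothing new has to be
# added at the level of configs, params, and args.
top_level_keys = sorted(job_configs)
```

The Java-specific values travel inside 'configs', so the stored
structure keeps exactly the three elements it has today.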

Downside
--------
Currently, all 'configs' are rendered in the generated oozie workflow.
The "savanna." settings would be stripped out and processed by Savanna,
thereby changing that behavior a bit (maybe not a big deal).
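
The stripping step could be as small as the sketch below. The function
name split_configs and the sample keys are assumptions for
illustration, not existing Savanna code.

```python
SAVANNA_PREFIX = 'savanna.'


def split_configs(configs):
    """Separate "savanna." settings from the configs rendered for oozie.

    Hypothetical helper: returns (savanna_settings, workflow_configs).
    """
    savanna_settings = {}
    workflow_configs = {}
    for key, value in configs.items():
        if key.startswith(SAVANNA_PREFIX):
            savanna_settings[key] = value
        else:
            workflow_configs[key] = value
    return savanna_settings, workflow_configs


# Sample input with one made-up "savanna." key.
savanna_settings, workflow_configs = split_configs({
    'mapred.reduce.tasks': '2',
    'oozie.use.system.libpath': 'true',
    'savanna.java.main_class': 'org.example.WordCount',
})
```

Only workflow_configs would be rendered in the generated workflow;
savanna_settings would be consumed by Savanna itself.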

We would also be mixing "savanna." configs with config_hints for jobs,
so users would potentially see "savanna.xxxx" settings mixed with oozie
and hadoop settings.  Again, maybe not a big deal, but it might blur the
lines a little bit.  Personally, I'm okay with this.

Slightly different
------------------
We could also add a "'savanna-configs': {}" element to job_configs to
keep the configuration spaces separate.

But, now we would have 'savanna-configs' (or another name), 'configs',
'params', and 'args'.  Really? Just how many different types of values
can we come up with? :)

I lean away from this approach.

Related: breaking up the superset
---------------------------------

It is also the case that not every job type has every value type.

             Configs   Params    Args
Hive            Y         Y        N
Pig             Y         Y        Y
MapReduce       Y         N        N
Java            Y         N        Y

So do we make that explicit in the docs and enforce it in the API with
errors?
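
If we did enforce it, the check could be driven directly by the table
above. This is only a sketch: the names ALLOWED_FIELDS and
check_job_configs are hypothetical, and it assumes an empty value
({} or []) counts as "not supplied".

```python
# Which value types each job type accepts, copied from the table above.
ALLOWED_FIELDS = {
    'Hive': {'configs', 'params'},
    'Pig': {'configs', 'params', 'args'},
    'MapReduce': {'configs'},
    'Java': {'configs', 'args'},
}


def check_job_configs(job_type, job_configs):
    """Raise ValueError if job_configs carries values the type can't use.

    Hypothetical helper; empty values are treated as absent.
    """
    if job_type not in ALLOWED_FIELDS:
        raise ValueError('unknown job type: %s' % job_type)
    allowed = ALLOWED_FIELDS[job_type]
    rejected = sorted(name for name, value in job_configs.items()
                      if value and name not in allowed)
    if rejected:
        raise ValueError('%s jobs do not accept: %s'
                         % (job_type, ', '.join(rejected)))


# Accepted: Pig takes all three value types.
check_job_configs('Pig', {'configs': {}, 'params': {'x': '1'},
                          'args': ['a']})
```

A MapReduce job submitted with a non-empty 'args' list would be
rejected with an error under this scheme.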

Thoughts? I'm sure there are some :)

Best,

Trevor

[openstack-dev] [savanna] How to handle diverging EDP job configuration settings