[Openstack] [Sahara] Jobs get marked as "failed" immediately on Spark cluster

Jeremy Freudberg jfreud at bu.edu
Tue Aug 9 15:15:25 UTC 2016


Hey Vitaly,

I solved the issue. As you pointed out,
https://github.com/openstack/sahara/blob/master/sahara/service/edp/job_manager.py#L124
was quite relevant.

However, you linked to the version of this file on the master branch.
The Liberty branch version looks a little different:
https://github.com/openstack/sahara/blob/stable/liberty/sahara/service/edp/job_manager.py#L115

So the fix already exists, but only in the master and Mitaka branches.
Maybe we can backport it to the Liberty EOL release?
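
For anyone who finds this thread later: the sketch below shows one way a
reason string like '__deepcopy__' can arise in Python. It is only an
illustration and not a confirmed diagnosis of the Sahara code; the thread
does not spell out what the fix changes. The class names and the
try_copy helper are hypothetical. The assumption is an object whose
__getattr__ raises KeyError for unknown names, which makes
copy.deepcopy fail while probing for a __deepcopy__ method.

```python
import copy


class KeyErrorRecord(object):
    """Hypothetical dict-backed object (not Sahara's real class)."""

    def __init__(self, **fields):
        self.__dict__['_fields'] = dict(fields)

    def __getattr__(self, name):
        # Unknown names raise KeyError, which getattr(obj, name, default)
        # does not swallow, so deepcopy's "__deepcopy__" probe fails.
        return self.__dict__.get('_fields', {})[name]


class AttributeErrorRecord(KeyErrorRecord):
    """Same object, but unknown names raise AttributeError instead."""

    def __getattr__(self, name):
        try:
            return self.__dict__.get('_fields', {})[name]
        except KeyError:
            raise AttributeError(name)


def try_copy(obj):
    """Return a log-style message describing whether deepcopy worked."""
    try:
        copy.deepcopy(obj)
    except Exception as exc:
        return "Can't run job execution (reason: %s)" % exc
    return "copied"


print(try_copy(KeyErrorRecord(status='RUNNING')))
print(try_copy(AttributeErrorRecord(status='RUNNING')))
```

In this sketch the exception is raised by the copy itself, so work
started before the copy is unaffected. That would be consistent with a
job that runs to completion while still being reported as failed.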

Thanks for your help,
Jeremy Freudberg

On Mon, Aug 8, 2016 at 6:05 PM, Vitaly Gridnev <vgridnev at mirantis.com> wrote:
> Hello.
>
> I haven't seen issues like that. Can you explain more precisely how you are
> running the job: which configs and arguments are passed, which job binaries
> are used, and so on? A minimal example that reproduces the issue would be best.
>
> Moreover, this looks like a very strange issue (if the job was executed
> successfully). It means that the launch command was successful (see [0]),
> but there are no copy operations after the launch command.
>
> [0]
> https://github.com/openstack/sahara/blob/stable/liberty/sahara/service/edp/spark/engine.py#L339
>
> On Mon, Aug 8, 2016 at 11:02 PM, Jeremy Freudberg <jfreud at bu.edu> wrote:
>>
>> Hi all, I am experiencing a strange bug running jobs on Sahara (Red
>> Hat Liberty).
>>
>> When submitting a job to a Spark 1.3.1 cluster, I get the following
>> error immediately:
>>
>> 2016-08-08 15:56:09.546 20949 WARNING sahara.service.edp.job_manager [req-fb5b4722-861a-4063-bc22-5e96b417376c ] [instance: none, job_execution: ee747ffb-9be5-45b0-aa0b-c719668a43aa] Can't run job execution (reason: '__deepcopy__')
>>
>> However, even though the job is marked as failed in Sahara API and
>> dashboard, the job still runs and succeeds on the cluster. (i.e. I see
>> the results in Swift/HDFS).
>>
>> I only experience this behavior on Spark clusters (no other plugins)
>> but it does affect all job types. (Even simple ones like Shell).
>>
>> Any help is greatly appreciated.
>>
>> Thanks,
>> Jeremy Freudberg
>>
>> _______________________________________________
>> Mailing list:
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>> Post to     : openstack at lists.openstack.org
>> Unsubscribe :
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
>
>
>
>
> --
> Best Regards,
> Vitaly Gridnev,
> Project Technical Lead of OpenStack DataProcessing Program (Sahara)
> Mirantis, Inc



