[Openstack] [Sahara] Jobs get marked as "failed" immediately on Spark cluster

Vitaly Gridnev vgridnev at mirantis.com
Tue Aug 9 17:01:19 UTC 2016


I'm not entirely sure, but the Liberty branch is currently open only for
critical and security fixes, so I don't think we can include this fix
there. Manual patching is probably the best option for you.
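
For anyone who hits the same message: the reason string in the quoted log,
'__deepcopy__' (quotes included), is what Python gives for
str(KeyError('__deepcopy__')). The thread does not say which object raised
it, so the following is only a sketch under an assumption: an
attribute-style dict wrapper whose __getattr__ raises KeyError instead of
AttributeError will make copy.deepcopy() fail in this way. AttrDict,
SafeAttrDict and deepcopy_failure_reason are hypothetical names used for
illustration, not Sahara code.

```python
import copy


class AttrDict(dict):
    """Hypothetical dict wrapper exposing keys as attributes (not Sahara code)."""

    def __getattr__(self, name):
        # Assumed bug: a missing key raises KeyError, not AttributeError.
        return self[name]


class SafeAttrDict(dict):
    """Same wrapper, but missing keys raise AttributeError as Python expects."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


def deepcopy_failure_reason(obj):
    """Return str(exception) if copy.deepcopy(obj) fails, else None."""
    try:
        copy.deepcopy(obj)
    except Exception as exc:
        return str(exc)
    return None


# deepcopy() probes for obj.__deepcopy__ with getattr(obj, name, None),
# which only swallows AttributeError, so the KeyError escapes.
print(deepcopy_failure_reason(AttrDict(a=1)))      # '__deepcopy__'
print(deepcopy_failure_reason(SafeAttrDict(a=1)))  # None
```

If the object being copied behaves like AttrDict above, the job can be
launched successfully and still be marked as failed when a later deepcopy
raises, which would match the symptoms described in the quoted report.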

On Tue, Aug 9, 2016 at 6:15 PM, Jeremy Freudberg <jfreud at bu.edu> wrote:

> Hey Vitaly,
>
> I solved the issue. As you pointed out,
> https://github.com/openstack/sahara/blob/master/sahara/service/edp/job_manager.py#L124
> was quite relevant.
>
> However, you linked to the version of this file on the master branch. The
> Liberty branch file looks a little different:
> https://github.com/openstack/sahara/blob/stable/liberty/sahara/service/edp/job_manager.py#L115
>
> So the fix already exists, but only in the master and Mitaka branches.
> Maybe we can backport this fix to the Liberty EOL release?
>
> Thanks for your help,
> Jeremy Freudberg
>
> On Mon, Aug 8, 2016 at 6:05 PM, Vitaly Gridnev <vgridnev at mirantis.com>
> wrote:
> > Hello.
> >
> > I haven't seen issues like that. Can you explain more precisely how you
> > are running the job: what configs and arguments are passed, which job
> > binaries are used, and so on? A minimal example that reproduces the
> > issue would be the best option.
> >
> > Moreover, this looks like a very strange issue (if the job was executed
> > successfully). It means that the launch command was successful (see [0]),
> > but there are no copy operations after the launch command.
> >
> > [0]
> > https://github.com/openstack/sahara/blob/stable/liberty/sahara/service/edp/spark/engine.py#L339
> >
> > On Mon, Aug 8, 2016 at 11:02 PM, Jeremy Freudberg <jfreud at bu.edu> wrote:
> >>
> >> Hi all, I am experiencing a strange bug running jobs on Sahara (Red
> >> Hat Liberty).
> >>
> >> When submitting a job to a Spark 1.3.1 cluster, I get the following
> >> error immediately:
> >>
> >> 2016-08-08 15:56:09.546 20949 WARNING sahara.service.edp.job_manager [req-fb5b4722-861a-4063-bc22-5e96b417376c ] [instance: none, job_execution: ee747ffb-9be5-45b0-aa0b-c719668a43aa] Can't run job execution (reason: '__deepcopy__')
> >>
> >> However, even though the job is marked as failed in the Sahara API and
> >> dashboard, the job still runs and succeeds on the cluster (i.e. I see
> >> the results in Swift/HDFS).
> >>
> >> I only experience this behavior on Spark clusters (no other plugins),
> >> but it does affect all job types, even simple ones like Shell.
> >>
> >> Any help is greatly appreciated.
> >>
> >> Thanks,
> >> Jeremy Freudberg
> >>
> >
> >
> >
> >
> > --
> > Best Regards,
> > Vitaly Gridnev,
> > Project Technical Lead of OpenStack DataProcessing Program (Sahara)
> > Mirantis, Inc
>



-- 
Best Regards,
Vitaly Gridnev,
Project Technical Lead of OpenStack DataProcessing Program (Sahara)
Mirantis, Inc