[openstack-dev] [TaskFlow] TaskFlow persistence: Job failure retry
Joshua Harlow
harlowja at fastmail.com
Mon Jun 6 03:56:47 UTC 2016
Cool, well, feel free to find the taskflow folks (and others) in either
#openstack-oslo or #openstack-state-management if you have any questions.
-Josh
pnkk wrote:
> I am working on an NFV orchestrator based on MANO.
>
> Regards,
> Kanthi
>
> On Thu, Jun 2, 2016 at 3:00 AM, Joshua Harlow <harlowja at fastmail.com> wrote:
>
> Interesting way to combine taskflow + celery.
>
> I didn't expect it to be used like this, but the more power to you!
>
> Taskflow itself has some similar capabilities via
> http://docs.openstack.org/developer/taskflow/workers.html#design but
> anyway what u've done is pretty neat as well.
>
> I am assuming this isn't an openstack project (due to usage of
> celery), any details on what's being worked on (am curious here)?
>
> pnkk wrote:
>
> Thanks for the nice documentation.
>
> To my knowledge, celery is widely used for distributed task processing.
> This fits our requirement perfectly: we want to return an immediate
> response to the user from our API server and run the long-running task
> in the background. Celery also gives flexibility in worker types
> (process, which can also work around GIL problems, eventlet, ...) and
> supports nice message brokers (rabbitmq, redis, ...).
>
> We used both celery and taskflow for our core processing to leverage
> the benefits of both. Taskflow provides nice primitives (execute,
> revert, pre, post stuff) which take load off the application.
>
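A minimal sketch of the execute/revert idea mentioned above, using only the Python standard library. It is not TaskFlow's API: the names `Step` and `run_linear` are hypothetical, and the sketch assumes the simplest policy (only steps that completed are reverted, in reverse order), which a real engine may handle differently.

```python
class Step:
    """A hypothetical unit of work with an execute and a revert action."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    def execute(self, log):
        if self.fail:
            raise IOError("%s failed" % self.name)
        log.append("execute:" + self.name)

    def revert(self, log):
        log.append("revert:" + self.name)


def run_linear(steps):
    """Run steps in order; on failure, revert completed steps in reverse."""
    log = []
    done = []
    try:
        for step in steps:
            step.execute(log)
            done.append(step)
    except Exception:
        # Undo the work that already succeeded, newest first.
        for step in reversed(done):
            step.revert(log)
        return False, log
    return True, log
```

The point of the pattern is that the application only declares what each step does and how to undo it; the ordering and the rollback are handled in one place.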
> As far as the actual issue is concerned, I found one way to solve it:
> using the celery "retry" option. Together with acks_late, this makes the
> application highly fault tolerant.
>
> http://docs.celeryproject.org/en/latest/faq.html#faq-acks-late-vs-retry
>
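The combination of late acknowledgement and bounded retry described above can be illustrated with a small standard-library simulation. This is not Celery code: the queue, `TransientError`, and `consume` are hypothetical stand-ins, and the sketch assumes a message is removed from the queue only after its handler succeeds or its retries are exhausted.

```python
from collections import deque


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def consume(queue, handler, max_retries=3):
    """Process messages, acknowledging each one only after it succeeds."""
    results = []
    dead = []
    while queue:
        body, attempts = queue[0]  # peek: the message stays queued
        try:
            results.append(handler(body))
        except TransientError:
            queue.popleft()
            if attempts + 1 > max_retries:
                dead.append(body)  # retries exhausted: give up
            else:
                queue.append((body, attempts + 1))  # redeliver later
        else:
            queue.popleft()  # late ack: remove only after success
    return results, dead
```

Each queue entry is a `(body, attempts)` pair, for example `deque([("job-1", 0)])`. A failure that is not a `TransientError` propagates and leaves the message at the head of the queue, which mirrors what happens when a worker dies before acknowledging.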
> Regards,
> Kanthi
>
>
> On Sat, May 28, 2016 at 1:51 AM, Joshua Harlow <harlowja at fastmail.com> wrote:
>
> Seems like u could just use
> http://docs.openstack.org/developer/taskflow/jobs.html (it appears
> that you may not be?); the job itself would, when failed, then be
> worked on by a different job consumer.
>
> Have u looked at those? It almost appears that u are using celery as
> a job distribution system (similar to the jobs.html link mentioned
> above)? Is that somewhat correct (I haven't seen anyone try this,
> wondering how u are using it and the choices that directed u to
> that, aka, am curious)?
>
> -Josh
>
> pnkk wrote:
>
> To be specific, we hit this issue when the node running our service is
> rebooted. Our solution is designed so that each and every job is a
> celery task, and inside the celery task we create a taskflow flow.
>
> We enabled acks_late in celery (which uses rabbitmq as the message
> broker), so if our service/node goes down, another healthy service can
> pick up the job and complete it. This works fine, but we just hit a rare
> case where the node was rebooted just as taskflow was updating
> something in the database.
>
> In this case, it raises an exception and the job is marked failed. Since
> it is complete (with failure), the message is removed from rabbitmq and
> no other worker is able to process it.
> Can taskflow handle such I/O errors gracefully, or should the application
> try to catch this exception? If the application has to handle it, what
> would happen to the particular database transaction that failed just
> when the node was rebooted? Who will retry this transaction?
>
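One way an application could catch such an exception itself is a small retry wrapper with backoff. The sketch below uses only the standard library; `run_with_retry` and its parameters are hypothetical, and it assumes the wrapped callable is safe to run more than once, which the sketch does not check.

```python
import time


def run_with_retry(func, attempts=3, delay=0.5, backoff=2.0,
                   retry_on=(IOError,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with growing delays."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(delay)
            delay *= backoff
```

The callable passed in would be whatever runs the flow in the application (a hypothetical `run_flow` function, say). If every attempt fails, the last exception is re-raised, so the surrounding job can still be marked failed or handed to another worker.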
> Thanks,
> Kanthi
>
> On Fri, May 27, 2016 at 5:39 PM, pnkk <pnkk2016 at gmail.com> wrote:
>
> Hi,
>
> When the taskflow engine was executing a job, the execution failed due to
> an IO error (traceback pasted below).
>
> 2016-05-25 19:45:21.717 7119 ERROR
> taskflow.engines.action_engine.engine 127.0.1.1
> [-] Engine
> execution has failed, something bad must of
> happened (last 10
> machine transitions were [('SCHEDULING', 'WAITING'),
> ('WAITING',
> 'ANALYZING'), ('ANALYZING', 'SCHEDULING'), ('SCHEDULING',
> 'WAITING'), ('WAITING', 'ANALYZING'), ('ANALYZING', 'SCHEDULING'),
> ('SCHEDULING', 'WAITING'), ('WAITING', 'ANALYZING'),
> ('ANALYZING',
> 'GAME_OVER'), ('GAME_OVER', 'FAILURE')])
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine Traceback (most
> recent call last):
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/engine.py",
> line 269, in run_iter
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine
> failure.Failure.reraise_if_any(memory.failures)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/types/failure.py",
> line 336, in reraise_if_any
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine
> failures[0].reraise()
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/types/failure.py",
> line 343, in reraise
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine
> six.reraise(*self._exc_info)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/scheduler.py",
> line 94, in schedule
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine
> futures.add(scheduler.schedule(atom))
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/scheduler.py",
> line 67, in schedule
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine return
> self._task_action.schedule_execution(task)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/actions/task.py",
> line 99, in schedule_execution
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine
> self.change_state(task,
> states.RUNNING, progress=0.0)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/actions/task.py",
> line 67, in change_state
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine
> self._storage.set_atom_state(task.name, state)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/fasteners/lock.py",
> line 85, in wrapper
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine return
> f(self, *args,
> **kwargs)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/storage.py",
> line 486, in set_atom_state
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine
> self._with_connection(self._save_atom_detail,
> source, clone)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/storage.py",
> line 341, in _with_connection
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine return
> functor(conn,
> *args, **kwargs)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/storage.py",
> line 471, in _save_atom_detail
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine
> original_atom_detail.update(conn.update_atom_details(atom_detail))
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/persistence/backends/impl_sqlalchemy.py",
> line 427, in update_atom_details
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine row =
> conn.execute(q).first()
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py",
> line 914, in execute
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine return
> meth(self,
> multiparams, params)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py",
> line 323, in _execute_on_connection
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine return
> connection._execute_clauseelement(self,
> multiparams, params)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py",
> line 1003, in _execute_clauseelement
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine
> inline=len(distilled_params) > 1)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "<string>",
> line 1, in
> <lambda>
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py",
> line 494, in compile
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine return
> self._compiler(dialect, bind=bind, **kw)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py",
> line 500, in _compiler
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine return
> dialect.statement_compiler(dialect, self, **kw)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py",
> line 392, in __init__
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine
> Compiled.__init__(self,
> dialect, statement, **kwargs)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py",
> line 190, in __init__
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine
> self.string =
> self.process(self.statement, **compile_kwargs)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py",
> line 213, in process
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine return
> obj._compiler_dispatch(self, **kwargs)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/visitors.py",
> line 81, in _compiler_dispatch
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine return
> meth(self,
> **kw)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py",
> line 1579, in visit_select
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine for
> name, column in
> select._columns_plus_names
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py",
> line 1347, in _label_select_column
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine
> add_to_result_map=add_to_result_map
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/celery/apps/worker.py",
> line 288, in _handle_request
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine
> safe_say('worker: {0}
> shutdown (MainProcess)'.format(how))
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/celery/apps/worker.py",
> line 73, in safe_say
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine
> print('\n{0}'.format(msg),
> file=sys.__stderr__)
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine IOError:
> [Errno 5]
> Input/output error
> 2016-05-25 19:45:21.717 7119 TRACE
> taskflow.engines.action_engine.engine
>
> There could be a transient network issue which prevents taskflow
> from reaching the mysql node.
> Can you please suggest a graceful way of handling it and continuing
> to process the execution?
>
> Thanks,
> Kanthi
>
>
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev