[openstack-dev] [TaskFlow] TaskFlow persistence: Job failure retry

pnkk pnkk2016 at gmail.com
Fri May 27 12:41:39 UTC 2016


To be specific, we hit this issue when the node running our service is
rebooted.
Our solution is designed in a way that each and every job is a celery task
and inside celery task, we create taskflow flow.

We enabled late_acks in celery(uses rabbitmq as message broker), so if our
service/node goes down, other healthy service can pick the job and
completes it.
This works fine, but we just hit this rare case where the node was rebooted
just when taskflow is updating something to the database.

In this case, it raises an exception and the job is marked failed. Since it
is complete(with failure), message is removed from the rabbitmq and other
worker would not be able to process it.
Can taskflow handle such I/O errors gracefully or should application try to
catch this exception? If application has to handle it what would happen to
that particular database transaction which failed just when the node is
rebooted? Who will retry this transaction?

Thanks,
Kanthi

On Fri, May 27, 2016 at 5:39 PM, pnkk <pnkk2016 at gmail.com> wrote:

> Hi,
>
> When taskflow engine is executing a job, the execution failed due to IO
> error(traceback pasted below).
>
> 2016-05-25 19:45:21.717 7119 ERROR taskflow.engines.action_engine.engine
> 127.0.1.1 [-]  Engine execution has failed, something bad must of happened
> (last 10 machine transitions were [('SCHEDULING', 'WAITING'), ('WAITING',
> 'ANALYZING'), ('ANALYZING', 'SCHEDULING'), ('SCHEDULING', 'WAITING'),
> ('WAITING', 'ANALYZING'), ('ANALYZING', 'SCHEDULING'), ('SCHEDULING',
> 'WAITING'), ('WAITING', 'ANALYZING'), ('ANALYZING', 'GAME_OVER'),
> ('GAME_OVER', 'FAILURE')])
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> Traceback (most recent call last):
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/engine.py",
> line 269, in run_iter
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   failure.Failure.reraise_if_any(memory.failures)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/types/failure.py",
> line 336, in reraise_if_any
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   failures[0].reraise()
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/types/failure.py",
> line 343, in reraise
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   six.reraise(*self._exc_info)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/scheduler.py",
> line 94, in schedule
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   futures.add(scheduler.schedule(atom))
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/scheduler.py",
> line 67, in schedule
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   return self._task_action.schedule_execution(task)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/actions/task.py",
> line 99, in schedule_execution
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   self.change_state(task, states.RUNNING, progress=0.0)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/actions/task.py",
> line 67, in change_state
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   self._storage.set_atom_state(task.name, state)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/fasteners/lock.py",
> line 85, in wrapper
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   return f(self, *args, **kwargs)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/storage.py",
> line 486, in set_atom_state
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   self._with_connection(self._save_atom_detail, source, clone)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/storage.py",
> line 341, in _with_connection
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   return functor(conn, *args, **kwargs)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/storage.py",
> line 471, in _save_atom_detail
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   original_atom_detail.update(conn.update_atom_details(atom_detail))
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/persistence/backends/impl_sqlalchemy.py",
> line 427, in update_atom_details
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   row = conn.execute(q).first()
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py",
> line 914, in execute
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   return meth(self, multiparams, params)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py",
> line 323, in _execute_on_connection
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   return connection._execute_clauseelement(self, multiparams, params)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py",
> line 1003, in _execute_clauseelement
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   inline=len(distilled_params) > 1)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File "<string>", line 1, in <lambda>
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py",
> line 494, in compile
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   return self._compiler(dialect, bind=bind, **kw)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py",
> line 500, in _compiler
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   return dialect.statement_compiler(dialect, self, **kw)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py",
> line 392, in __init__
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   Compiled.__init__(self, dialect, statement, **kwargs)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py",
> line 190, in __init__
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   self.string = self.process(self.statement, **compile_kwargs)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py",
> line 213, in process
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   return obj._compiler_dispatch(self, **kwargs)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/visitors.py",
> line 81, in _compiler_dispatch
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   return meth(self, **kw)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py",
> line 1579, in visit_select
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   for name, column in select._columns_plus_names
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py",
> line 1347, in _label_select_column
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   add_to_result_map=add_to_result_map
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/celery/apps/worker.py",
> line 288, in _handle_request
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   safe_say('worker: {0} shutdown (MainProcess)'.format(how))
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> File
> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/celery/apps/worker.py",
> line 73, in safe_say
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>   print('\n{0}'.format(msg), file=sys.__stderr__)
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
> IOError: [Errno 5] Input/output error
> 2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>
> There could be a transient network issue which prevents taskflow from
> reaching the mysql node.
> Can you please suggest a graceful way of handling it and continue
> processing the execution?
>
> Thanks,
> Kanthi
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstack.org/pipermail/openstack-dev/attachments/20160527/00e612cd/attachment.html>


More information about the OpenStack-dev mailing list