[openstack-dev] [TaskFlow] TaskFlow persistence: Job failure retry

pnkk pnkk2016 at gmail.com
Wed Jun 1 20:45:51 UTC 2016


Thanks for the nice documentation.

To my knowledge celery is widely used for distributed task processing. This
fits our requirement perfectly where we want to return immediate response
to the user from our API server and run long running task in background.
Celery also gives flexibility with the worker types(process(can overcome
GIL problems too)/evetlet...) and it also provides nice message
brokers(rabbitmq,redis...)

We used both celery and taskflow for our core processing to leverage the
benefits of both. Taskflow provides nice primitives like(execute, revert,
pre,post stuf) which takes off the load from the application.

As far as the actual issue is concerned, I found one way to solve it by
using celery "retry" option. This along with late_acks makes the
application highly fault tolerant.

http://docs.celeryproject.org/en/latest/faq.html#faq-acks-late-vs-retry

Regards,
Kanthi


On Sat, May 28, 2016 at 1:51 AM, Joshua Harlow <harlowja at fastmail.com>
wrote:

> Seems like u could just use
> http://docs.openstack.org/developer/taskflow/jobs.html (it appears that
> you may not be?); the job itself would when failed be then worked on by a
> different job consumer.
>
> Have u looked at those? It almost appears that u are using celery as a job
> distribution system (similar to the jobs.html link mentioned above)? Is
> that somewhat correct (I haven't seen anyone try this, wondering how u are
> using it and the choices that directed u to that, aka, am curious)?
>
> -Josh
>
> pnkk wrote:
>
>> To be specific, we hit this issue when the node running our service is
>> rebooted.
>> Our solution is designed in a way that each and every job is a celery
>> task and inside celery task, we create taskflow flow.
>>
>> We enabled late_acks in celery(uses rabbitmq as message broker), so if
>> our service/node goes down, other healthy service can pick the job and
>> completes it.
>> This works fine, but we just hit this rare case where the node was
>> rebooted just when taskflow is updating something to the database.
>>
>> In this case, it raises an exception and the job is marked failed. Since
>> it is complete(with failure), message is removed from the rabbitmq and
>> other worker would not be able to process it.
>> Can taskflow handle such I/O errors gracefully or should application try
>> to catch this exception? If application has to handle it what would
>> happen to that particular database transaction which failed just when
>> the node is rebooted? Who will retry this transaction?
>>
>> Thanks,
>> Kanthi
>>
>> On Fri, May 27, 2016 at 5:39 PM, pnkk <pnkk2016 at gmail.com
>> <mailto:pnkk2016 at gmail.com>> wrote:
>>
>>     Hi,
>>
>>     When taskflow engine is executing a job, the execution failed due to
>>     IO error(traceback pasted below).
>>
>>     2016-05-25 19:45:21.717 7119 ERROR
>>     taskflow.engines.action_engine.engine 127.0.1.1 [-]  Engine
>>     execution has failed, something bad must of happened (last 10
>>     machine transitions were [('SCHEDULING', 'WAITING'), ('WAITING',
>>     'ANALYZING'), ('ANALYZING', 'SCHEDULING'), ('SCHEDULING',
>>     'WAITING'), ('WAITING', 'ANALYZING'), ('ANALYZING', 'SCHEDULING'),
>>     ('SCHEDULING', 'WAITING'), ('WAITING', 'ANALYZING'), ('ANALYZING',
>>     'GAME_OVER'), ('GAME_OVER', 'FAILURE')])
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine Traceback (most recent call
>> last):
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/engine.py",
>>     line 269, in run_iter
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine
>>     failure.Failure.reraise_if_any(memory.failures)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/types/failure.py",
>>     line 336, in reraise_if_any
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     failures[0].reraise()
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/types/failure.py",
>>     line 343, in reraise
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     six.reraise(*self._exc_info)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/scheduler.py",
>>     line 94, in schedule
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine
>>     futures.add(scheduler.schedule(atom))
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/scheduler.py",
>>     line 67, in schedule
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     return
>>     self._task_action.schedule_execution(task)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/actions/task.py",
>>     line 99, in schedule_execution
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     self.change_state(task,
>>     states.RUNNING, progress=0.0)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/actions/task.py",
>>     line 67, in change_state
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine
>>     self._storage.set_atom_state(task.name <http://task.name>, state)
>>
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/fasteners/lock.py",
>>     line 85, in wrapper
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     return f(self, *args,
>>     **kwargs)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/storage.py",
>>     line 486, in set_atom_state
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine
>>     self._with_connection(self._save_atom_detail, source, clone)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/storage.py",
>>     line 341, in _with_connection
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     return functor(conn,
>>     *args, **kwargs)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/storage.py",
>>     line 471, in _save_atom_detail
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine
>>     original_atom_detail.update(conn.update_atom_details(atom_detail))
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/persistence/backends/impl_sqlalchemy.py",
>>     line 427, in update_atom_details
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     row =
>> conn.execute(q).first()
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py",
>>     line 914, in execute
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     return meth(self,
>>     multiparams, params)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py",
>>     line 323, in _execute_on_connection
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     return
>>     connection._execute_clauseelement(self, multiparams, params)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py",
>>     line 1003, in _execute_clauseelement
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine
>>     inline=len(distilled_params) > 1)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File "<string>", line 1, in
>>     <lambda>
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py",
>>     line 494, in compile
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     return
>>     self._compiler(dialect, bind=bind, **kw)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py",
>>     line 500, in _compiler
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     return
>>     dialect.statement_compiler(dialect, self, **kw)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py",
>>     line 392, in __init__
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     Compiled.__init__(self,
>>     dialect, statement, **kwargs)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py",
>>     line 190, in __init__
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     self.string =
>>     self.process(self.statement, **compile_kwargs)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py",
>>     line 213, in process
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     return
>>     obj._compiler_dispatch(self, **kwargs)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/visitors.py",
>>     line 81, in _compiler_dispatch
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     return meth(self, **kw)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py",
>>     line 1579, in visit_select
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     for name, column in
>>     select._columns_plus_names
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py",
>>     line 1347, in _label_select_column
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine
>>     add_to_result_map=add_to_result_map
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/celery/apps/worker.py",
>>     line 288, in _handle_request
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     safe_say('worker: {0}
>>     shutdown (MainProcess)'.format(how))
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine   File
>>
>> "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/celery/apps/worker.py",
>>     line 73, in safe_say
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine     print('\n{0}'.format(msg),
>>     file=sys.__stderr__)
>>     2016-05-25 19:45:21.717 7119 TRACE
>>     taskflow.engines.action_engine.engine IOError: [Errno 5]
>>     Input/output error
>>     2016-05-25 19:45:21.717 7119 TRACE
>> taskflow.engines.action_engine.engine
>>
>>     There could be a transient network issue which prevents taskflow
>>     from reaching the mysql node.
>>     Can you please suggest a graceful way of handling it and continue
>>     processing the execution?
>>
>>     Thanks,
>>     Kanthi
>>
>>
>> __________________________________________________________________________
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe:
>> OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstack.org/pipermail/openstack-dev/attachments/20160602/b0e8f0be/attachment.html>


More information about the OpenStack-dev mailing list