[openstack-dev] [TaskFlow] TaskFlow persistence: Job failure retry

Joshua Harlow harlowja at fastmail.com
Mon Jun 6 03:56:47 UTC 2016


Cool, well, feel free to find the taskflow folks (and others) in either
#openstack-oslo or #openstack-state-management if you have any questions.

-Josh

pnkk wrote:
> I am working on an NFV orchestrator based on MANO.
>
> Regards,
> Kanthi
>
> On Thu, Jun 2, 2016 at 3:00 AM, Joshua Harlow <harlowja at fastmail.com> wrote:
>
>     Interesting way to combine taskflow + celery.
>
>     I didn't expect it to be used like this, but the more power to you!
>
>     Taskflow itself has some similar capabilities via
>     http://docs.openstack.org/developer/taskflow/workers.html#design but
>     anyway what u've done is pretty neat as well.
>
>     I am assuming this isn't an openstack project (due to usage of
>     celery), any details on what's being worked on (am curious here)?
>
>     pnkk wrote:
>
>         Thanks for the nice documentation.
>
>         To my knowledge, celery is widely used for distributed task
>         processing. This fits our requirement perfectly: we want to
>         return an immediate response to the user from our API server and
>         run the long-running task in the background. Celery also gives
>         flexibility with the worker types (process, which can overcome
>         GIL problems too, eventlet, ...) and it also provides nice
>         message brokers (rabbitmq, redis, ...).
>
>         We used both celery and taskflow for our core processing to
>         leverage the benefits of both. Taskflow provides nice primitives
>         (execute, revert, pre, post stuff) which take load off the
>         application.
>
>         As far as the actual issue is concerned, I found one way to
>         solve it by using the celery "retry" option. This, along with
>         acks_late, makes the application highly fault tolerant.
>
>         http://docs.celeryproject.org/en/latest/faq.html#faq-acks-late-vs-retry
>
>         Regards,
>         Kanthi
>
>
>         On Sat, May 28, 2016 at 1:51 AM, Joshua Harlow
>         <harlowja at fastmail.com> wrote:
>
>              Seems like u could just use
>              http://docs.openstack.org/developer/taskflow/jobs.html (it
>              appears that you may not be?); a failed job would then be
>              worked on by a different job consumer.
>
>              Have u looked at those? It almost appears that u are using
>              celery as a job distribution system (similar to the jobs.html
>              link mentioned above)? Is that somewhat correct (I haven't
>              seen anyone try this, wondering how u are using it and the
>              choices that directed u to that, aka, am curious)?
>
>              -Josh
>
>              pnkk wrote:
>
>                  To be specific, we hit this issue when the node running our
>                  service was rebooted.
>                  Our solution is designed so that each job is a celery task,
>                  and inside the celery task we create a taskflow flow.
>
>                  We enabled acks_late in celery (which uses rabbitmq as the
>                  message broker), so if our service/node goes down, another
>                  healthy service can pick up the job and complete it.
>                  This works fine, but we just hit a rare case where the node
>                  was rebooted just as taskflow was updating something in the
>                  database.
>
>                  In this case, it raises an exception and the job is marked
>                  failed. Since it is complete (with failure), the message is
>                  removed from rabbitmq and no other worker is able to
>                  process it.
>                  Can taskflow handle such I/O errors gracefully, or should the
>                  application try to catch this exception? If the application
>                  has to handle it, what happens to the particular database
>                  transaction that failed just as the node was rebooted? Who
>                  will retry this transaction?
>
>                  Thanks,
>                  Kanthi
>
>                  On Fri, May 27, 2016 at 5:39 PM, pnkk
>                  <pnkk2016 at gmail.com> wrote:
>
>                       Hi,
>
>                       While the taskflow engine was executing a job, the
>                       execution failed due to an I/O error (traceback pasted
>                       below).
>
>                       2016-05-25 19:45:21.717 7119 ERROR taskflow.engines.action_engine.engine 127.0.1.1 [-]  Engine execution has failed, something bad must of happened (last 10 machine transitions were [('SCHEDULING', 'WAITING'), ('WAITING', 'ANALYZING'), ('ANALYZING', 'SCHEDULING'), ('SCHEDULING', 'WAITING'), ('WAITING', 'ANALYZING'), ('ANALYZING', 'SCHEDULING'), ('SCHEDULING', 'WAITING'), ('WAITING', 'ANALYZING'), ('ANALYZING', 'GAME_OVER'), ('GAME_OVER', 'FAILURE')])
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine Traceback (most recent call last):
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/engine.py", line 269, in run_iter
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     failure.Failure.reraise_if_any(memory.failures)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/types/failure.py", line 336, in reraise_if_any
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     failures[0].reraise()
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/types/failure.py", line 343, in reraise
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     six.reraise(*self._exc_info)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/scheduler.py", line 94, in schedule
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     futures.add(scheduler.schedule(atom))
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/scheduler.py", line 67, in schedule
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     return self._task_action.schedule_execution(task)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/actions/task.py", line 99, in schedule_execution
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     self.change_state(task, states.RUNNING, progress=0.0)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/engines/action_engine/actions/task.py", line 67, in change_state
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     self._storage.set_atom_state(task.name, state)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/fasteners/lock.py", line 85, in wrapper
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     return f(self, *args, **kwargs)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/storage.py", line 486, in set_atom_state
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     self._with_connection(self._save_atom_detail, source, clone)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/storage.py", line 341, in _with_connection
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     return functor(conn, *args, **kwargs)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/storage.py", line 471, in _save_atom_detail
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     original_atom_detail.update(conn.update_atom_details(atom_detail))
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/taskflow/persistence/backends/impl_sqlalchemy.py", line 427, in update_atom_details
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     row = conn.execute(q).first()
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     return meth(self, multiparams, params)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     return connection._execute_clauseelement(self, multiparams, params)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1003, in _execute_clauseelement
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     inline=len(distilled_params) > 1)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "<string>", line 1, in <lambda>
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 494, in compile
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     return self._compiler(dialect, bind=bind, **kw)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 500, in _compiler
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     return dialect.statement_compiler(dialect, self, **kw)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 392, in __init__
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     Compiled.__init__(self, dialect, statement, **kwargs)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 190, in __init__
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     self.string = self.process(self.statement, **compile_kwargs)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 213, in process
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     return obj._compiler_dispatch(self, **kwargs)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/visitors.py", line 81, in _compiler_dispatch
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     return meth(self, **kw)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1579, in visit_select
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     for name, column in select._columns_plus_names
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/sqlalchemy/sql/compiler.py", line 1347, in _label_select_column
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     add_to_result_map=add_to_result_map
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/celery/apps/worker.py", line 288, in _handle_request
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     safe_say('worker: {0} shutdown (MainProcess)'.format(how))
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine   File "/opt/nso/nso-1.1223-default/nfvo-0.8.0.dev1438/.venv/local/lib/python2.7/site-packages/celery/apps/worker.py", line 73, in safe_say
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine     print('\n{0}'.format(msg), file=sys.__stderr__)
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine IOError: [Errno 5] Input/output error
>                       2016-05-25 19:45:21.717 7119 TRACE taskflow.engines.action_engine.engine
>
>                       There could be a transient network issue which prevents
>                       taskflow from reaching the mysql node.
>                       Can you please suggest a graceful way of handling it and
>                       continuing the execution?
>
>                       Thanks,
>                       Kanthi
>
>
>
>         __________________________________________________________________________
>                  OpenStack Development Mailing List (not for usage questions)
>                  Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>         http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

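The retry approach discussed in this thread (catch a transient I/O error and
run the work again instead of marking the job failed) can be sketched in plain
Python. This is a minimal illustration only: run_with_retry and FlakyJob are
hypothetical names, not TaskFlow or Celery APIs, and it assumes the wrapped
work is safe to run more than once.

```python
import time


def run_with_retry(func, max_retries=3, delay=0.0, transient=(IOError,)):
    """Call func(); on a transient error, retry up to max_retries more times.

    Hypothetical helper for illustration, not part of TaskFlow or Celery.
    """
    attempt = 0
    while True:
        try:
            return func()
        except transient:
            if attempt >= max_retries:
                # Out of retries: let the caller see the failure.
                raise
            attempt += 1
            time.sleep(delay * attempt)  # simple linear backoff


class FlakyJob(object):
    """Stand-in for a flow run that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            # Same error number as in the traceback quoted above.
            raise IOError(5, "Input/output error")
        return "done"


job = FlakyJob(failures=2)
result = run_with_retry(job, max_retries=3)
```

Here the job raises IOError twice and succeeds on the third call. Once
max_retries is exhausted the exception propagates to the caller, which is the
point where a broker-level mechanism (such as the celery "retry" option
mentioned above) would have to take over.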

