[openstack-dev] [Neutron] Introducing task oriented workflows

Andrew Laski andrew.laski at rackspace.com
Tue Jun 3 14:27:34 UTC 2014

On 05/22/2014 08:16 PM, Nachi Ueno wrote:
> Hi Salvatore
> Thank you for posting this.
> IMO, this topic shouldn't be limited to Neutron only.
> Users want a consistent API across OpenStack projects, right?
> In Nova, a server has a task_state, so Neutron should do it the same way.

We're moving away from the simple task_state field in Nova towards a 
more comprehensive task model.  See 
https://review.openstack.org/#/c/86938/ for the nova-spec around this.

> 2014-05-22 15:34 GMT-07:00 Salvatore Orlando <sorlando at nicira.com>:
>> As most of you probably know already, this is one of the topics discussed
>> during the Juno summit [1].
>> I would like to kick off the discussion in order to move towards a concrete
>> design.
>> Preamble: Considering the meat that's already on the plate for Juno, I'm not
>> advocating that whatever comes out of this discussion should be put on the
>> Juno roadmap. However, preparation (or yak shaving) activities that should
>> be identified as pre-requisite might happen during the Juno time frame
>> assuming that they won't interfere with other critical or high priority
>> activities.
>> This is also a very long post; the TL;DR summary is that I would like to
>> explore task-oriented communication with the backend and how it should be
>> reflected in the API - gauging how the community feels about this, and
>> collecting feedback regarding design, constructs, and related
>> tools/techniques/technologies.
>> At the summit a broad range of items were discussed during the session, and
>> most of them have been reported in the etherpad [1].
>> First, I think it would be good to clarify whether we're advocating a
>> task-based API, workflow-oriented operation processing, or both.
>> --> About a task-based API
>> In a task-based API, most PUT/POST API operations would return tasks rather
>> than neutron resources, and users of the API will interact directly with
>> tasks.
>> I put an example in [2] to avoid cluttering this post with too much text.
>> As the API operation simply launches a task, the database state won't be
>> updated until the task is completed.
>> Needless to say, this would be a radical change to Neutron's API; it should
>> be carefully evaluated and not considered for the v2 API.
>> Even if it is easy to see that this approach has a few benefits, I
>> don't think this will improve usability of the API at all. Indeed this will
>> limit the ability to operate on a resource while a task is in execution on
>> it, and will also require neutron API users to change the paradigm they use
>> to interact with the API; not to mention the fact that it would look
>> weird if neutron is the only API endpoint in OpenStack operating in this
>> way.
>> For the Neutron API, I think that its operations should still be
>> manipulating the database state, and possibly return immediately after that
>> (*) - a task, or better said a workflow, will then be started, executed
>> asynchronously, and update the resource status on completion.
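
A minimal sketch of the pattern described above: the API call records the
desired state, returns immediately, and a workflow updates the resource
status when it finishes. Everything here is an assumption for illustration:
the in-memory DB dict, the BUILD/ACTIVE/ERROR values, and the
backend_configure() helper are hypothetical stand-ins, not Neutron code.

```python
import threading

# Hypothetical in-memory stand-in for the Neutron database.
DB = {}


def backend_configure(port):
    """Placeholder for the real backend call; raises on failure."""
    if port.get("fail"):
        raise RuntimeError("backend rejected the request")


def _workflow(port_id):
    """Runs asynchronously and records the outcome on the resource."""
    port = DB[port_id]
    try:
        backend_configure(port)
        port["status"] = "ACTIVE"
    except Exception as exc:
        # The DB row is kept; only the status reflects the failure.
        port["status"] = "ERROR"
        port["error"] = str(exc)


def create_port(port_id, **attrs):
    """Store the desired state, start the workflow, return at once."""
    DB[port_id] = dict(attrs, id=port_id, status="BUILD")
    snapshot = dict(DB[port_id])  # what the API caller gets back
    worker = threading.Thread(target=_workflow, args=(port_id,))
    worker.start()
    return snapshot, worker
```

The caller sees the resource in a transitional status and polls (or is
notified) for the final one; the worker handle is returned here only so the
example can be waited on.
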
>> --> On workflow-oriented operations
>> The benefits when it comes to easily controlling operations and
>> ensuring consistency in case of failures are obvious. For what it's worth, I
>> have been experimenting with introducing this kind of capability in the NSX
>> plugin in the past few months. I've been using celery as a task queue, and
>> writing the task management code from scratch - only to realize that the
>> same features I was implementing are already supported by taskflow.
>> I think that all parts of the Neutron API can greatly benefit from introducing a
>> flow-based approach.
>> Some examples:
>> - pre/post commit operations in the ML2 plugin can be orchestrated a lot
>> better as a workflow, articulating operations on the various drivers in a
>> graph
>> - operations spanning multiple plugins (e.g. add router interface) could be
>> simplified using clearly defined tasks for the L2 and L3 parts
>> - it would finally be possible to properly manage resources' "operational
>> status", as well as knowing whether the actual configuration of the backend
>> matches the database configuration
>> - synchronous plugins might be converted into asynchronous ones, thus improving
>> their API throughput
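
To make the "add router interface" example concrete, here is a rough sketch
of a linear flow with execute/revert steps for the L2 and L3 parts. This is
a hand-rolled illustration of the idea, not taskflow's actual API; the Task
class, run_flow(), and the two task names are all hypothetical.

```python
class Task:
    """Hypothetical unit of work with a compensating action."""

    def execute(self, ctx):
        raise NotImplementedError

    def revert(self, ctx):
        pass


class CreatePort(Task):
    """L2 part: create the port that backs the router interface."""

    def execute(self, ctx):
        ctx["log"].append("l2: port created")

    def revert(self, ctx):
        ctx["log"].append("l2: port deleted")


class AttachRouterInterface(Task):
    """L3 part: attach the port to the router."""

    def execute(self, ctx):
        if ctx.get("l3_fails"):
            raise RuntimeError("l3 backend error")
        ctx["log"].append("l3: interface attached")

    def revert(self, ctx):
        ctx["log"].append("l3: interface detached")


def run_flow(tasks, ctx):
    """Run tasks in order; on failure, revert completed ones in reverse."""
    done = []
    try:
        for task in tasks:
            task.execute(ctx)
            done.append(task)
    except Exception:
        for task in reversed(done):
            task.revert(ctx)
        raise
```

If the L3 step fails, only the L2 step that already completed is undone,
which is the consistency property the workflow approach is after.
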
>> Now, the caveats:
>> - during the sessions it was correctly pointed out that special care is
>> required with multiple producers (i.e. api servers) as workflows should
>> always be executed in the correct order
>> - it is probably advisable to serialize workflows operating on the same
>> resource; this might lead to unexpected situations (potentially to
>> deadlocks) with workflows operating on multiple resources
>> - if the API is asynchronous, and multiple workflows might be queued or in
>> execution at a given time, rolling back the DB operation on failures is
>> probably not advisable (it would not be advisable anyway in any asynchronous
>> framework). If the API instead stays synchronous the revert action for a
>> failed task might also restore the db state for a resource; but I think that
>> keeping the API synchronous misses the point of this whole work a bit - feel
>> free to show your disagreement here!
>> - some neutron workflows are actually initiated by agents; this is the case,
>> for instance, of the workflow for doing initial L2 and security group
>> configuration for a port.
>> - it's going to be a lot of work, and we need to devise a strategy to either
>> roll these changes into the existing plugins or just decide that future v3
>> plugins will use it.
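
On the serialization caveat: one common way to avoid deadlocks when a
workflow touches several resources is to take per-resource locks in a fixed
global order. The sketch below assumes a single process (with multiple API
servers a distributed lock would be needed instead), and run_serialized()
and its helpers are hypothetical names, not anything proposed in the thread.

```python
import threading
from collections import defaultdict
from contextlib import ExitStack

# One lock per resource id, created on first use.
_locks = defaultdict(threading.Lock)
_locks_guard = threading.Lock()


def _lock_for(resource_id):
    """Return the lock for a resource, creating it atomically."""
    with _locks_guard:
        return _locks[resource_id]


def run_serialized(resource_ids, workflow):
    """Run workflow() while holding the locks of all its resources.

    Locks are always acquired in sorted order, so two workflows that
    share resources cannot each hold a lock the other is waiting for.
    """
    with ExitStack() as stack:
        for resource_id in sorted(set(resource_ids)):
            stack.enter_context(_lock_for(resource_id))
        return workflow()
```

Two workflows listing the same resources in opposite order, for example
["router-1", "port-1"] and ["port-1", "router-1"], end up taking the locks
in the same sequence and simply run one after the other.
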
>> From the implementation side, I've done a bit of research and task queues
>> like celery only implement half of what is needed; conversely I have not
>> been able to find a workflow manager, at least in the python world, as
>> complete and suitable as taskflow.
>> So my preference would obviously be to use it, and contribute to it should we
>> realize Neutron needs some changes to suit its needs. Growing something
>> neutron-specific in tree is something I'd rule out.
>> (*) This is a bit different from what many plugins do, as they execute
>> requests synchronously and return only once the backend request is
>> completed.
>> --> Tasks and the API
>> The etherpad [1] contains a lot of interesting notes on this topic.
>> One important item is to understand how tasks affect the resource's status
>> to indicate their completion or failure. So far Neutron resource status
>> pretty much expresses its "fabric" status. For instance a port is "UP" if
>> it's been wired by the OVS agent; it often does not tell us whether the
>> actual resource configuration is exactly the desired one in the database.
>> For instance, if the OVS agent fails to apply security groups to a port, the
>> port stays "ACTIVE" and the user might never know there was an error and the
>> actual state diverged from the desired one.
>> It is therefore important to allow users to know whether the backend state
>> is in sync with the db; tools like taskflow will be really helpful to this
>> end.
>> However, how should this be represented? The main options are to either have
>> a new attribute describing the resource sync state, or to extend the
>> semantics of the current status attribute to also include resource sync
>> state. I've put some ramblings on the subject in the etherpad [3].
>> Still, it has been correctly pointed out that it might not be enough to know
>> that a resource is out of sync; it is also good to know which operation
>> exactly failed. This is where somehow exposing tasks through the API might
>> come in handy.
>> For instance one could do something like:
>> GET /tasks?resource_id=<res_id>&task_state=FAILED
>> to get failure details for a given resource.
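
As a rough illustration of how a query like the one above could be served,
here is a sketch that filters task records by the query-string parameters.
The record layout, the sample data, and list_tasks() are assumptions made
for the example; only the resource_id and task_state parameter names come
from the request shown above.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical task records; the field names are assumptions.
TASKS = [
    {"id": "t1", "resource_id": "port-1", "task_state": "COMPLETED",
     "detail": "port wired"},
    {"id": "t2", "resource_id": "port-1", "task_state": "FAILED",
     "detail": "security group rules not applied"},
    {"id": "t3", "resource_id": "port-2", "task_state": "FAILED",
     "detail": "backend timeout"},
]


def list_tasks(url, tasks=TASKS):
    """Return the tasks matching every key=value pair in the query string."""
    query = parse_qs(urlsplit(url).query)
    # Keep only the first value of each parameter for simplicity.
    filters = {key: values[0] for key, values in query.items()}
    return [
        task for task in tasks
        if all(task.get(key) == value for key, value in filters.items())
    ]
```

Calling list_tasks("/tasks?resource_id=port-1&task_state=FAILED") on the
sample data returns only task t2, whose detail field would carry the failure
information for that resource.
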
>> --> How to proceed
>> This is where I really don't know... and I will therefore be brief.
>> We'll probably need some more brainstorming to flesh out all the details.
>> Once that is done, it might be a case of evaluating what needs to be done and
>> whether it is better to target this work onto existing plugins, or move it
>> out to v3 plugins (and hence do the actual work once the "core refactoring"
>> activities are complete).
>> Regards,
>> Salvatore
>> [1] https://etherpad.openstack.org/p/integrating-task-into-neutron
>> [2] http://paste.openstack.org/show/81184/
>> [3] https://etherpad.openstack.org/p/sillythings
