[openstack-dev] [Neutron] Introducing task oriented workflows

Hirofumi Ichihara ichihara.hirofumi at lab.ntt.co.jp
Tue Jun 3 09:20:20 UTC 2014


Hi, Salvatore

> It is totally correct that most Neutron resources have sloppy status management. Mostly because, as already pointed out, the 'status' for most resources was conceived to be a 'network fabric' status rather than a resource synchronisation status.
Exactly, and I reckon that Neutron needs a resource synchronisation status.

> As it emerged from previous posts in this thread, I reckon we have three choices:
> 1) Add a new attribute for describing "configuration" state. For instance this will have values such as PENDING_UPDATE, PENDING_DELETE, IN_SYNC, OUT_OF_SYNC, etc.
> 2) Merge status and configuration statuses into a single attribute. This will probably be simpler from a client perspective, but there are open questions such as whether a resource for which a task is in progress and is down should be reported as 'down' or 'pending_update'.
> 3) Not use any new flags, and use tasks to describe whether there are operations in progress on a resource.
> The status attribute will describe exclusively the 'fabric' status of a resource; however tasks will be exposed through the API - and a resource in sync will be a resource with no PENDING or FAILED task active on it.
Good suggestions.
I reckon that choice (3) is a discussion about a new API, while choices (1) and (2) are discussions about the current API.
It would not be good for the current API's problems to keep lingering into the future,
so the two should be discussed separately, and the fabric status should be improved via (1) or (2).
Even once (3) is achieved, if Neutron still has the same fabric status problem, users may be confused about the difference between resource status and task status.
Additionally, to be exact, a task shows not the resource status but the status of an API operation in progress.

I reckon we should improve the fabric status first, and then add tasks to Neutron.
Also, I think (2) is good, because the LBaaS model already provides a precedent for this approach.
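
For illustration, the LBaaS 1.0 model merges the provisioning and operational view into a single status attribute; applying the same idea to, say, a port might look roughly like this (a sketch only - the first five values come from the LBaaS model, the last one is an assumption about how "down but in sync" could be expressed):

    ACTIVE          - configured on the backend and in sync with the DB
    PENDING_CREATE  - create accepted, backend configuration in progress
    PENDING_UPDATE  - update accepted, backend configuration in progress
    PENDING_DELETE  - delete accepted, backend removal in progress
    ERROR           - the last operation failed; the backend may diverge from the DB
    DOWN            - in sync with the DB, but operationally down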

thanks,
Hirofumi

---------------------------------------------
Ichihara Hirofumi (市原 裕史)
NTT Software Innovation Center
Tel:0422-59-2843  Fax:0422-59-2699
Email:ichihara.hirofumi at lab.ntt.co.jp
---------------------------------------------


On 2014/05/30, at 17:57, Salvatore Orlando <sorlando at nicira.com> wrote:

> Hi Hirofumi,
> 
> I reckon this has been immediately recognised as a long term effort.
> However, I just want to clarify that by "long term" I don't mean pushing it back until we get to the next release cycle and we realize we are in the same place where we are today!
> 
> It is totally correct that most Neutron resources have sloppy status management. Mostly because, as already pointed out, the 'status' for most resources was conceived to be a 'network fabric' status rather than a resource synchronisation status.
> 
> As it emerged from previous posts in this thread, I reckon we have three choices:
> 1) Add a new attribute for describing "configuration" state. For instance this will have values such as PENDING_UPDATE, PENDING_DELETE, IN_SYNC, OUT_OF_SYNC, etc.
> 2) Merge status and configuration statuses into a single attribute. This will probably be simpler from a client perspective, but there are open questions such as whether a resource for which a task is in progress and is down should be reported as 'down' or 'pending_update'.
> 3) Not use any new flags, and use tasks to describe whether there are operations in progress on a resource.
> The status attribute will describe exclusively the 'fabric' status of a resource; however tasks will be exposed through the API - and a resource in sync will be a resource with no PENDING or FAILED task active on it.
> 
> The above are just options at the moment; I tend to lean toward the latter, but it would be great to have your feedback.
> 
> Salvatore
> 
> 
> 
> On 28 May 2014 11:20, Hirofumi Ichihara <ichihara.hirofumi at lab.ntt.co.jp> wrote:
> Hi, Salvatore
> 
> I think Neutron needs task management too.
> 
> IMO, the problem of Neutron resource status should be discussed separately.
> Task management would enable Neutron to roll back an API operation, clean up leftover resources, and retry an API operation within a single API request.
> Of course, we can use tasks to correct inconsistencies between the Neutron DB (resource status) and the actual resource configuration.
> But we should add resource status management to some resources before adding tasks.
> For example, LBaaS has resource status management [1].
> The basic problem is that Neutron routers and ports don't manage their status.
> 
>> For instance a port is "UP" if it's been wired by the OVS agent; it often does not tell us whether the actual resource configuration is exactly the desired one in the database. For instance, if the ovs agent fails to apply security groups to a port, the port stays "ACTIVE" and the user might never know there was an error and the actual state diverged from the desired one.
> So we should solve this problem with resource status management like LBaaS's, rather than with tasks.
> 
> I don't reject tasks, but tasks need a long-term discussion; I hope the status management will be fixed right away.
> 
> [1] https://wiki.openstack.org/wiki/Neutron/LBaaS/API_1.0#Synchronous_versus_Asynchronous_Plugin_Behavior
> 
> thanks,
> Hirofumi
> 
> ---------------------------------------------
> Hirofumi Ichihara
> NTT Software Innovation Center
> Tel:+81-422-59-2843  Fax:+81-422-59-2699
> Email:ichihara.hirofumi at lab.ntt.co.jp
> ---------------------------------------------
> 
> 
> On 2014/05/23, at 7:34, Salvatore Orlando <sorlando at nicira.com> wrote:
> 
>> As most of you probably know already, this is one of the topics discussed during the Juno summit [1].
>> I would like to kick off the discussion in order to move towards a concrete design.
>> 
>> Preamble: Considering the meat that's already on the plate for Juno, I'm not advocating that whatever comes out of this discussion should be put on the Juno roadmap. However, preparation (or yak shaving) activities that should be identified as pre-requisite might happen during the Juno time frame assuming that they won't interfere with other critical or high priority activities.
>> This is also a very long post; the TL;DR summary is that I would like to explore task-oriented communication with the backend and how it should be reflected in the API - gauging how the community feels about this, and collecting feedback regarding design, constructs, and related tools/techniques/technologies.
>> 
>> At the summit a broad range of items were discussed during the session, and most of them have been reported in the etherpad [1].
>> 
>> First, I think it would be good to clarify whether we're advocating a task-based API, a workflow-oriented operation processing, or both.
>> 
>> --> About a task-based API
>> 
>> In a task-based API, most PUT/POST API operations would return tasks rather than neutron resources, and users of the API will interact directly with tasks.
>> I put an example in [2] to avoid cluttering this post with too much text.
>> Since the API operation simply launches a task, the database state won't be updated until the task is completed.
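>> A rough sketch of what this might look like (a hypothetical payload, not the actual content of [2]):
>> 
>> POST /v2.0/ports  ->  202 Accepted
>> {"task": {"id": "<task_id>", "action": "create_port", "state": "PENDING", "resource": null}}
>> 
>> GET /v2.0/tasks/<task_id>  ->  {"task": {"state": "COMPLETED", "resource": "<port_id>"}}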
>> 
>> Needless to say, this would be a radical change to Neutron's API; it should be carefully evaluated and not considered for the v2 API.
>> Even though this approach clearly has a few benefits, I don't think it will improve the usability of the API at all. Indeed, it will limit the ability to operate on a resource while a task is executing on it, and will also require Neutron API users to change the paradigm they use to interact with the API; not to mention the fact that it would look odd if Neutron were the only API endpoint in OpenStack operating in this way.
>> For the Neutron API, I think that its operations should still manipulate the database state, and possibly return immediately after that (*) - a task, or better, a workflow, will then be started, executed asynchronously, and will update the resource status on completion.
>> 
>> --> On workflow-oriented operations
>> 
>> The benefits of it when it comes to easily controlling operations and ensuring consistency in case of failures are obvious. For what it's worth, I have been experimenting with introducing this kind of capability in the NSX plugin over the past few months. I've been using celery as a task queue, and writing the task management code from scratch - only to realize that the same features I was implementing are already supported by taskflow.
>> 
>> I think that all parts of Neutron API can greatly benefit from introducing a flow-based approach.
>> Some examples:
>> - pre/post commit operations in the ML2 plugin can be orchestrated a lot better as a workflow, articulating operations on the various drivers in a graph
>> - operations spanning multiple plugins (eg: add router interface) could be simplified using clearly defined tasks for the L2 and L3 parts
>> - it would be finally possible to properly manage resources' "operational status", as well as knowing whether the actual configuration of the backend matches the database configuration
>> - synchronous plugins might be converted into asynchronous ones, thus improving their API throughput
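>> 
>> As a rough illustration (hypothetical task names, not a proposed implementation), a port creation flow expressed with taskflow could look something like this:
>> 
>> from taskflow import engines, task
>> from taskflow.patterns import linear_flow
>> 
>> class CreatePortDb(task.Task):
>>     default_provides = 'port'
>> 
>>     def execute(self, port_data):
>>         # Persist the port in the Neutron DB with a PENDING status
>>         # and hand the record to the next task.
>>         return {'id': 'fake-port-id', 'status': 'PENDING_CREATE'}
>> 
>>     def revert(self, port_data, *args, **kwargs):
>>         # Remove the DB record if a later task fails.
>>         pass
>> 
>> class WirePortOnBackend(task.Task):
>>     def execute(self, port):
>>         # Ask the backend/agent to wire the port; raising here makes
>>         # the engine run the revert() methods of earlier tasks.
>>         pass
>> 
>> class MarkPortActive(task.Task):
>>     def execute(self, port):
>>         # Flip the status to ACTIVE only once the backend has confirmed.
>>         pass
>> 
>> flow = linear_flow.Flow('create_port').add(
>>     CreatePortDb(), WirePortOnBackend(), MarkPortActive())
>> engines.run(flow, store={'port_data': {'network_id': 'fake-net-id'}})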
>> 
>> Now, the caveats:
>> - during the sessions it was correctly pointed out that special care is required with multiple producers (ie: api servers), as workflows should always be executed in the correct order
>> - it is probably advisable to serialize workflows operating on the same resource; this might lead to unexpected situations (potentially to deadlocks) with workflows operating on multiple resources
>> - if the API is asynchronous, and multiple workflows might be queued or in execution at a given time, rolling back the DB operation on failures is probably not advisable (it would not be advisable anyway in any asynchronous framework). If the API instead stays synchronous, the revert action for a failed task might also restore the db state for a resource; but I think that keeping the API synchronous misses the point of this whole work a bit - feel free to show your disagreement here!
>> - some neutron workflows are actually initiated by agents; this is the case, for instance, of the workflow for doing initial L2 and security group configuration for a port.
>> - it's going to be a lot of work, and we need to devise a strategy to either roll these changes into the existing plugins or just decide that future v3 plugins will use it.
>> 
>> From the implementation side, I've done a bit of research, and task queues like celery only implement half of what is needed; conversely, I have not been able to find a workflow manager, at least in the python world, as complete and suitable as taskflow.
>> So my preference will be obviously to use it, and contribute to it should we realize Neutron needs some changes to suit its needs. Growing something neutron-specific in tree is something I'd rule out.
>> 
>> (*) This is a bit different from what many plugins do, as they execute requests synchronously and return only once the backend request is completed.
>> 
>> --> Tasks and the API
>> 
>> The etherpad [1] contains a lot of interesting notes on this topic.
>> One important item is to understand how tasks affect the resource's status to indicate their completion or failure. So far, Neutron resource status pretty much expresses its "fabric" status. For instance a port is "UP" if it's been wired by the OVS agent; it often does not tell us whether the actual resource configuration is exactly the desired one in the database. For instance, if the ovs agent fails to apply security groups to a port, the port stays "ACTIVE" and the user might never know there was an error and the actual state diverged from the desired one.
>> 
>> It is therefore important to allow users to know whether the backend state is in sync with the db; tools like taskflow will be really helpful to this aim.
>> However, how should this be represented? The main options are to either have a new attribute describing the resource sync state, or to extend the semantics of the current status attribute to also include the resource sync state. I've put some ramblings on the subject in the etherpad [3].
>> Still, it has been correctly pointed out that it might not be enough to know that a resource is out of sync; it is also good to know which operation exactly failed. This is where somehow exposing tasks through the API might come in handy.
>> 
>> For instance one could do something like:
>> 
>> GET /tasks?resource_id=<res_id>&task_state=FAILED
>> 
>> to get failure details for a given resource.
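>> 
>> The response could then carry enough context to diagnose what went wrong, for instance (hypothetical fields):
>> 
>> {"tasks": [{"id": "<task_id>", "resource_id": "<res_id>", "action": "update_port",
>>             "task_state": "FAILED", "failure": "security group rules could not be applied"}]}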
>> 
>> --> How to proceed
>> 
>> This is where I really don't know... and I will therefore be brief.
>> We'll probably need some more brainstorming to flesh out all the details.
>> Once that is done, it might be a matter of evaluating what needs to be done and whether it is better to target this work onto existing plugins, or to move it to v3 plugins (and hence do the actual work once the "core refactoring" activities are complete).
>> 
>> Regards,
>> Salvatore
>> 
>> 
>> [1] https://etherpad.openstack.org/p/integrating-task-into-neutron
>> [2] http://paste.openstack.org/show/81184/
>> [3] https://etherpad.openstack.org/p/sillythings
>> 
>> 
>> 
> 
> 
> 
> 


