[openstack-dev] [Solum] Nova task API, task flow, and actions from design summit
ccoleman at redhat.com
Wed Nov 6 01:39:59 UTC 2013
Quick summary of interesting discussions yesterday at the summit that relate to things we will face in Solum wrt async flows.
The two nova sessions on async work and the task API had a lot of good back and forth. The problem space is how to model and convey long running tasks in the nova API, and then how to start moving long running tasks into a consistent place in the nova code base. There appeared to be broad consensus on the rough shape of an API, and that this move should and would happen in Icehouse for a few important tasks (snapshot), but there are a lot of open questions about how best to handle the hard problems (flow state persistence, read/write access patterns into a persistent store, how to make tasks idempotent across retries and in the face of partitions and distributed transactions).
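To make the idempotency question concrete, here is a minimal sketch of recording a task result under a request key so that a retry reuses the recorded result instead of repeating the work. Everything in it is an assumption for illustration: TaskStore, run_idempotent and the key format are hypothetical names, not anything from nova or the task API proposal, and the in-memory dict stands in for a real persistent store. It also does not address the harder case raised in the sessions, where the action succeeds but the write to the store is lost to a partition.

```python
class TaskStore:
    """In-memory stand-in for a persistent task store (hypothetical)."""

    def __init__(self):
        self._results = {}

    def has(self, key):
        return key in self._results

    def get(self, key):
        return self._results[key]

    def put(self, key, result):
        self._results[key] = result


def run_idempotent(store, key, action, max_attempts=3):
    """Run `action` for `key`, retrying transient failures.

    If a result is already recorded for `key`, return it without
    calling `action` again.
    """
    if store.has(key):
        return store.get(key)

    last_error = None
    for _ in range(max_attempts):
        try:
            result = action()
        except RuntimeError as exc:  # treated as a transient failure here
            last_error = exc
            continue
        store.put(key, result)
        return result
    raise last_error
```

A caller would pass the same key on every retry of the same request (for example a client-supplied request id), so a duplicate request after a timeout returns the recorded result rather than starting a second snapshot.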
A highlight for me was that it almost exactly (down to a very low level) matched a set of discussions we've been having in OpenShift. The problem space is the same - you have a virtual resource (application) that manifests as a distributed set of servers that must be coordinated. You want to create (but create can be long running and can fail very late in the flow), you can restart and start these resources (usually in parallel), delete needs to be able to cut across a deep queue of work, and (this isn't yet a nova problem, but it will be a Heat/Solum problem) you need to allow multiple operations to execute in parallel. These are all application life cycle problems that Heat and Solum will have to deal with - with Solum potentially providing a thin layer on top of the Heat calls (or no layer).
The other session was glance and taskflow - they had general consensus to move ahead with their task API on top of a task flow implementation for a few of their existing long running tasks. Someone from cinder talked about their experience - some of the known gaps in task flow include restart of a job at a previous checkpoint (there are other domain problems on top of that of course) as well as the distributed execution engine for task flow (that would allow work to be more easily distributed across a cluster). Some follow up discussion included the need for general collaboration across the teams on demonstrating patterns of use around the harder problems (restart of flows, different types of distributed retry and failure recovery, idempotent calls).
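As a rough illustration of what "restart of a job at a previous checkpoint" means, here is a small sketch of a linear flow that records each completed step and skips it on a rerun. This is plain Python and deliberately does not use the taskflow library; run_flow, the checkpoint dict and the step names are all hypothetical, and a real implementation would persist the checkpoint somewhere durable.

```python
def run_flow(steps, checkpoint):
    """Run `steps` in order, resuming from `checkpoint`.

    steps: list of (name, callable) pairs; each callable receives the
        checkpoint dict so it can read results of earlier steps.
    checkpoint: dict mapping step name -> result; stands in for a
        persistent store and is updated in place.
    """
    for name, step in steps:
        if name in checkpoint:
            continue  # finished in an earlier run, do not repeat
        # If the step raises, nothing is recorded for it, so the next
        # run starts again at this step.
        checkpoint[name] = step(checkpoint)
    return checkpoint


def create(results):
    return "app"


def build(results):
    return results["create"] + "-image"


def deploy(results):
    return results["build"] + "-running"


if __name__ == "__main__":
    state = {}
    run_flow([("create", create), ("build", build), ("deploy", deploy)], state)
    print(state["deploy"])
```

If build fails on the first run, state holds only the create result, and calling run_flow again with the same state picks up at build. That only works if the failed step is itself safe to rerun, which is the idempotency problem again.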
For Solum, I think we need to be seriously prototyping a few relevant long running tasks (create, build, deploy) using task flow and getting familiar with the model. Likewise, we need to be following the task API work in nova and glance closely, and working with Heat and others to track this work.