[openstack-dev] Nova workflow management update

Joshua Harlow harlowja at yahoo-inc.com
Thu May 2 18:30:23 UTC 2013


So I think this is a good start, and it does move things in a direction
that will help.

This might connect into the abstracted set of primitives that Alex and
others are talking about.

I have sorta started this @ https://review.openstack.org/#/c/27869/

A primitive for your 'lock' (aka the copy-and-swap on the DB) might be a
useful addition.
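
To make that concrete, here's a minimal sketch (using SQLAlchemy) of
what a DB-backed compare-and-swap primitive could look like; the table
and column names below are just illustrative, not nova's actual schema:

from sqlalchemy import Column, MetaData, String, Table, create_engine

# Illustrative schema only -- not nova's real instances table.
engine = create_engine("sqlite://")
metadata = MetaData()
instances = Table("instances", metadata,
                  Column("uuid", String, primary_key=True),
                  Column("task_state", String, nullable=True))
metadata.create_all(engine)


def compare_and_swap(conn, instance_uuid, expected_state, new_state):
    """Atomically move task_state from expected_state to new_state.

    Returns True if we won the swap, False if someone beat us to it.
    """
    result = conn.execute(
        instances.update()
        .where(instances.c.uuid == instance_uuid)
        .where(instances.c.task_state == expected_state)
        .values(task_state=new_state))
    # rowcount == 0 means the row was no longer in expected_state.
    return result.rowcount == 1


with engine.begin() as conn:
    conn.execute(instances.insert().values(uuid="X", task_state=None))
    print(compare_and_swap(conn, "X", None, "resizing"))  # True, we got it
    print(compare_and_swap(conn, "X", None, "resizing"))  # False, lost race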

Do you imagine a new DB table, like 'locks-held' or something similar? Or
would we just continue using 'vm_state/task_state' for this (which is a
specific impl of a DB 'lock' table)? The edge case might be acceptable,
but it could be a little sketchy depending on the DB setup used. I think
it would be nice to have primitives for that type of operation for both
ZK and the DB.
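
For the ZK side of that same primitive, a rough sketch using kazoo could
look like the below (the lock path, conductor name and the do_resize
helper are all made up for illustration):

from kazoo.client import KazooClient


def do_resize(instance_uuid):
    print("resizing %s" % instance_uuid)  # stand-in for the real work


instance_uuid = "some-instance-uuid"
client = KazooClient(hosts="127.0.0.1:2181")
client.start()

# One lock node per instance; whoever holds it 'owns' the task on it.
lock = client.Lock("/nova/locks/instances/%s" % instance_uuid, "conductor-1")
if lock.acquire(blocking=False):
    try:
        do_resize(instance_uuid)
    finally:
        lock.release()
else:
    print("someone else is already working on %s" % instance_uuid)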

I think your tasks DB table (maybe better named an active-workflow DB
table?) might be an ok approach as well, but it seems like there could
also be primitives that allow ZK to do this same work (so this might be
another case of providing 2 impls). This could be a task-log primitive
(with a DB and a ZK storage backend?).
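
Something like the following is the kind of task-log interface I'm
imagining (the names are just a sketch for discussion, not what's in the
review):

import json


class TaskLogBackend(object):
    """Records checkpoints of an in-progress workflow so it can resume."""

    def record(self, workflow_id, checkpoint, details):
        raise NotImplementedError

    def fetch(self, workflow_id):
        raise NotImplementedError


class DBTaskLogBackend(TaskLogBackend):
    """Would insert rows into a (hypothetical) task_log DB table."""

    def __init__(self, session_factory):
        self.session_factory = session_factory


class ZKTaskLogBackend(TaskLogBackend):
    """Stores each checkpoint as a znode under the workflow's directory."""

    def __init__(self, client, base_path="/nova/workflows"):
        self.client = client
        self.base_path = base_path

    def record(self, workflow_id, checkpoint, details):
        path = "%s/%s/%s" % (self.base_path, workflow_id, checkpoint)
        self.client.create(path, json.dumps(details).encode("utf-8"),
                           makepath=True)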

For the liveness case I think your conductor workflow->hostname
association may be one way that will work. In the ZK case you wouldn't
need to assign the task to a conductor hostname in that workflow DB
table, since ZK can assign workflow ownership via watches on, say, a
workflow-specific zookeeper directory node (that is one way of doing it;
others are possible too). The design is a little different there, but it
should be possible to figure out an approach that works both ways.

The interesting thing about the ZK approach is that there doesn't need
to be any fixed association of a conductor (via a hostname) to a
workflow in order for the workflow to be resumed when that conductor is
restarted (by init.d or manually), since ZK can transfer ownership
(possibly using leader election here as well) in an automated way. That
way you can just spin up as many conductors as you want, and they pick
jobs off of ZK. That might also solve the 'crazy complex' problem you
mention (just don't associate conductors with workflows in the first
place, but have them 'vote', and the leader of that vote is the one that
works on the workflow). The code @
https://github.com/harlowja/zkplayground is useful to see something like
this in action.
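
As a very rough illustration of that idea (paths and names made up here;
zkplayground has a fuller example), each conductor could watch a
workflow directory and claim workflows with ephemeral 'owner' nodes, so
ownership moves automatically when a conductor's session dies:

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

WORKFLOW_DIR = "/nova/workflows"

client = KazooClient(hosts="127.0.0.1:2181")
client.start()
client.ensure_path(WORKFLOW_DIR)


def run_workflow(workflow_id):
    print("working on %s" % workflow_id)  # stand-in for the real work


@client.ChildrenWatch(WORKFLOW_DIR)
def on_workflows_changed(workflow_ids):
    for workflow_id in workflow_ids:
        owner_path = "%s/%s/owner" % (WORKFLOW_DIR, workflow_id)
        try:
            # Ephemeral node: it vanishes if this conductor's session
            # dies, letting another conductor claim the workflow.
            client.create(owner_path, b"conductor-1", ephemeral=True,
                          makepath=True)
        except NodeExistsError:
            continue  # someone else already owns this workflow
        # Real code would hand off to a worker; done inline for brevity.
        run_workflow(workflow_id)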

So that might be another primitive that we are thinking of (I guess you
could call it the liveness primitive). Not sure what the API could be
for that, but it does seem like it would fit in this same bucket of
primitives.
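
As a strawman, the API might be as small as the following (completely
made up; a DB impl could use heartbeats plus timeouts, a ZK impl could
use ephemeral nodes plus watches):

class Liveness(object):
    def i_am_alive(self, owner_name):
        """Register/refresh this owner as alive."""
        raise NotImplementedError

    def is_alive(self, owner_name):
        """Return True if the owner is currently considered alive."""
        raise NotImplementedError

    def on_death(self, owner_name, callback):
        """Call callback(owner_name) when the owner is declared dead."""
        raise NotImplementedError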

Then there is the task primitive, which connects to that task-log
primitive (and checkpointing). My idea here was to clearly understand
each workflow that we want to alter and refactor it into clear task
primitives, where each task 'object' does its work and can undo its
work. The combination of these tasks forms a workflow that accomplishes
some specific action. I have put up a basic primitive for this in the
above review, along with a linear workflow, which can start to form a
core workflow library that conductor or others can use (maybe move it to
oslo sometime, along with the rest of these primitives).
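
In spirit the task primitive is something along these lines (a
simplified sketch for this thread, not the actual code in the review):

class Task(object):
    def apply(self, context):
        raise NotImplementedError

    def revert(self, context):
        raise NotImplementedError


class LinearFlow(object):
    """Runs tasks in order, rolling back completed ones on failure."""

    def __init__(self, tasks):
        self.tasks = list(tasks)

    def run(self, context):
        completed = []
        try:
            for task in self.tasks:
                task.apply(context)
                completed.append(task)
        except Exception:
            # Undo whatever finished, in reverse order, then re-raise.
            for task in reversed(completed):
                task.revert(context)
            raise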

Thoughts?

Seeing lots of great ideas here, we can do it!
 
On 5/2/13 4:41 AM, "John Garbutt" <john at johngarbutt.com> wrote:

>That is the big problem. I think we agree a single conductor is a bad
>idea, but is the simplest fix.
>
>I was hoping to use the DB and keep it simple (ish).
>
>Start with copy and swap on the DB:
>- in a db transaction, check the existing server state, and atomically
>move to the new one
>- we kinda do the above already for some cases
>- API rejects requests in the wrong state (assuming we only support
>one task at once for a server, for now)
>- "edge case" of they get to past API, but before executed someone
>else beat them, just record and instance action saying that API
>request failed
>
>That should guard the starting of the task, then as you say, it could
>die, how do we restart?
>
>Maybe we have a "tasks" db for the conductor, so:
>- in the same transaction as the above call the conductor would...
>- assign the task to itself (by conductor queue name, i.e. host name)
>- the task can have checkpoints to help restart half way through
>
>So if it restarts, it can restart all the operations it had not yet
>completed using the info in the db. Not possible yet, but this was one
>of the main things we wanted to do anyway. If it fails on task resume,
>then it is responsible for recording that failure.
>
>As you say, there is a liveness issue. If we lose a conductor, when can
>we choose to move tasks to a new conductor? The first idea I have is
>to assume a conductor, if it dies, is brought back to life by the
>administrator. Monitor it all in the same way nova-compute is
>monitored. Maybe have an admin operation that is "dangerous" but lets
>you disable a conductor and move all the tasks to a new conductor?
>Just in case the admin is unable to resume the old conductor (or a new
>conductor with the same name).
>
>There is the case where two conductors have the same name, and share
>the same DB queue. Or the case where a conductor forgets a task half
>way through and no progress is made. I am thinking we should leave
>this to the administrator to monitor; it's crazy complex to fix.
>
>I think I am making reasonable requests of the cloud admins.
>There must be a flaw in this... just can't see it yet.
>Ideas?
>
>John
>
>On 1 May 2013 19:43, Joshua Harlow <harlowja at yahoo-inc.com> wrote:
>> I've started
>> https://wiki.openstack.org/wiki/TheBetterPathToLiveMigrationResizing and
>> will try to continue there.
>>
>> The other aspect that makes me wonder is after we have conductor doing
>> stuff is how do we ensure that locking of what it is doing is done
>> correctly.
>>
>> Say u have the following:
>>
>> API call #1 -> resize instance X (lets call this action A)
>> API call #2 -> resize instance X (lets call this action B)
>>
>>
>> Now both of those happen in the same millisecond, so what happens
>>(thought
>> game time!).
>>
>> It would seem they attempt to mark something in the DB saying 'working
>>on
>> X' by altering instance X's 'task/vm_state'. Ok so u can put a
>>transaction
>> around said write to the 'task/vm_state' of instance X to avoid both of
>> those api calls attempting to continue doing the work. So that’s good.
>>So
>> then lets say api #1 sends a message to some conductor Z asking it to do
>> the work via the MQ, that’s great, then the conductor Z starts doing
>>work
>> on instance X and such.
>>
>> So now the big iffy question that I have is what happens if conductor Z
>>is
>> 'killed' (say via error, exception, power failure, kill -9). What
>>happens
>> to action A? How can another conductor be assigned the work to do action
>> A? Will there be a new periodic task to scan the DB for 'dead' actions,
>> how do we determine if an action is dead or just taking a very long
>>time?
>> This 'liveness' issue is a big one that I think needs to be considered
>>and
>> if conductor and zookeeper get connected, then I think it can be done.
>>
>> Then the other big iffy stuff is how do we stop a third API call from
>> invoking a third action on a resource associated with instance X (say a
>> deletion of a volume) while the first api action is still being
>>conducted,
>> just associating an instance-level lock via 'task/vm_state' is not the
>> correct way to lock resources associated with instance X. This is
>> where zookeeper can come into play again (since its core design was
>>built
>> for distributed locking) and it can be used to not only lock the
>>instance
>> X 'task/vm_state' but all other resources associated with instance X
>>(in a
>> reliable manner).
>>
>> Thoughts?
>>
>> On 5/1/13 10:56 AM, "John Garbutt" <john at johngarbutt.com> wrote:
>>
>>>Hey,
>>>
>>>I think some lightweight sequence diagrams could make sense.
>>>
>>>On 29 April 2013 21:55, Joshua Harlow <harlowja at yahoo-inc.com> wrote:
>>>> Any thoughts on how the current conductor db-activity works with this?
>>>> I can see two entry points to conductor:
>>>> DB data calls
>>>>   |
>>>>   ------------------------------------------Conductor-->RPC/DB calls
>>>>to
>>>>do
>>>> this stuff
>>>>                                                |
>>>> Workflow on behalf of something calls          |
>>>>   |                                            |
>>>>   ---------------------------------------------|
>>>>
>>>> Maybe it's not a concern for 'H' but it seems one of those doesn't
>>>>belong
>>>> there (cough cough DB stuff).
>>>
>>>Maybe for the next release. It should become obvious I guess. I hope
>>>those db calls will disappear once we pull the workflows properly into
>>>conductor and the other servers become more stateless (in terms of
>>>nova db state).
>>>
>>>Key question: Should the conductor be allowed to make DB calls? I think
>>>yes?
>>>
>>>> My writeup @ https://wiki.openstack.org/wiki/StructuredStateManagement
>>>>is
>>>> a big part of the overall goal I think, where I think the small
>>>>iterations
>>>> are part of this goal, yet likely both small and big goals will be
>>>> happening at once, so it would be useful to ensure that we talk about
>>>>the
>>>> bigger goal and make sure the smaller iteration goal will eventually
>>>> arrive at the bigger goal (or can be adjusted to be that way). Since
>>>>some
>>>> rackspace folks will also be helping out building the underlying
>>>> foundation (convection library) for the end-goal it would be great to
>>>>work
>>>> together and make sure all small iterations also align with that
>>>> foundational library work.
>>>
>>>Take a look at spawn in XenAPI, it is heading down this direction:
>>>https://github.com/openstack/nova/blob/master/nova/virt/xenapi/vmops.py#L335
>>>
>>>I think we should just make a very small bit of the operation do
>>>rollback and state management, which is more just an exercise, and
>>>then start to pull more of the code into line as time progresses.
>>>Probably best done on something that has already been pulled into a
>>>conductor style job?
>>>
>>>> I'd be interested in what u think about moving the scheduler code
>>>>around,
>>>> since this also connects into some work the cisco folks want to do for
>>>> better scheduling, so that is yet another coordination of work that
>>>>needs
>>>> to happen (to end up at the same end-goal there as well).
>>>
>>>Yes, I think it's very related. I see this kind of thing:
>>>
>>>API --cast--> Conductor --call--> scheduler
>>>                                    --call--> compute
>>>                                    --call-->.....
>>>                                    --db--> finally state update shows
>>>completion of task
>>>
>>>Eventually the whole workflow, its persistence and rollback will be
>>>controlled by the new framework. In the first case we may just make
>>>sure the resource assignment gets rolled back if the call after the
>>>schedule fails, and we correctly try to call the scheduler again? The
>>>current live-migration scheduling code sort of does this kind of thing
>>>already.
>>>
>>>> I was thinking that documenting the current situation, possibly @
>>>> https://wiki.openstack.org/wiki/TheBetterPathToLiveMigration would
>>>>help.
>>>> Something like https://wiki.openstack.org/wiki/File:Run_workflow.png
>>>>might
>>>> help to easily visualize the current and fixed 'flow'/thread of
>>>>execution.
>>>
>>>Seems valuable. I will do something for live-migration one before
>>>starting on that. I kinda started on this (in text form) when I was
>>>doing the XenAPI live-migration:
>>>https://wiki.openstack.org/wiki/XenServer/LiveMigration#Live_Migration_RPC_Calls
>>>
>>>We should probably do one for resize too.
>>>
>>>John
>>


