[openstack-dev] [Heat] Convergence proof-of-concept showdown

Zane Bitter zbitter at redhat.com
Fri Dec 12 01:07:04 UTC 2014

On 11/12/14 08:26, Murugan, Visnusaran wrote:
>>> [Murugan, Visnusaran]
>>> In case of rollback where we have to cleanup earlier version of resources,
>> we could get the order from old template. We'd prefer not to have a graph
>> table.
>> In theory you could get it by keeping old templates around. But that means
>> keeping a lot of templates, and it will be hard to keep track of when you want
>> to delete them. It also means that when starting an update you'll need to
>> load every existing previous version of the template in order to calculate the
>> dependencies. It also leaves the dependencies in an ambiguous state when a
>> resource fails, and although that can be worked around it will be a giant pain
>> to implement.
> Agree that looking to all templates for a delete is not good. But baring
> Complexity, we feel we could achieve it by way of having an update and a
> delete stream for a stack update operation. I will elaborate in detail in the
> etherpad sometime tomorrow :)
>> I agree that I'd prefer not to have a graph table. After trying a couple of
>> different things I decided to store the dependencies in the Resource table,
>> where we can read or write them virtually for free because it turns out that
>> we are always reading or updating the Resource itself at exactly the same
>> time anyway.
> Not sure how this will work in an update scenario when a resource does not change
> and its dependencies do.

We'll always update the requirements, even when the properties don't change.

> Also taking care of deleting resources in order will
> be an issue.

It works fine.

> This implies that there will be different versions of a resource which
> will even complicate further.

No it doesn't, other than the different versions we already have due to 

>>>> This approach reduces DB queries by waiting for completion notification
>> on a topic. The drawback I see is that delete stack stream will be huge as it
>> will have the entire graph. We can always dump such data in
>> ResourceLock.data Json and pass a simple flag "load_stream_from_db" to
>> converge RPC call as a workaround for delete operation.
>>> This seems to be essentially equivalent to my 'SyncPoint' proposal[1], with
>> the key difference that the data is stored in-memory in a Heat engine rather
>> than the database.
>>> I suspect it's probably a mistake to move it in-memory for similar
>>> reasons to the argument Clint made against synchronising the marking off
>> of dependencies in-memory. The database can handle that and the problem
>> of making the DB robust against failures of a single machine has already been
>> solved by someone else. If we do it in-memory we are just creating a single
>> point of failure for not much gain. (I guess you could argue it doesn't matter,
>> since if any Heat engine dies during the traversal then we'll have to kick off
>> another one anyway, but it does limit our options if that changes in the
>> future.) [Murugan, Visnusaran] Resource completes, removes itself from
>> resource_lock and notifies engine. Engine will acquire parent lock and initiate
>> parent only if all its children are satisfied (no child entry in resource_lock).
>> This will come in place of Aggregator.
>> Yep, if you s/resource_lock/SyncPoint/ that's more or less exactly what I did.
>> The three differences I can see are:
>> 1) I think you are proposing to create all of the sync points at the start of the
>> traversal, rather than on an as-needed basis. This is probably a good idea. I
>> didn't consider it because of the way my prototype evolved, but there's now
>> no reason I can see not to do this.
>> If we could move the data to the Resource table itself then we could even
>> get it for free from an efficiency point of view.
> +1. But we will need engine_id to be stored somewhere for recovery purpose (easy to be queried format).

Yeah, so I'm starting to think you're right, maybe the/a Lock table is 
the right thing to use there. We could probably do it within the 
resource table using the same select-for-update to set the engine_id, 
but I agree that we might be starting to jam too much into that one table.

> Sync points are created as-needed. Single resource is enough to restart that entire stream.
> I think there is a disconnect in our understanding. I will detail it as well in the etherpad.

OK, that would be good.

>> 2) You're using a single list from which items are removed, rather than two
>> lists (one static, and one to which items are added) that get compared.
>> Assuming (1) then this is probably a good idea too.
> Yeah. We have a single list per active stream which work by removing
> Complete/satisfied resources from it.

I went to change this and then remembered why I did it this way: the 
sync point is also storing data about the resources that are triggering 
it. Part of this is the RefID and attributes, and we could replace that 
by storing that data in the Resource itself and querying it rather than 
having it passed in via the notification. But the other part is the 
ID/key of those resources, which we _need_ to know in order to update 
the requirements in case one of them has been replaced and thus the 
graph doesn't reflect it yet. (Or, for that matter, we need it to know 
where to go looking for the RefId and/or attributes if they're in the 
DB.) So we have to store some data, we can't just remove items from the 
required list (although we could do that as well).

>> 3) You're suggesting to notify the engine unconditionally and let the engine
>> decide if the list is empty. That's probably not a good idea - not only does it
>> require extra reads, it introduces a race condition that you then have to solve
>> (it can be solved, it's just more work).
>> Since the update to remove a child from the list is atomic, it's best to just
>> trigger the engine only if the list is now empty.
> No. Notify only if stream has something to be processed. The newer
> Approach based on db lock will be that the last resource will initiate its parent.
> This is opposite to what our Aggregator model had suggested.

OK, I think we're on the same page on this one then.

>>> It's not clear to me how the 'streams' differ in practical terms from
>>> just passing a serialisation of the Dependencies object, other than
>>> being incomprehensible to me ;). The current Dependencies
>>> implementation
>>> (1) is a very generic implementation of a DAG, (2) works and has plenty of
>> unit tests, (3) has, with I think one exception, a pretty straightforward API,
>> (4) has a very simple serialisation, returned by the edges() method, which
>> can be passed back into the constructor to recreate it, and (5) has an API that
>> is to some extent relied upon by resources, and so won't likely be removed
>> outright in any event.
>>> Whatever code we need to handle dependencies ought to just build on
>> this existing implementation.
>>> [Murugan, Visnusaran] Our thought was to reduce payload size
>> (template/graph). Just planning for worst case scenario (million resource
>> stack) We could always dump them in ResourceLock.data to be loaded by
>> Worker.
>> If there's a smaller representation of a graph than a list of edges then I don't
>> know what it is. The proposed stream structure certainly isn't it, unless you
>> mean as an alternative to storing the entire graph once for each resource. A
>> better alternative is to store it once centrally - in my current implementation
>> it is passed down through the trigger messages, but since only one traversal
>> can be in progress at a time it could just as easily be stored in the Stack table
>> of the database at the slight cost of an extra write.
> Agree that edge is the smallest representation of a graph. But it does not give
> us a complete picture without doing a DB lookup. Our assumption was to store
> streams in IN_PROGRESS resource_lock.data column. This could be in resource table
> instead.

That's true, but I think in practice at any point where we need to look 
at this we will always have already loaded the Stack from the DB for 
some other reason, so we actually can get it for free. (See detailed 
discussion in my reply to Anant.)

>> I'm not opposed to doing that, BTW. In fact, I'm really interested in your input
>> on how that might help make recovery from failure more robust. I know
>> Anant mentioned that not storing enough data to recover when a node dies
>> was his big concern with my current approach.
> With streams, We feel recovery will be easier. All we need is a trigger :)
>> I can see that by both creating all the sync points at the start of the traversal
>> and storing the dependency graph in the database instead of letting it flow
>> through the RPC messages, we would be able to resume a traversal where it
>> left off, though I'm not sure what that buys us.
>> And I guess what you're suggesting is that by having an explicit lock with the
>> engine ID specified, we can detect when a resource is stuck in IN_PROGRESS
>> due to an engine going down? That's actually pretty interesting.
> Yeah :)
>>> Based on our call on Thursday, I think you're taking the idea of the Lock
>> table too literally. The point of referring to locks is that we can use the same
>> concepts as the Lock table relies on to do atomic updates on a particular row
>> of the database, and we can use those atomic updates to prevent race
>> conditions when implementing SyncPoints/Aggregators/whatever you want
>> to call them. It's not that we'd actually use the Lock table itself, which
>> implements a mutex and therefore offers only a much slower and more
>> stateful way of doing what we want (lock mutex, change data, unlock
>> mutex).
>>> [Murugan, Visnusaran] Are you suggesting something like a select-for-
>> update in resource table itself without having  a lock table?
>> Yes, that's exactly what I was suggesting.
> DB is always good for sync. But we need to be careful not to overdo it.

Yeah, I see what you mean now, it's starting to _feel_ like there'd be 
too many things mixed together in the Resource table. Are you aware of 
some concrete harm that might cause though? What happens if we overdo 
it? Is select-for-update on a huge row more expensive than the whole 
overhead of manipulating the Lock?

Just trying to figure out if intuition is leading me astray here.

> Will update etherpad by tomorrow.

OK, thanks.


More information about the OpenStack-dev mailing list