[openstack-dev] [Heat] Convergence proof-of-concept showdown

Murugan, Visnusaran visnusaran.murugan at hp.com
Fri Dec 12 10:29:32 UTC 2014

> -----Original Message-----
> From: Zane Bitter [mailto:zbitter at redhat.com]
> Sent: Friday, December 12, 2014 6:37 AM
> To: openstack-dev at lists.openstack.org
> Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept
> showdown
> On 11/12/14 08:26, Murugan, Visnusaran wrote:
> >>> [Murugan, Visnusaran]
> >>> In case of rollback where we have to clean up earlier versions of
> >>> resources, we could get the order from the old template. We'd prefer
> >>> not to have a graph table.
> >>
> >> In theory you could get it by keeping old templates around. But that
> >> means keeping a lot of templates, and it will be hard to keep track
> >> of when you want to delete them. It also means that when starting an
> >> update you'll need to load every existing previous version of the
> >> template in order to calculate the dependencies. It also leaves the
> >> dependencies in an ambiguous state when a resource fails, and
> >> although that can be worked around it will be a giant pain to implement.
> >>
> >
> > Agree that looking at all templates for a delete is not good. But
> > barring complexity, we feel we could achieve it by having an update
> > stream and a delete stream for a stack update operation. I will
> > elaborate in detail in the etherpad sometime tomorrow :)
> >
> >> I agree that I'd prefer not to have a graph table. After trying a
> >> couple of different things I decided to store the dependencies in the
> >> Resource table, where we can read or write them virtually for free
> >> because it turns out that we are always reading or updating the
> >> Resource itself at exactly the same time anyway.
> >>
> >
> > Not sure how this will work in an update scenario when a resource does
> > not change and its dependencies do.
> We'll always update the requirements, even when the properties don't
> change.

Can you elaborate a bit on rollback? We had an approach with depends_on
and needed_by columns in the Resource table, but dropped it when we figured
out it needed too many DB operations for an update.
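
To make that bookkeeping concrete, here is a minimal sketch of per-resource
depends_on / needed_by values derived from a template's edge list. It is
purely illustrative and not Heat code: the function names and the
(requirer, required) edge shape are assumptions made for the example. It
shows why a resource row may need rewriting on update even when its
properties are unchanged.

```python
from collections import defaultdict


def depends_on_from_edges(edges):
    """Map each resource name to the set of names it requires.

    `edges` is a list of (requirer, required) pairs (hypothetical shape);
    `required` is None for a resource with no dependencies.
    """
    depends_on = defaultdict(set)
    for requirer, required in edges:
        depends_on[requirer]  # make sure leaf resources get an entry too
        if required is not None:
            depends_on[requirer].add(required)
            depends_on[required]
    return dict(depends_on)


def needed_by_from(depends_on):
    """Invert depends_on: map each resource to the resources that need it."""
    needed_by = {name: set() for name in depends_on}
    for requirer, required_names in depends_on.items():
        for required in required_names:
            needed_by[required].add(requirer)
    return needed_by


# Old template: server -> port -> net.
# New template: server -> port only (net is dropped from the template).
old = depends_on_from_edges([("server", "port"), ("port", "net")])
new = depends_on_from_edges([("server", "port")])

# "port" may have identical properties in both templates, but its
# requirements differ, so its row still has to be written on update.
changed = {name for name in new if new[name] != old.get(name)}
```

Each entry that ends up in `changed` (or every entry in `new`, if the
requirements are rewritten unconditionally) corresponds to one row write,
which is where the DB cost mentioned above comes from.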

> > Also taking care of deleting resources in order will be an issue.
> It works fine.
> > This implies that there will be different versions of a resource, which
> > will complicate things even further.
> No it doesn't, other than the different versions we already have due to
> UpdateReplace.
> >>>> This approach reduces DB queries by waiting for completion
> >>>> notification on a topic. The drawback I see is that delete stack
> >>>> stream will be huge as it will have the entire graph. We can always
> >>>> dump such data in ResourceLock.data Json and pass a simple flag
> >>>> "load_stream_from_db" to converge RPC call as a workaround for
> >>>> delete operation.
> >>>
> >>> This seems to be essentially equivalent to my 'SyncPoint'
> >>> proposal[1], with the key difference that the data is stored
> >>> in-memory in a Heat engine rather than the database.
> >>>
> >>> I suspect it's probably a mistake to move it in-memory for similar
> >>> reasons to the argument Clint made against synchronising the marking
> >>> off of dependencies in-memory. The database can handle that and the
> >>> problem of making the DB robust against failures of a single machine
> >>> has already been solved by someone else. If we do it in-memory we are
> >>> just creating a single point of failure for not much gain. (I guess
> >>> you could argue it doesn't matter, since if any Heat engine dies
> >>> during the traversal then we'll have to kick off another one anyway,
> >>> but it does limit our options if that changes in the future.)
> >>>
> >>> [Murugan, Visnusaran] Resource completes, removes itself from
> >>> resource_lock and notifies engine. Engine will acquire parent lock
> >>> and initiate parent only if all its children are satisfied (no child
> >>> entry in resource_lock). This will come in place of Aggregator.
> >>
> >> Yep, if you s/resource_lock/SyncPoint/ that's more or less exactly
> >> what I did. The three differences I can see are:
> >>
> >> 1) I think you are proposing to create all of the sync points at the
> >> start of the traversal, rather than on an as-needed basis. This is
> >> probably a good idea. I didn't consider it because of the way my
> >> prototype evolved, but there's now no reason I can see not to do this.
> >> If we could move the data to the Resource table itself then we could
> >> even get it for free from an efficiency point of view.
> >
> > +1. But we will need engine_id to be stored somewhere for recovery
> > purposes (in an easily queried format).
> Yeah, so I'm starting to think you're right, maybe the/a Lock table is the right
> thing to use there. We could probably do it within the resource table using
> the same select-for-update to set the engine_id, but I agree that we might
> be starting to jam too much into that one table.

Yeah, those would be unrelated values in the resource table. Upon resource
completion we have to unset engine_id as well, as compared to dropping a
row from resource_lock. Both are good. Having engine_id in the resource
table will cut DB operations in half. We should go with just the resource
table along with engine_id.
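
A minimal sketch of what keeping engine_id on the resource row could look
like, using sqlite3 only so that it runs standalone. The schema, column
names and helper functions are assumptions for illustration, not Heat's
actual model. SQLite has no SELECT ... FOR UPDATE, so a conditional UPDATE
stands in for the atomic claim here.

```python
import sqlite3


def make_db():
    """Create an in-memory DB with a toy (hypothetical) resource table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE resource ("
        " id INTEGER PRIMARY KEY, name TEXT, status TEXT, engine_id TEXT)")
    return db


def claim(db, resource_id, engine_id):
    """Atomically mark a resource as being worked on by engine_id.

    Returns False if another engine already holds the resource.
    """
    with db:  # commits on success, rolls back on error
        cur = db.execute(
            "UPDATE resource SET engine_id = ?, status = 'IN_PROGRESS' "
            "WHERE id = ? AND engine_id IS NULL",
            (engine_id, resource_id))
    return cur.rowcount == 1


def release(db, resource_id, engine_id):
    """Unset engine_id on completion: one UPDATE, no row to drop."""
    with db:
        cur = db.execute(
            "UPDATE resource SET engine_id = NULL, status = 'COMPLETE' "
            "WHERE id = ? AND engine_id = ?",
            (resource_id, engine_id))
    return cur.rowcount == 1


def stuck_resources(db, dead_engine_id):
    """Names of resources left IN_PROGRESS by an engine that went down."""
    rows = db.execute(
        "SELECT name FROM resource "
        "WHERE status = 'IN_PROGRESS' AND engine_id = ? ORDER BY id",
        (dead_engine_id,))
    return [row[0] for row in rows]


db = make_db()
with db:
    db.execute("INSERT INTO resource (id, name, status) "
               "VALUES (1, 'server', 'INIT')")

first_claim = claim(db, 1, "engine-a")    # succeeds
second_claim = claim(db, 1, "engine-b")   # refused, already claimed
orphans = stuck_resources(db, "engine-a") # what recovery would look for
```

The claim and the status change happen in the same UPDATE, which is the
saving compared with inserting and deleting a separate resource_lock row.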

> > Sync points are created as-needed. A single resource is enough to
> > restart that entire stream.
> > I think there is a disconnect in our understanding. I will detail it
> > as well in the etherpad.
> OK, that would be good.
> >> 2) You're using a single list from which items are removed, rather
> >> than two lists (one static, and one to which items are added) that
> >> get compared. Assuming (1) then this is probably a good idea too.
> >
> > Yeah. We have a single list per active stream, which works by
> > removing complete/satisfied resources from it.
> I went to change this and then remembered why I did it this way: the sync
> point is also storing data about the resources that are triggering it. Part of this
> is the RefID and attributes, and we could replace that by storing that data in
> the Resource itself and querying it rather than having it passed in via the
> notification. But the other part is the ID/key of those resources, which we
> _need_ to know in order to update the requirements in case one of them
> has been replaced and thus the graph doesn't reflect it yet. (Or, for that
> matter, we need it to know where to go looking for the RefID and/or
> attributes if they're in the DB.) So we have to store some data, we can't
> just remove items from the required list (although we could do that as
> well).
> >> 3) You're suggesting to notify the engine unconditionally and let the
> >> engine decide if the list is empty. That's probably not a good idea -
> >> not only does it require extra reads, it introduces a race condition
> >> that you then have to solve (it can be solved, it's just more work).
> >> Since the update to remove a child from the list is atomic, it's best
> >> to just trigger the engine only if the list is now empty.
> >>
> >
> > No. Notify only if the stream has something to be processed. In the
> > newer approach based on a DB lock, the last resource will initiate its
> > parent. This is opposite to what our Aggregator model had suggested.
> OK, I think we're on the same page on this one then.
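
For reference, the "last child triggers the parent" rule agreed on above
can be sketched as follows. This is a toy in-memory stand-in, not the
proposed DB implementation: the SyncPoint class shape and its method names
are assumptions, and a threading.Lock plays the role of the atomic row
update. It also keeps the data reported by each child, as discussed above.

```python
import threading


class SyncPoint:
    """Tracks which children a parent resource is still waiting on."""

    def __init__(self, parent, children):
        self.parent = parent
        self._waiting = set(children)
        self.inputs = {}  # child name -> data it reported (e.g. its RefID)
        self._lock = threading.Lock()  # stand-in for an atomic DB update

    def satisfy(self, child, data=None):
        """Record that `child` is complete.

        Returns True only for the single call that empties the waiting
        set, so exactly one caller goes on to trigger the parent.
        """
        with self._lock:
            if child not in self._waiting:
                return False  # unknown or already-satisfied child
            self._waiting.remove(child)
            self.inputs[child] = data
            return not self._waiting


sync = SyncPoint("server", ["port", "volume"])
triggers = [
    sync.satisfy("port", {"ref_id": "port-1"}),     # still waiting on volume
    sync.satisfy("volume", {"ref_id": "vol-1"}),    # last one: trigger parent
    sync.satisfy("volume", {"ref_id": "vol-1"}),    # duplicate notification
]
```

Because the removal and the emptiness check happen under the same lock,
no separate read is needed to decide whether to notify, and a duplicate
notification cannot trigger the parent twice.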


> >>> It's not clear to me how the 'streams' differ in practical terms
> >>> from just passing a serialisation of the Dependencies object, other
> >>> than being incomprehensible to me ;). The current Dependencies
> >>> implementation (1) is a very generic implementation of a DAG, (2)
> >>> works and has plenty of unit tests, (3) has, with I think one
> >>> exception, a pretty straightforward API, (4) has a very simple
> >>> serialisation, returned by the edges() method, which can be passed
> >>> back into the constructor to recreate it, and (5) has an API that is
> >>> to some extent relied upon by resources, and so won't likely be
> >>> removed outright in any event.
> >>> Whatever code we need to handle dependencies ought to just build on
> >>> this existing implementation.
> >>> [Murugan, Visnusaran] Our thought was to reduce payload size
> >>> (template/graph). Just planning for the worst-case scenario (a
> >>> million-resource stack). We could always dump them in
> >>> ResourceLock.data to be loaded by the Worker.
> >>
> >> If there's a smaller representation of a graph than a list of edges
> >> then I don't know what it is. The proposed stream structure certainly
> >> isn't it, unless you mean as an alternative to storing the entire
> >> graph once for each resource. A better alternative is to store it
> >> once centrally - in my current implementation it is passed down
> >> through the trigger messages, but since only one traversal can be in
> >> progress at a time it could just as easily be stored in the Stack
> >> table of the database at the slight cost of an extra write.
> >>
> >
> > Agree that an edge list is the smallest representation of a graph. But
> > it does not give us a complete picture without doing a DB lookup. Our
> > assumption was to store streams in the IN_PROGRESS resource_lock.data
> > column. This could be in the resource table instead.
> That's true, but I think in practice at any point where we need to look at this
> we will always have already loaded the Stack from the DB for some other
> reason, so we actually can get it for free. (See detailed discussion in my reply
> to Anant.)

Aren't we planning to stop loading the stack with all its resource objects
in future, to address the scalability concerns we currently have?
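
As an aside on the edge-list serialisation discussed above, here is a
minimal sketch of a graph that round-trips through a list of edges and
yields create and delete orders. It is not Heat's Dependencies class: the
EdgeListGraph name, the (requirer, required) edge shape and the method
names other than edges() are assumptions for illustration. It uses
graphlib from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


class EdgeListGraph:
    """Toy DAG that can be rebuilt from its own edges() output."""

    def __init__(self, edges=()):
        self._requires = {}
        for requirer, required in edges:
            self._requires.setdefault(requirer, set())
            if required is not None:
                self._requires[requirer].add(required)
                self._requires.setdefault(required, set())

    def edges(self):
        """Serialise as (requirer, required) pairs.

        A resource with no requirements is emitted as (name, None) so
        that it survives the round trip.
        """
        out = []
        for requirer in sorted(self._requires):
            required = sorted(self._requires[requirer])
            if required:
                out.extend((requirer, name) for name in required)
            else:
                out.append((requirer, None))
        return out

    def create_order(self):
        """Requirements first, then the resources that need them."""
        return list(TopologicalSorter(self._requires).static_order())

    def delete_order(self):
        """Reverse of create order: dependents go before their requirements."""
        return list(reversed(self.create_order()))


graph = EdgeListGraph([("server", "port"), ("port", "net"), ("volume", None)])
serialised = graph.edges()            # small enough to pass in a message
rebuilt = EdgeListGraph(serialised)   # or to reload from a DB column
create = graph.create_order()
delete = graph.delete_order()
```

The payload is one pair per edge, so whether it travels in the RPC message
or sits in a single column, the worker can rebuild the full ordering
without loading every resource object.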

> >> I'm not opposed to doing that, BTW. In fact, I'm really interested in
> >> your input on how that might help make recovery from failure more
> >> robust. I know Anant mentioned that not storing enough data to
> >> recover when a node dies was his big concern with my current approach.
> >>
> >
> > With streams, we feel recovery will be easier. All we need is a
> > trigger :)
> >
> >> I can see that by both creating all the sync points at the start of
> >> the traversal and storing the dependency graph in the database
> >> instead of letting it flow through the RPC messages, we would be able
> >> to resume a traversal where it left off, though I'm not sure what
> >> that buys us.
> >>
> >> And I guess what you're suggesting is that by having an explicit lock
> >> with the engine ID specified, we can detect when a resource is stuck
> >> in IN_PROGRESS due to an engine going down? That's actually pretty
> >> interesting.
> >>
> >
> > Yeah :)
> >
> >>> Based on our call on Thursday, I think you're taking the idea of the
> >>> Lock table too literally. The point of referring to locks is that we
> >>> can use the same concepts as the Lock table relies on to do atomic
> >>> updates on a particular row of the database, and we can use those
> >>> atomic updates to prevent race conditions when implementing
> >>> SyncPoints/Aggregators/whatever you want to call them. It's not that
> >>> we'd actually use the Lock table itself, which implements a mutex and
> >>> therefore offers only a much slower and more stateful way of doing
> >>> what we want (lock mutex, change data, unlock mutex).
> >>> [Murugan, Visnusaran] Are you suggesting something like a
> >>> select-for-update in the resource table itself without having a lock
> >>> table?
> >>
> >> Yes, that's exactly what I was suggesting.
> >
> > DB is always good for sync. But we need to be careful not to overdo it.
> Yeah, I see what you mean now, it's starting to _feel_ like there'd be too
> many things mixed together in the Resource table. Are you aware of some
> concrete harm that might cause though? What happens if we overdo it? Is
> select-for-update on a huge row more expensive than the whole overhead
> of manipulating the Lock?
> Just trying to figure out if intuition is leading me astray here.

You are right. There should be no difference apart from a small bump
in memory usage, and I think that should be fine.

> > Will update etherpad by tomorrow.
> OK, thanks.
> cheers,
> Zane.
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
