[openstack-dev] [Heat] Convergence proof-of-concept showdown
Murugan, Visnusaran
visnusaran.murugan at hp.com
Mon Dec 15 12:47:47 UTC 2014
Hi Zane,
We have been going through this chain for quite some time now and we still feel a disconnect in our understanding.
Could you put up an etherpad where we can understand your approach? For example, for storing resource dependencies,
are you storing a (name, version) tuple or just the resource ID? If I understand correctly, you are updating all resources
on every update regardless of whether they changed, which will be inefficient if the stack contains a million resources.
We have similar questions regarding other areas of your implementation, which we believe would be cleared up if we
understood its outline. It is difficult to get a hold of your approach just by looking at the code; docstrings or an etherpad would help.
About streams: yes, in a million-resource stack the data will be huge, but still smaller than the template. Also, this stream
is stored only in IN_PROGRESS resources. The reason for carrying the entire dependency list is to reduce DB queries during
a stack update. With a singular dependency on each resource, similar to your implementation, we would end up loading
dependencies one at a time and altering almost every resource's dependencies regardless of whether they changed.
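For concreteness, here is a rough sketch of the kind of stream payload we have in mind, stored with an IN_PROGRESS resource (the field names below are illustrative only, not final):

    stream = {
        'stack_id': 'stack-uuid',
        'traversal_id': 'traversal-uuid',
        # The full forward dependency list for this traversal, so a worker can
        # find successors without extra DB queries: node -> nodes waiting on it.
        'dependencies': {
            'network': ['subnet'],
            'subnet': ['server'],
            'server': ['floating_ip'],
            'floating_ip': [],
        },
    }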
Regarding the two-template approach for delete: it is not actually two different templates. It is just that we have a delete
stream to be taken up post-update (any post operation will be handled as an update). This applies when Rollback==True;
we can always fall back to the regular (non-delete) stream if Rollback==False.
In our view there should be only one basic operation, and that is UPDATE (a minimal sketch follows the list below):
1. CREATE will be an update where the realized graph is empty.
2. UPDATE will be an update where the realized graph is fully or partially realized (possibly with a delete stream as a post operation if Rollback==True).
3. DELETE will be just another update with an empty to_be_realized_graph.
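A minimal sketch of that single operation, using plain sets of resource names in place of real graphs (process() here is just a stand-in for converging one stream; rollback handling is omitted):

    def process(stack, stream):
        # Stand-in for converging one stream of resources.
        for name in stream:
            print('%s: converge %s' % (stack, name))

    def converge(stack, to_be_realized, realized):
        # The one basic operation: everything is an UPDATE.
        update_stream = to_be_realized               # what must exist afterwards
        delete_stream = realized - to_be_realized    # cleaned up as a post operation
        process(stack, update_stream)
        process(stack, delete_stream)

    converge('s1', {'a', 'b'}, set())           # CREATE: realized graph is empty
    converge('s1', {'a', 'c'}, {'a', 'b'})      # UPDATE: partially realized
    converge('s1', set(), {'a', 'c'})           # DELETE: to_be_realized is empty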
It would be great if we could freeze a stable approach by mid-week, as the Christmas vacations are round the corner. :) :)
> -----Original Message-----
> From: Zane Bitter [mailto:zbitter at redhat.com]
> Sent: Saturday, December 13, 2014 5:43 AM
> To: openstack-dev at lists.openstack.org
> Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept
> showdown
>
> On 12/12/14 05:29, Murugan, Visnusaran wrote:
> >
> >
> >> -----Original Message-----
> >> From: Zane Bitter [mailto:zbitter at redhat.com]
> >> Sent: Friday, December 12, 2014 6:37 AM
> >> To: openstack-dev at lists.openstack.org
> >> Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept
> >> showdown
> >>
> >> On 11/12/14 08:26, Murugan, Visnusaran wrote:
> >>>>> [Murugan, Visnusaran]
> >>>>> In case of rollback where we have to cleanup earlier version of
> >>>>> resources,
> >>>> we could get the order from old template. We'd prefer not to have a
> >>>> graph table.
> >>>>
> >>>> In theory you could get it by keeping old templates around. But
> >>>> that means keeping a lot of templates, and it will be hard to keep
> >>>> track of when you want to delete them. It also means that when
> >>>> starting an update you'll need to load every existing previous
> >>>> version of the template in order to calculate the dependencies. It
> >>>> also leaves the dependencies in an ambiguous state when a resource
> >>>> fails, and although that can be worked around it will be a giant pain to
> implement.
> >>>>
> >>>
> >>> Agree that looking at all templates for a delete is not good. But
> >>> barring complexity, we feel we could achieve it by way of having an
> >>> update and a delete stream for a stack update operation. I will
> >>> elaborate in detail in the etherpad sometime tomorrow :)
> >>>
> >>>> I agree that I'd prefer not to have a graph table. After trying a
> >>>> couple of different things I decided to store the dependencies in
> >>>> the Resource table, where we can read or write them virtually for
> >>>> free because it turns out that we are always reading or updating
> >>>> the Resource itself at exactly the same time anyway.
> >>>>
> >>>
> >>> Not sure how this will work in an update scenario when a resource
> >>> does not change and its dependencies do.
> >>
> >> We'll always update the requirements, even when the properties don't
> >> change.
> >>
> >
> > Can you elaborate a bit on rollback.
>
> I didn't do anything special to handle rollback. It's possible that we need to -
> obviously the difference in the UpdateReplace + rollback case is that the
> replaced resource is now the one we want to keep, and yet the
> replaced_by/replaces dependency will force the newer (replacement)
> resource to be checked for deletion first, which is an inversion of the usual
> order.
>
> However, I tried to think of a scenario where that would cause problems and
> I couldn't come up with one. Provided we know the actual, real-world
> dependencies of each resource I don't think the ordering of those two
> checks matters.
>
> In fact, I currently can't think of a case where the dependency order
> between replacement and replaced resources matters at all. It matters in the
> current Heat implementation because resources are artificially segmented
> into the current and backup stacks, but with a holistic view of dependencies
> that may well not be required. I tried taking that line out of the simulator
> code and all the tests still passed. If anybody can think of a scenario in which
> it would make a difference, I would be very interested to hear it.
>
> In any event though, it should be no problem to reverse the direction of that
> one edge in these particular circumstances if it does turn out to be a
> problem.
>
> > We had an approach with depends_on
> > and needed_by columns in ResourceTable. But dropped it when we
> figured
> > out we had too many DB operations for Update.
>
> Yeah, I initially ran into this problem too - you have a bunch of nodes that are
> waiting on the current node, and now you have to go look them all up in the
> database to see what else they're waiting on in order to tell if they're ready
> to be triggered.
>
> It turns out the answer is to distribute the writes but centralise the reads. So
> at the start of the update, we read all of the Resources, obtain their
> dependencies and build one central graph[1]. We then make that graph
> available to each resource (either by passing it as a notification parameter, or
> storing it somewhere central in the DB that they will all have to read anyway,
> i.e. the Stack). But when we update a dependency we don't update the
> central graph, we update the individual Resource so there's no global lock
> required.
>
> [1]
> https://github.com/zaneb/heat-convergence-prototype/blob/distributed-
> graph/converge/stack.py#L166-L168
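[For illustration only: a minimal sketch of this "centralise the reads, distribute the writes" idea as I read it. The function and field names below are invented, not the prototype's actual API, and the readiness check via sync points is omitted.]

    def trigger(name, graph):
        print('check resource %s' % name)    # stand-in for the RPC cast to a worker

    def start_traversal(resources):
        # One centralised read at the start: build the whole dependency graph.
        graph = {r['name']: set(r['requires']) for r in resources}
        for name, reqs in graph.items():
            if not reqs:
                trigger(name, graph)         # the graph travels with the notification
        return graph

    def resource_complete(name, graph):
        # Distributed writes: only this resource's own row would be updated here;
        # the central graph itself is never rewritten during the traversal.
        for waiter, reqs in graph.items():
            if name in reqs:
                trigger(waiter, graph)

    graph = start_traversal([{'name': 'net', 'requires': []},
                             {'name': 'server', 'requires': ['net']}])
    resource_complete('net', graph)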
>
> >>> Also taking care of deleting resources in order will be an issue.
> >>
> >> It works fine.
> >>
> >>> This implies that there will be different versions of a resource
> >>> which will even complicate further.
> >>
> >> No it doesn't, other than the different versions we already have due
> >> to UpdateReplace.
> >>
> >>>>>> This approach reduces DB queries by waiting for completion
> >>>>>> notification
> >>>> on a topic. The drawback I see is that delete stack stream will be
> >>>> huge as it will have the entire graph. We can always dump such data
> >>>> in ResourceLock.data Json and pass a simple flag
> >>>> "load_stream_from_db" to converge RPC call as a workaround for
> >>>> delete
> >> operation.
> >>>>>
> >>>>> This seems to be essentially equivalent to my 'SyncPoint'
> >>>>> proposal[1], with
> >>>> the key difference that the data is stored in-memory in a Heat
> >>>> engine rather than the database.
> >>>>>
> >>>>> I suspect it's probably a mistake to move it in-memory for similar
> >>>>> reasons to the argument Clint made against synchronising the
> >>>>> marking off
> >>>> of dependencies in-memory. The database can handle that and the
> >>>> problem of making the DB robust against failures of a single
> >>>> machine has already been solved by someone else. If we do it
> >>>> in-memory we are just creating a single point of failure for not
> >>>> much gain. (I guess you could argue it doesn't matter, since if any
> >>>> Heat engine dies during the traversal then we'll have to kick off
> >>>> another one anyway, but it does limit our options if that changes
> >>>> in the
> >>>> future.) [Murugan, Visnusaran] Resource completes, removes itself
> >>>> from resource_lock and notifies engine. Engine will acquire parent
> >>>> lock and initiate parent only if all its children are satisfied (no
> >>>> child entry in
> >> resource_lock).
> >>>> This will come in place of Aggregator.
> >>>>
> >>>> Yep, if you s/resource_lock/SyncPoint/ that's more or less exactly
> >>>> what I
> >> did.
> >>>> The three differences I can see are:
> >>>>
> >>>> 1) I think you are proposing to create all of the sync points at
> >>>> the start of the traversal, rather than on an as-needed basis. This
> >>>> is probably a good idea. I didn't consider it because of the way my
> >>>> prototype evolved, but there's now no reason I can see not to do this.
> >>>> If we could move the data to the Resource table itself then we
> >>>> could even get it for free from an efficiency point of view.
> >>>
> >>> +1. But we will need engine_id to be stored somewhere for recovery
> >> purpose (easy to be queried format).
> >>
> >> Yeah, so I'm starting to think you're right, maybe the/a Lock table
> >> is the right thing to use there. We could probably do it within the
> >> resource table using the same select-for-update to set the engine_id,
> >> but I agree that we might be starting to jam too much into that one table.
> >>
> >
> > Yeah. Unrelated values in the resource table. Upon resource completion we
> > have to unset engine_id, as compared to just dropping a row from the
> > resource lock. Both are good. Having engine_id in the resource table will
> > reduce DB operations by half. We should go with just the resource table
> > along with engine_id.
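[Sketch of the kind of atomic claim being discussed, as a compare-and-set UPDATE on the resource row itself; table and column names are assumptions for illustration.]

    from sqlalchemy import text

    def acquire(session, resource_id, engine_id):
        # Atomically claim the resource by setting engine_id only if it is unset.
        result = session.execute(
            text("UPDATE resource SET engine_id = :me "
                 "WHERE id = :rid AND engine_id IS NULL"),
            {'me': engine_id, 'rid': resource_id})
        session.commit()
        return result.rowcount == 1   # False: another engine holds it (or died holding it)

    def release(session, resource_id, engine_id):
        # The extra write compared to dropping a lock-table row: unset engine_id.
        session.execute(
            text("UPDATE resource SET engine_id = NULL "
                 "WHERE id = :rid AND engine_id = :me"),
            {'rid': resource_id, 'me': engine_id})
        session.commit()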
>
> OK
>
> >>> Sync points are created as-needed. Single resource is enough to
> >>> restart
> >> that entire stream.
> >>> I think there is a disconnect in our understanding. I will detail it
> >>> as well in
> >> the etherpad.
> >>
> >> OK, that would be good.
> >>
> >>>> 2) You're using a single list from which items are removed, rather
> >>>> than two lists (one static, and one to which items are added) that
> >>>> get
> >> compared.
> >>>> Assuming (1) then this is probably a good idea too.
> >>>
> >>> Yeah. We have a single list per active stream, which works by removing
> >>> complete/satisfied resources from it.
> >>
> >> I went to change this and then remembered why I did it this way: the
> >> sync point is also storing data about the resources that are
> >> triggering it. Part of this is the RefID and attributes, and we could
> >> replace that by storing that data in the Resource itself and querying
> >> it rather than having it passed in via the notification. But the
> >> other part is the ID/key of those resources, which we _need_ to know
> >> in order to update the requirements in case one of them has been
> >> replaced and thus the graph doesn't reflect it yet. (Or, for that
> >> matter, we need it to know where to go looking for the RefId and/or
> >> attributes if they're in the
> >> DB.) So we have to store some data, we can't just remove items from
> >> the required list (although we could do that as well).
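[For concreteness, a hypothetical shape for the data a sync point accumulates; the field names are invented for this sketch.]

    sync_point = {
        'entity': 'server',                  # the node being waited on
        'traversal_id': 'traversal-uuid',
        # Static list of requirement keys, compared against what has arrived.
        'required': [('net', 1), ('subnet', 1)],
        # (name, version) key -> data passed in by each completed requirement,
        # so replacements can be resolved and RefId/attributes located later.
        'satisfied': {('net', 2): {'refid': 'net-uuid', 'attrs': {}}},
    }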
> >>
> >>>> 3) You're suggesting to notify the engine unconditionally and let
> >>>> the engine decide if the list is empty. That's probably not a good
> >>>> idea - not only does it require extra reads, it introduces a race
> >>>> condition that you then have to solve (it can be solved, it's just more
> work).
> >>>> Since the update to remove a child from the list is atomic, it's
> >>>> best to just trigger the engine only if the list is now empty.
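[A sketch of that pattern, assuming a SELECT ... FOR UPDATE on a sync point row; SyncPoint and trigger_check are illustrative names, not real APIs.]

    def mark_satisfied(session, parent, child):
        # Remove the child inside one transaction; exactly one caller sees the
        # list become empty, so only that caller triggers the parent's check.
        row = (session.query(SyncPoint)
                      .filter_by(name=parent)
                      .with_for_update()
                      .one())
        remaining = [c for c in row.waiting_on if c != child]
        row.waiting_on = remaining
        session.commit()
        if not remaining:
            trigger_check(parent)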
> >>>>
> >>>
> >>> No. Notify only if the stream has something to be processed. The newer
> >>> approach based on the DB lock will be that the last resource will
> >>> initiate its
> >> parent.
> >>> This is opposite to what our Aggregator model had suggested.
> >>
> >> OK, I think we're on the same page on this one then.
> >>
> >
> >
> > Yeah.
> >
> >>>>> It's not clear to me how the 'streams' differ in practical terms
> >>>>> from just passing a serialisation of the Dependencies object,
> >>>>> other than being incomprehensible to me ;). The current
> >>>>> Dependencies implementation
> >>>>> (1) is a very generic implementation of a DAG, (2) works and has
> >>>>> plenty of
> >>>> unit tests, (3) has, with I think one exception, a pretty
> >>>> straightforward API,
> >>>> (4) has a very simple serialisation, returned by the edges()
> >>>> method, which can be passed back into the constructor to recreate
> >>>> it, and (5) has an API that is to some extent relied upon by
> >>>> resources, and so won't likely be removed outright in any event.
> >>>>> Whatever code we need to handle dependencies ought to just build
> >>>>> on
> >>>> this existing implementation.
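[For illustration, the edge-list round trip being described; this is a generic sketch, not the actual heat.engine.dependencies API.]

    class Graph(object):
        def __init__(self, edges):
            self._edges = set(edges)    # each edge is (requirer, required)

        def edges(self):
            # The simple serialisation: a list of edges that can be passed
            # straight back into the constructor to recreate the graph.
            return list(self._edges)

    g = Graph([('server', 'net'), ('floating_ip', 'server')])
    g2 = Graph(g.edges())               # round trip
    assert set(g2.edges()) == set(g.edges())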
> >>>>> [Murugan, Visnusaran] Our thought was to reduce payload size
> >>>> (template/graph). Just planning for worst case scenario (million
> >>>> resource
> >>>> stack) We could always dump them in ResourceLock.data to be loaded
> >>>> by Worker.
>
> With the latest updates to the Etherpad, I'm even more confused by streams
> than I was before.
>
> One thing I never understood is why do you need to store the whole path to
> reach each node in the graph? Surely you only need to know the nodes this
> one is waiting on, the nodes waiting on this one and the ones those are
> waiting on, not the entire history up to this point. The size of each stream is
> theoretically up to O(n^2) and you're storing n of them - that's going to get
> painful in this million-resource stack.
>
> >>>> If there's a smaller representation of a graph than a list of edges
> >>>> then I don't know what it is. The proposed stream structure
> >>>> certainly isn't it, unless you mean as an alternative to storing
> >>>> the entire graph once for each resource. A better alternative is to
> >>>> store it once centrally - in my current implementation it is passed
> >>>> down through the trigger messages, but since only one traversal can
> >>>> be in progress at a time it could just as easily be stored in the
> >>>> Stack table of the
> >> database at the slight cost of an extra write.
> >>>>
> >>>
> >>> Agree that edge is the smallest representation of a graph. But it
> >>> does not give us a complete picture without doing a DB lookup. Our
> >>> assumption was to store streams in IN_PROGRESS resource_lock.data
> >>> column. This could be in resource table instead.
> >>
> >> That's true, but I think in practice at any point where we need to
> >> look at this we will always have already loaded the Stack from the DB
> >> for some other reason, so we actually can get it for free. (See
> >> detailed discussion in my reply to Anant.)
> >>
> >
> > Aren't we planning to stop loading the stack with all resource objects in
> > future, to address the scalability concerns we currently have?
>
> We plan on not loading all of the Resource objects each time we load the
> Stack object, but I think we will always need to have loaded the Stack object
> (for example, we'll need to check the current traversal ID, amongst other
> reasons). So if the serialised dependency graph is stored in the Stack it will be
> no big deal.
>
> >>>> I'm not opposed to doing that, BTW. In fact, I'm really interested
> >>>> in your input on how that might help make recovery from failure
> >>>> more robust. I know Anant mentioned that not storing enough data to
> >>>> recover when a node dies was his big concern with my current
> approach.
> >>>>
> >>>
> >>> With streams, we feel recovery will be easier. All we need is a
> >>> trigger :)
> >>>
> >>>> I can see that by both creating all the sync points at the start of
> >>>> the traversal and storing the dependency graph in the database
> >>>> instead of letting it flow through the RPC messages, we would be
> >>>> able to resume a traversal where it left off, though I'm not sure
> >>>> what that buys
> >> us.
> >>>>
> >>>> And I guess what you're suggesting is that by having an explicit
> >>>> lock with the engine ID specified, we can detect when a resource is
> >>>> stuck in IN_PROGRESS due to an engine going down? That's actually
> >>>> pretty
> >> interesting.
> >>>>
> >>>
> >>> Yeah :)
> >>>
> >>>>> Based on our call on Thursday, I think you're taking the idea of
> >>>>> the Lock
> >>>> table too literally. The point of referring to locks is that we can
> >>>> use the same concepts as the Lock table relies on to do atomic
> >>>> updates on a particular row of the database, and we can use those
> >>>> atomic updates to prevent race conditions when implementing
> >>>> SyncPoints/Aggregators/whatever you want to call them. It's not
> >>>> that we'd actually use the Lock table itself, which implements a
> >>>> mutex and therefore offers only a much slower and more stateful way
> >>>> of doing what we want (lock mutex, change data, unlock mutex).
> >>>>> [Murugan, Visnusaran] Are you suggesting something like a
> >>>>> select-for-
> >>>> update in resource table itself without having a lock table?
> >>>>
> >>>> Yes, that's exactly what I was suggesting.
> >>>
> >>> DB is always good for sync. But we need to be careful not to overdo it.
> >>
> >> Yeah, I see what you mean now, it's starting to _feel_ like there'd
> >> be too many things mixed together in the Resource table. Are you
> >> aware of some concrete harm that might cause though? What happens if
> >> we overdo it? Is select-for-update on a huge row more expensive than
> >> the whole overhead of manipulating the Lock?
> >>
> >> Just trying to figure out if intuition is leading me astray here.
> >>
> >
> > You are right. There should be no difference apart from a little bump in
> > memory usage. But I think it should be fine.
> >
> >>> Will update etherpad by tomorrow.
> >>
> >> OK, thanks.
> >>
> >> cheers,
> >> Zane.
> >>
>
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev