[openstack-dev] [Heat] Convergence proof-of-concept showdown
Gurjar, Unmesh
unmesh.gurjar at hp.com
Wed Dec 17 18:05:46 UTC 2014
> -----Original Message-----
> From: Zane Bitter [mailto:zbitter at redhat.com]
> Sent: Tuesday, December 16, 2014 9:43 AM
> To: openstack-dev at lists.openstack.org
> Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept
> showdown
>
> On 15/12/14 07:47, Murugan, Visnusaran wrote:
> > Hi Zane,
> >
> > We have been going through this chain for quite some time now and we
> still feel a disconnect in our understanding.
>
> Yes, I thought last week that we were on the same page, but now it looks like
> we're drifting off again :(
>
> > Can you put up a etherpad where we can understand your approach.
>
> Maybe you could put up an etherpad with your questions. Practically all of
> the relevant code is in Stack._create_or_update, Stack._dependencies and
> Converger.check_resource. That's 134 lines of code by my count.
> There's not a lot more I can usefully say about it without knowing which parts
> exactly you're stuck on, but I can definitely answer specific questions.
>
> > For example: for storing resource dependencies, Are you storing its
> > name, version tuple or just its ID.
>
> I'm storing a tuple of its name and database ID. The data structure is
> resource.GraphKey. I was originally using the name for something, but I
> suspect I could probably drop it now and just store the database ID, but I
> haven't tried it yet. (Having the name in there definitely makes debugging
> more pleasant though ;)
>
I agree, having name might come in handy while debugging!
> When I build the traversal graph each node is a tuple of the GraphKey and a
> boolean to indicate whether it corresponds to an update or a cleanup
> operation (both can appear for a single resource in the same graph).
Just to confirm my understanding, cleanup operation takes care of both:
1. resources which are deleted as a part of update and
2. previous versioned resource which was updated by replacing with a new
resource (UpdateReplace scenario)
Also, the cleanup operation is performed after the update completes successfully.
>
> > If I am correct, you are updating all resources on update regardless
> > of their change which will be inefficient if stack contains a million resource.
>
> I'm calling update() on all resources regardless of change, but update() will
> only call handle_update() if something has changed (unless the plugin has
> overridden Resource._needs_update()).
>
> There's no way to know whether a resource needs to be updated before
> you're ready to update it, so I don't think of this as 'inefficient', just 'correct'.
>
> > We have similar questions regarding other areas in your
> > implementation, which we believe if we understand the outline of your
> > implementation. It is difficult to get a hold on your approach just by looking
> at code. Docs strings / Etherpad will help.
> >
> >
> > About streams, Yes in a million resource stack, the data will be huge, but
> less than template.
>
> No way, it's O(n^3) (cubed!) in the worst case to store streams for each
> resource.
>
> > Also this stream is stored
> > only In IN_PROGRESS resources.
>
> Now I'm really confused. Where does it come from if the resource doesn't
> get it until it's already in progress? And how will that information help it?
>
When an operation on stack is initiated, the stream will be identified. To begin
the operation, the action is initiated on the leaf (or root) resource(s) and the
stream is stored (only) in this/these IN_PROGRESS resource(s).
The stream should then keep getting passed to the next/previous level of resource(s) as
and when the dependencies for the next/previous level of resource(s) are met.
> > The reason to have entire dependency list to reduce DB queries while a
> stack update.
>
> But we never need to know that. We only need to know what just happened
> and what to do next.
>
As mentioned earlier, each level of resources in a graph pass on the dependency
list/stream to their next/previous level of resources. This is information should further
be used to determine what is to be done next and drive the operation to completion.
> > When you have a singular dependency on each resources similar to your
> > implantation, then we will end up loading Dependencies one at a time and
> altering almost all resource's dependency regardless of their change.
> >
> > Regarding a 2 template approach for delete, it is not actually 2
> > different templates. Its just that we have a delete stream To be taken up
> post update.
>
> That would be a regression from Heat's current behaviour, where we start
> cleaning up resources as soon as they have nothing depending on them.
> There's not even a reason to make it worse than what we already have,
> because it's actually a lot _easier_ to treat update and clean up as the same
> kind of operation and throw both into the same big graph. The dual
> implementations and all of the edge cases go away and you can just trust in
> the graph traversal to do the Right Thing in the most parallel way possible.
>
> > (Any post operation will be handled as an update) This approach is
> > True when Rollback==True We can always fall back to regular stream
> > (non-delete stream) if Rollback=False
>
> I don't understand what you're saying here.
Just to elaborate, in case of update with rollback, there will be 2 streams of
operations:
1. first is the create and update resource stream
2. second is the stream for deleting resources (which will be taken if the first stream
completes successfully).
However, in case of an update without rollback, there will be a single stream of
operation (no delete/cleanup stream required).
>
> > In our view we would like to have only one basic operation and that is
> UPDATE.
> >
> > 1. CREATE will be an update where a realized graph == Empty 2. UPDATE
> > will be an update where realized graph == Full/Partial realized
> > (possibly with a delete stream as post operation if Rollback==True) 3.
> DELETE will be just another update with an empty to_be_realized_graph.
>
> Yes, that goes without saying. In my implementation
> Stack._create_or_update() handles all three operations.
>
> > It would be great if we can freeze a stable approach by mid-week as
> > Christmas vacations are round the corner. :) :)
> >
> >> -----Original Message-----
> >> From: Zane Bitter [mailto:zbitter at redhat.com]
> >> Sent: Saturday, December 13, 2014 5:43 AM
> >> To: openstack-dev at lists.openstack.org
> >> Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept
> >> showdown
> >>
> >> On 12/12/14 05:29, Murugan, Visnusaran wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Zane Bitter [mailto:zbitter at redhat.com]
> >>>> Sent: Friday, December 12, 2014 6:37 AM
> >>>> To: openstack-dev at lists.openstack.org
> >>>> Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept
> >>>> showdown
> >>>>
> >>>> On 11/12/14 08:26, Murugan, Visnusaran wrote:
> >>>>>>> [Murugan, Visnusaran]
> >>>>>>> In case of rollback where we have to cleanup earlier version of
> >>>>>>> resources,
> >>>>>> we could get the order from old template. We'd prefer not to have
> >>>>>> a graph table.
> >>>>>>
> >>>>>> In theory you could get it by keeping old templates around. But
> >>>>>> that means keeping a lot of templates, and it will be hard to
> >>>>>> keep track of when you want to delete them. It also means that
> >>>>>> when starting an update you'll need to load every existing
> >>>>>> previous version of the template in order to calculate the
> >>>>>> dependencies. It also leaves the dependencies in an ambiguous
> >>>>>> state when a resource fails, and although that can be worked
> >>>>>> around it will be a giant pain to
> >> implement.
> >>>>>>
> >>>>>
> >>>>> Agree that looking to all templates for a delete is not good. But
> >>>>> baring Complexity, we feel we could achieve it by way of having an
> >>>>> update and a delete stream for a stack update operation. I will
> >>>>> elaborate in detail in the etherpad sometime tomorrow :)
> >>>>>
> >>>>>> I agree that I'd prefer not to have a graph table. After trying a
> >>>>>> couple of different things I decided to store the dependencies in
> >>>>>> the Resource table, where we can read or write them virtually for
> >>>>>> free because it turns out that we are always reading or updating
> >>>>>> the Resource itself at exactly the same time anyway.
> >>>>>>
> >>>>>
> >>>>> Not sure how this will work in an update scenario when a resource
> >>>>> does not change and its dependencies do.
> >>>>
> >>>> We'll always update the requirements, even when the properties
> >>>> don't change.
> >>>>
> >>>
> >>> Can you elaborate a bit on rollback.
> >>
> >> I didn't do anything special to handle rollback. It's possible that
> >> we need to - obviously the difference in the UpdateReplace + rollback
> >> case is that the replaced resource is now the one we want to keep,
> >> and yet the replaced_by/replaces dependency will force the newer
> >> (replacement) resource to be checked for deletion first, which is an
> >> inversion of the usual order.
> >>
> >> However, I tried to think of a scenario where that would cause
> >> problems and I couldn't come up with one. Provided we know the
> >> actual, real-world dependencies of each resource I don't think the
> >> ordering of those two checks matters.
> >>
> >> In fact, I currently can't think of a case where the dependency order
> >> between replacement and replaced resources matters at all. It matters
> >> in the current Heat implementation because resources are artificially
> >> segmented into the current and backup stacks, but with a holistic
> >> view of dependencies that may well not be required. I tried taking
> >> that line out of the simulator code and all the tests still passed.
> >> If anybody can think of a scenario in which it would make a difference, I
> would be very interested to hear it.
> >>
> >> In any event though, it should be no problem to reverse the direction
> >> of that one edge in these particular circumstances if it does turn
> >> out to be a problem.
> >>
> >>> We had an approach with depends_on
> >>> and needed_by columns in ResourceTable. But dropped it when we
> >> figured
> >>> out we had too many DB operations for Update.
> >>
> >> Yeah, I initially ran into this problem too - you have a bunch of
> >> nodes that are waiting on the current node, and now you have to go
> >> look them all up in the database to see what else they're waiting on
> >> in order to tell if they're ready to be triggered.
> >>
> >> It turns out the answer is to distribute the writes but centralise
> >> the reads. So at the start of the update, we read all of the
> >> Resources, obtain their dependencies and build one central graph[1].
> >> We than make that graph available to each resource (either by passing
> >> it as a notification parameter, or storing it somewhere central in
> >> the DB that they will all have to read anyway, i.e. the Stack). But
> >> when we update a dependency we don't update the central graph, we
> >> update the individual Resource so there's no global lock required.
> >>
> >> [1]
> >> https://github.com/zaneb/heat-convergence-
> prototype/blob/distributed-
> >> graph/converge/stack.py#L166-L168
> >>
> >>>>> Also taking care of deleting resources in order will be an issue.
> >>>>
> >>>> It works fine.
> >>>>
> >>>>> This implies that there will be different versions of a resource
> >>>>> which will even complicate further.
> >>>>
> >>>> No it doesn't, other than the different versions we already have
> >>>> due to UpdateReplace.
> >>>>
> >>>>>>>> This approach reduces DB queries by waiting for completion
> >>>>>>>> notification
> >>>>>> on a topic. The drawback I see is that delete stack stream will
> >>>>>> be huge as it will have the entire graph. We can always dump such
> >>>>>> data in ResourceLock.data Json and pass a simple flag
> >>>>>> "load_stream_from_db" to converge RPC call as a workaround for
> >>>>>> delete
> >>>> operation.
> >>>>>>>
> >>>>>>> This seems to be essentially equivalent to my 'SyncPoint'
> >>>>>>> proposal[1], with
> >>>>>> the key difference that the data is stored in-memory in a Heat
> >>>>>> engine rather than the database.
> >>>>>>>
> >>>>>>> I suspect it's probably a mistake to move it in-memory for
> >>>>>>> similar reasons to the argument Clint made against synchronising
> >>>>>>> the marking off
> >>>>>> of dependencies in-memory. The database can handle that and the
> >>>>>> problem of making the DB robust against failures of a single
> >>>>>> machine has already been solved by someone else. If we do it
> >>>>>> in-memory we are just creating a single point of failure for not
> >>>>>> much gain. (I guess you could argue it doesn't matter, since if
> >>>>>> any Heat engine dies during the traversal then we'll have to kick
> >>>>>> off another one anyway, but it does limit our options if that
> >>>>>> changes in the
> >>>>>> future.) [Murugan, Visnusaran] Resource completes, removes itself
> >>>>>> from resource_lock and notifies engine. Engine will acquire
> >>>>>> parent lock and initiate parent only if all its children are
> >>>>>> satisfied (no child entry in
> >>>> resource_lock).
> >>>>>> This will come in place of Aggregator.
> >>>>>>
> >>>>>> Yep, if you s/resource_lock/SyncPoint/ that's more or less
> >>>>>> exactly what I
> >>>> did.
> >>>>>> The three differences I can see are:
> >>>>>>
> >>>>>> 1) I think you are proposing to create all of the sync points at
> >>>>>> the start of the traversal, rather than on an as-needed basis.
> >>>>>> This is probably a good idea. I didn't consider it because of the
> >>>>>> way my prototype evolved, but there's now no reason I can see not
> to do this.
> >>>>>> If we could move the data to the Resource table itself then we
> >>>>>> could even get it for free from an efficiency point of view.
> >>>>>
> >>>>> +1. But we will need engine_id to be stored somewhere for recovery
> >>>> purpose (easy to be queried format).
> >>>>
> >>>> Yeah, so I'm starting to think you're right, maybe the/a Lock table
> >>>> is the right thing to use there. We could probably do it within the
> >>>> resource table using the same select-for-update to set the
> >>>> engine_id, but I agree that we might be starting to jam too much into
> that one table.
> >>>>
> >>>
> >>> yeah. Unrelated values in resource table. Upon resource completion
> >>> we have to unset engine_id as well as compared to dropping a row
> >>> from
> >> resource lock.
> >>> Both are good. Having engine_id in resource_table will reduce db
> >>> operaions in half. We should go with just resource table along with
> >> engine_id.
> >>
> >> OK
> >>
> >>>>> Sync points are created as-needed. Single resource is enough to
> >>>>> restart
> >>>> that entire stream.
> >>>>> I think there is a disconnect in our understanding. I will detail
> >>>>> it as well in
> >>>> the etherpad.
> >>>>
> >>>> OK, that would be good.
> >>>>
> >>>>>> 2) You're using a single list from which items are removed,
> >>>>>> rather than two lists (one static, and one to which items are
> >>>>>> added) that get
> >>>> compared.
> >>>>>> Assuming (1) then this is probably a good idea too.
> >>>>>
> >>>>> Yeah. We have a single list per active stream which work by
> >>>>> removing Complete/satisfied resources from it.
> >>>>
> >>>> I went to change this and then remembered why I did it this way:
> >>>> the sync point is also storing data about the resources that are
> >>>> triggering it. Part of this is the RefID and attributes, and we
> >>>> could replace that by storing that data in the Resource itself and
> >>>> querying it rather than having it passed in via the notification.
> >>>> But the other part is the ID/key of those resources, which we
> >>>> _need_ to know in order to update the requirements in case one of
> >>>> them has been replaced and thus the graph doesn't reflect it yet.
> >>>> (Or, for that matter, we need it to know where to go looking for
> >>>> the RefId and/or attributes if they're in the
> >>>> DB.) So we have to store some data, we can't just remove items from
> >>>> the required list (although we could do that as well).
> >>>>
> >>>>>> 3) You're suggesting to notify the engine unconditionally and let
> >>>>>> the engine decide if the list is empty. That's probably not a
> >>>>>> good idea - not only does it require extra reads, it introduces a
> >>>>>> race condition that you then have to solve (it can be solved,
> >>>>>> it's just more
> >> work).
> >>>>>> Since the update to remove a child from the list is atomic, it's
> >>>>>> best to just trigger the engine only if the list is now empty.
> >>>>>>
> >>>>>
> >>>>> No. Notify only if stream has something to be processed. The newer
> >>>>> Approach based on db lock will be that the last resource will
> >>>>> initiate its
> >>>> parent.
> >>>>> This is opposite to what our Aggregator model had suggested.
> >>>>
> >>>> OK, I think we're on the same page on this one then.
> >>>>
> >>>
> >>>
> >>> Yeah.
> >>>
> >>>>>>> It's not clear to me how the 'streams' differ in practical terms
> >>>>>>> from just passing a serialisation of the Dependencies object,
> >>>>>>> other than being incomprehensible to me ;). The current
> >>>>>>> Dependencies implementation
> >>>>>>> (1) is a very generic implementation of a DAG, (2) works and has
> >>>>>>> plenty of
> >>>>>> unit tests, (3) has, with I think one exception, a pretty
> >>>>>> straightforward API,
> >>>>>> (4) has a very simple serialisation, returned by the edges()
> >>>>>> method, which can be passed back into the constructor to recreate
> >>>>>> it, and (5) has an API that is to some extent relied upon by
> >>>>>> resources, and so won't likely be removed outright in any event.
> >>>>>>> Whatever code we need to handle dependencies ought to just
> build
> >>>>>>> on
> >>>>>> this existing implementation.
> >>>>>>> [Murugan, Visnusaran] Our thought was to reduce payload size
> >>>>>> (template/graph). Just planning for worst case scenario (million
> >>>>>> resource
> >>>>>> stack) We could always dump them in ResourceLock.data to be
> >>>>>> loaded by Worker.
> >>
> >> With the latest updates to the Etherpad, I'm even more confused by
> >> streams than I was before.
> >>
> >> One thing I never understood is why do you need to store the whole
> >> path to reach each node in the graph? Surely you only need to know
> >> the nodes this one is waiting on, the nodes waiting on this one and
> >> the ones those are waiting on, not the entire history up to this
> >> point. The size of each stream is theoretically up to O(n^2) and
> >> you're storing n of them - that's going to get painful in this million-
> resource stack.
> >>
> >>>>>> If there's a smaller representation of a graph than a list of
> >>>>>> edges then I don't know what it is. The proposed stream structure
> >>>>>> certainly isn't it, unless you mean as an alternative to storing
> >>>>>> the entire graph once for each resource. A better alternative is
> >>>>>> to store it once centrally - in my current implementation it is
> >>>>>> passed down through the trigger messages, but since only one
> >>>>>> traversal can be in progress at a time it could just as easily be
> >>>>>> stored in the Stack table of the
> >>>> database at the slight cost of an extra write.
> >>>>>>
> >>>>>
> >>>>> Agree that edge is the smallest representation of a graph. But it
> >>>>> does not give us a complete picture without doing a DB lookup. Our
> >>>>> assumption was to store streams in IN_PROGRESS resource_lock.data
> >>>>> column. This could be in resource table instead.
> >>>>
> >>>> That's true, but I think in practice at any point where we need to
> >>>> look at this we will always have already loaded the Stack from the
> >>>> DB for some other reason, so we actually can get it for free. (See
> >>>> detailed discussion in my reply to Anant.)
> >>>>
> >>>
> >>> Aren't we planning to stop loading stack with all resource objects
> >>> in future to Address scalability concerns we currently have?
> >>
> >> We plan on not loading all of the Resource objects each time we load
> >> the Stack object, but I think we will always need to have loaded the
> >> Stack object (for example, we'll need to check the current traversal
> >> ID, amongst other reasons). So if the serialised dependency graph is
> >> stored in the Stack it will be no big deal.
> >>
> >>>>>> I'm not opposed to doing that, BTW. In fact, I'm really
> >>>>>> interested in your input on how that might help make recovery
> >>>>>> from failure more robust. I know Anant mentioned that not storing
> >>>>>> enough data to recover when a node dies was his big concern with
> >>>>>> my current
> >> approach.
> >>>>>>
> >>>>>
> >>>>> With streams, We feel recovery will be easier. All we need is a
> >>>>> trigger :)
> >>>>>
> >>>>>> I can see that by both creating all the sync points at the start
> >>>>>> of the traversal and storing the dependency graph in the database
> >>>>>> instead of letting it flow through the RPC messages, we would be
> >>>>>> able to resume a traversal where it left off, though I'm not sure
> >>>>>> what that buys
> >>>> us.
> >>>>>>
> >>>>>> And I guess what you're suggesting is that by having an explicit
> >>>>>> lock with the engine ID specified, we can detect when a resource
> >>>>>> is stuck in IN_PROGRESS due to an engine going down? That's
> >>>>>> actually pretty
> >>>> interesting.
> >>>>>>
> >>>>>
> >>>>> Yeah :)
> >>>>>
> >>>>>>> Based on our call on Thursday, I think you're taking the idea of
> >>>>>>> the Lock
> >>>>>> table too literally. The point of referring to locks is that we
> >>>>>> can use the same concepts as the Lock table relies on to do
> >>>>>> atomic updates on a particular row of the database, and we can
> >>>>>> use those atomic updates to prevent race conditions when
> >>>>>> implementing SyncPoints/Aggregators/whatever you want to call
> >>>>>> them. It's not that we'd actually use the Lock table itself,
> >>>>>> which implements a mutex and therefore offers only a much slower
> >>>>>> and more stateful way of doing what we want (lock mutex, change
> data, unlock mutex).
> >>>>>>> [Murugan, Visnusaran] Are you suggesting something like a
> >>>>>>> select-for-
> >>>>>> update in resource table itself without having a lock table?
> >>>>>>
> >>>>>> Yes, that's exactly what I was suggesting.
> >>>>>
> >>>>> DB is always good for sync. But we need to be careful not to overdo it.
> >>>>
> >>>> Yeah, I see what you mean now, it's starting to _feel_ like there'd
> >>>> be too many things mixed together in the Resource table. Are you
> >>>> aware of some concrete harm that might cause though? What happens
> >>>> if we overdo it? Is select-for-update on a huge row more expensive
> >>>> than the whole overhead of manipulating the Lock?
> >>>>
> >>>> Just trying to figure out if intuition is leading me astray here.
> >>>>
> >>>
> >>> You are right. There should be no difference apart from little bump
> >>> In memory usage. But I think it should be fine.
> >>>
> >>>>> Will update etherpad by tomorrow.
> >>>>
> >>>> OK, thanks.
> >>>>
> >>>> cheers,
> >>>> Zane.
>
>
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Thanks,
Unmesh G.
irc: unmeshg
More information about the OpenStack-dev
mailing list