[openstack-dev] [Heat] Convergence proof-of-concept showdown

Anant Patil anant.patil at hp.com
Thu Jan 8 12:24:52 UTC 2015

On 08-Jan-15 16:09, Anant Patil wrote:
> On 16-Dec-14 09:41, Zane Bitter wrote:
>> On 15/12/14 09:32, Anant Patil wrote:
>>> On 12-Dec-14 06:29, Zane Bitter wrote:
>>>> On 11/12/14 01:14, Anant Patil wrote:
>>>>> On 04-Dec-14 10:49, Zane Bitter wrote:
>>>>>> On 01/12/14 02:02, Anant Patil wrote:
>>>>>>> On GitHub:https://github.com/anantpatil/heat-convergence-poc
>>>>>> I'm trying to review this code at the moment, and finding some stuff I
>>>>>> don't understand:
>>>>>> https://github.com/anantpatil/heat-convergence-poc/blob/master/heat/engine/stack.py#L911-L916
>>>>>> This appears to loop through all of the resources *prior* to kicking off
>>>>>> any actual updates to check if the resource will change. This is
>>>>>> impossible to do in general, since a resource may obtain a property
>>>>>> value from an attribute of another resource and there is no way to know
>>>>>> whether an update to said other resource would cause a change in the
>>>>>> attribute value.
>>>>>> In addition, no attempt to catch UpdateReplace is made. Although that
>>>>>> looks like a simple fix, I'm now worried about the level to which this
>>>>>> code has been tested.
>>>>> We were working on new branch and as we discussed on Skype, we have
>>>>> handled all these cases. Please have a look at our current branch:
>>>>> https://github.com/anantpatil/heat-convergence-poc/tree/graph-version
>>>>> When a new resource is taken for convergence, its children are loaded
>>>>> and the resource definition is re-parsed. The frozen resource definition
>>>>> will have all the "get_attr" resolved.
>>>>>> I'm also trying to wrap my head around how resources are cleaned up in
>>>>>> dependency order. If I understand correctly, you store in the
>>>>>> ResourceGraph table the dependencies between various resource names in
>>>>>> the current template (presumably there could also be some left around
>>>>>> from previous templates too?). For each resource name there may be a
>>>>>> number of rows in the Resource table, each with an incrementing version.
>>>>>> As far as I can tell though, there's nowhere that the dependency graph
>>>>>> for _previous_ templates is persisted? So if the dependency order
>>>>>> changes in the template we have no way of knowing the correct order to
>>>>>> clean up in any more? (There's not even a mechanism to associate a
>>>>>> resource version with a particular template, which might be one avenue
>>>>>> by which to recover the dependencies.)
>>>>>> I think this is an important case we need to be able to handle, so I
>>>>>> added a scenario to my test framework to exercise it and discovered that
>>>>>> my implementation was also buggy. Here's the fix:
>>>>>> https://github.com/zaneb/heat-convergence-prototype/commit/786f367210ca0acf9eb22bea78fd9d51941b0e40
>>>>> Thanks for pointing this out Zane. We too had a buggy implementation for
>>>>> handling inverted dependency. I had a hard look at our algorithm where
>>>>> we were continuously merging the edges from new template into the edges
>>>>> from previous updates. It was an optimized way of traversing the graph
>>>>> in both forward and reverse order with out missing any resources. But,
>>>>> when the dependencies are inverted,  this wouldn't work.
>>>>> We have changed our algorithm. The changes in edges are noted down in
>>>>> DB, only the delta of edges from previous template is calculated and
>>>>> kept. At any given point of time, the graph table has all the edges from
>>>>> current template and delta from previous templates. Each edge has
>>>>> template ID associated with it.
>>>> The thing is, the cleanup dependencies aren't really about the template.
>>>> The real resources really depend on other real resources. You can't
>>>> delete a Volume before its VolumeAttachment, not because it says so in
>>>> the template but because it will fail if you try. The template can give
>>>> us a rough guide in advance to what those dependencies will be, but if
>>>> that's all we keep then we are discarding information.
>>>> There may be multiple versions of a resource corresponding to one
>>>> template version. Even worse, the actual dependencies of a resource
>>>> change on a smaller time scale than an entire stack update (this is the
>>>> reason the current implementation updates the template one resource at a
>>>> time as we go).
>>> Absolutely! The edges from the template are kept only for the reference
>>> purposes. When we have a resource in new template, its template ID will
>>> also be marked to current template. At any point of time, realized
>>> resource will from current template, even if it were found in previous
>>> templates. The template ID "moves" for a resource if it is found.
>> In theory (disclaimer: I didn't implement this yet) it can change on an 
>> even smaller timescale than that. The existing plugins are something of 
>> a black box to us: if a failure occurs we don't necessarily know whether 
>> the real-world dependency is on the old or new version of another resource.
> Yes, and that's why we rollback the failed resource and its dependent
> resources to older versions, provided that, the older resources are not
> deleted unless update is done. It is easier with template Id as we know
> the previous complete template.
>>>> Given that our Resource entries in the DB are in 1:1 correspondence with
>>>> actual resources (we create a new one whenever we need to replace the
>>>> underlying resource), I found it makes the most conceptual and practical
>>>> sense to store the requirements in the resource itself, and update them
>>>> at the time they actually change in the real world (bonus: introduces no
>>>> new locking issues and no extra DB writes). I settled on this after a
>>>> legitimate attempt at trying other options, but they didn't work out:
>>>> https://github.com/zaneb/heat-convergence-prototype/commit/a62958342e8583f74e2aca90f6239ad457ba984d
>>> I am okay with the notion of graph stored in resource table.
>>>>> For resource clean up, we start from the
>>>>> first template (template which was completed and updates were made on
>>>>> top of it, empty template otherwise), and move towards the current
>>>>> template in the order in which the updates were issued, and for each
>>>>> template the graph (edges if found for the template) is traversed in
>>>>> reverse order and resources are cleaned-up.
>>>> I'm pretty sure this is backwards - you'll need to clean up newer
>>>> resources first because they may reference resources from older
>>>> templates. Also if you have a stubborn old resource that won't delete
>>>> you don't want that to block cleanups of anything newer.
>>>> You're also serialising some of the work unnecessarily because you've
>>>> discarded the information about dependencies that cross template
>>>> versions, forcing you to clean up only one template version at a time.
>>> Since only the changed set is stored, it is going to be smaller in most
>>> of the cases. Within each previous template, if there are any edges,
>>> they are tried for clean-up in reverse order. If a resource is not
>>> updated, but available in new template, the template ID will be current
>>> template ID as I mentioned above. With the template, the process is
>>> concurrent, but across the templates, it will be serial. I am okay with
>>> this since in real world, the change in dependencies may not be as much
>>> as we think.
>> I agree the serialisation probably won't make a big difference in most 
>> cases, although autoscaling is an obvious and important one where 
>> serialising a lot of things unnecessarily could have a big impact on the 
>> user.
> I was probably over-cautious when I tested it, but I can see that the
> resource clean-up can be triggered concurrently. This is possible since
> the update-replaced resources have different template IDs for replaced
> and replacement.
>>>>> The process ends up with
>>>>> current template being traversed in reverse order and resources being
>>>>> cleaned up. All the update-replaced resources from the older templates
>>>>> (older updates in concurrent updates) are cleaned up in the order in
>>>>> which they are suppose to be.
>>>>> Resources are now tied to template, they have a template_id instead of
>>>>> version. As we traverse the graph, we know which template we are working
>>>>> on, and can take the relevant action on resource.
>>>>> For rollback, another update is issued with the last completed template
>>>>> (it is designed to have an empty template as last completed template by
>>>>> default). The current template being worked on becomes predecessor for
>>>>> the new incoming template. In case of rollback, the last completed
>>>>> template becomes incoming new template, the current becomes the new
>>>>> template's predecessor and the successor of last completed template will
>>>>> have no predecessor. All these changes are available in the
>>>>> 'graph-version' branch. (The branch name is a misnomer though!)
>>>>> I think it is simpler to think about stack and concurrent updates when
>>>>> we associate resources and edges with template, and stack with current
>>>>> template and its predecessors (if any).
>>>> It doesn't seem simple to me because it's trying to reconstruct reality
>>>> from a lossy version of history. The simplest way to think about it, in
>>>> my opinion is this:
>>>> - When updating resources, respect their dependencies as given in the
>>>> template
>>>> - When checking resources to clean up, respect their actual, current
>>>> real-world dependencies, and check replacement resources before the
>>>> resources that they replaced.
>>>> - Don't check a resource for clean up until it has been updated to the
>>>> latest template.
>>>>> I also think that we should decouple Resource from Stack. This is really
>>>>> a hindrance when workers work on individual resources. The resource
>>>>> should be abstracted enough from stack for the worker to work on the
>>>>> resource alone. The worker should load the required resource plug-in and
>>>>> start converging.
>>>> I think that's a worthy goal, and it would be really nice if we could
>>>> load a Resource completely independently of its Stack, and I know this
>>>> has always been a design goal of yours (hence you're caching the
>>>> resource definition in the Resource rather than getting it from the
>>>> template).
>>>> That said, I am convinced it's an unachievable goal, and I believe we
>>>> should give up on it.
>>>> - We'll always need to load _some_ central thing (e.g. to find out if
>>>> the current traversal is still the valid one), so it might as well be
>>>> the Stack.
>>>> - Existing plugin abominations like HARestarter expect a working Stack
>>>> object to be provided so it can go hunting for other resources.
>>>> I think the best we can do is try to make heat.engine.stack.Stack as
>>>> lazy as possible so that it only does extra work when strictly required,
>>>> and just accept that the stack will always be loaded from the database.
>>>> I am also strongly in favour of treating the idea of caching the
>>>> unresolved resource definition in the Resource table as a straight
>>>> performance optimisation that is completely separate to the convergence
>>>> work. It's going to be inevitably ugly because there is no
>>>> template-format-independent way to serialise a resource definition
>>>> (while resource definition objects themselves are designed to be
>>>> inherently template-format-independent). Once phase 1 is complete we can
>>>> decide whether it's worth it based on measuring the actual performance
>>>> improvement.
>>> The idea of keeping the resource definition in DB was to decouple from
>>> the stack as much as possible. It is not for performance optimization.
>> OK, that's fair, but then you said...
>>> When a worker job is received, the worker won't have to load the entire
>>> template from the stack, get the realized definitions etc.
>> ...that's a performance optimisation.
> Ok... I was having decoupling of resource from stack in mind! The worker
> don't load the stack, but only the resource and work-on. I did mention
> scale somewhere but it was with SHA hash of resource definition for
> quick comparison.
>>> The worker
>>> simply loads the resource with definition with minimal stack (lazily
>>> loaded stack), and starts working on it. This work is towards making
>>> stack lazily loaded.
>>> Like you said, the stack has to be minimal when it is loaded. There will
>>> be few information needed by the worker (like resource definition and
>>> requirers/requirements) which can be obtained without loading the
>>> template and recalculating the graph again and again.
>> None of this changes the fundamental design, so let's get the important 
>> stuff done first and when that's finished we can see if this makes a 
>> meaningful (positive) difference.
>>>> (Note that we _already_ store the _resolved_ properties of the resource,
>>>> which is what the observer will be comparing against for phase 2, so
>>>> there should be no reason for the observer to need to load the stack.)
>>>>> The READEME.rst is really helpful for bringing up the minimal devstack
>>>>> and test the PoC. I also has some notes on design.
>>>> [snip]
>>>>> Zane, I have few questions:
>>>>> 1. Our current implementation is based on notifications from worker so
>>>>> that the engine can take up next set of tasks. I don't see this in your
>>>>> case. I think we should be doing this. It gels well with observer
>>>>> notification mechanism. When the observer comes, it would send a
>>>>> converge notification. Both, the provisioning of stack and the
>>>>> continuous observation, happens with notifications (async message
>>>>> passing). I see that the workers in your case pick up the parent when/if
>>>>> it is done and schedules it or updates the sync point.
>>>> I'm not quite sure what you're asking here, so forgive me if I'm
>>>> misunderstanding. What I think you're saying is that where my prototype
>>>> propagates notifications thus:
>>>>     worker -> worker
>>>>            -> worker
>>>> (where -> is an async message)
>>>> you would prefer it to do:
>>>>     worker -> engine -> worker
>>>>                      -> worker
>>>> Is that right?
>>>> To me the distinction seems somewhat academic, given that we've decided
>>>> that the engine and the worker will be the same process. I don't see a
>>>> disadvantage to doing right away stuff that we know needs to be done
>>>> right away. Obviously we should factor the code out tidily into a
>>>> separate method where we can _also_ expose it as a notification that can
>>>> be triggered by the continuous observer.
>>>> You mentioned above that you thought the workers should not ever load
>>>> the Stack, and I think that's probably the reason you favour this
>>>> approach: the 'worker' would always load just the Resource and the
>>>> 'engine' (even though they're really the same) would always load just
>>>> the Stack, right?
>>>> However, as I mentioned above, I think we'll want/have to load the Stack
>>>> in the worker anyway, so eliminating the extra asynchronous call
>>>> eliminates the performance penalty for having to do so.
>>> When we started, we had the following idea:
>>>              (provisions with
>>>               converge API)             (Observe)
>>>      Engine ------------------> Worker --------------> Observer
>>>        |                         | |                        |
>>>        |                         | |                        |
>>>        ^     (done)              v ^         (converge)     v
>>>         ------------------------    ------------------------
>>> The boundary has to be clearly demarcated, as done above. While
>>> provisioning a stack, the Engine uses the converge facility, as does the
>>> observer when it sees a need to converge a resource.
>>> Even though we have these logical things in same process, by not clearly
>>> demarcating the responsibilities of each logical entity, we will end-up
>>> with issues which we face when we mix the responsibilities.
>> Sure, and that's why we don't write all the code in one big method, but 
>> it doesn't follow that it has to be in a different process. Or in this 
>> case, since they're actually the same process anyway, it doesn't imply 
>> that they have to be called over RPC.
>>> As you mentioned, the worker should have a notification API exposed
>>> which can be used not only by the Observer, but also by the engine to
>>> provision (CRUD) a stack resource.
>> yep
>>> The engine decides on which resource
>>> to be converged next when a resource is done.
>> Who says it has to? Each resource knows what depends on it... let it 
>> send the notifications directly. The engine is just there to do whatever 
>> setup requires everything to be loaded at once and then kick things off 
>> by starting the worker tasks for resources that don't depend on anything.
> I am fine with this.
>>> The observer is only
>>> responsible for once resource and issues a converge request if it has
>>> to.
>>>>> 2. The dependency graph travels everywhere. IMHO, we can keep the graph
>>>>> in DB and let the workers work on a resource, and engine decide which
>>>>> one to be scheduled next by looking at the graph. There wouldn't be a
>>>>> need for a lock here, in the engine, the DB transactions should take
>>>>> care of concurrent DB updates. Our new PoC follows this model.
>>>> I'm fine with keeping the graph in the DB instead of having it flow with
>>>> the notifications.
>>>>> 3. The request ID is passed down to check_*_complete. Would the check
>>>>> method be interrupted if new request arrives? IMHO, the check method
>>>>> should not get interrupted. It should return back when the resource has
>>>>> reached a concrete state, either failed or completed.
>>>> I agree, it should not be interrupted.
>>>> I've started to think of phase 1 and phase 2 like this:
>>>> 1) Make locks more granular: stack-level lock becomes resource-level
>>>> 2) Get rid of locks altogether
>>> I thought that the DB transactions were enough for concurrency issues.
>>> The resources status is enough to know whether to trigger the update or
>>> not. For resources currently in progress, the engine can wait for
>>> notifications from then and trigger the update. With completely
>>> distributed graph and decision making it is hard to achieve. If the
>>> graph is stored in DB, the decision to trigger update is at one place
>>> (by looking at the graph only) and hence locks are not required.
>> Right, right, but it's still a lock (maybe I should have said mutex) 
>> because only one traversal can be operating on each resource at a time. 
>> We're just arguing semantics now. It doesn't matter whether you do 
>> select-for-update on the Resource table to mark it as in progress or if 
>> you do select-for-update on a separate Lock table to mark it as in 
>> progress, only one update can be working on a given resource at a time, 
>> and this is a big improvement in granularity because before only one 
>> update could be working on the whole stack at a time, but not as big an 
>> improvement as phase 2 will be.
> The select-for-update is good only for updates on one table. What if the
> two tables are to be updated? I have stored the graph in a table and I
> update both the graph and some information in resource in a transaction.
> I don't see a value in keeping only one table with all the information
> stuffed in it. Ideally, DB tables have to normalized to avoid data
> duplication. The graph table is normalized for graph information and
> resource table for resource.
>>>> So in phase 1 we'll lock the resources and like you said, it will return
>>>> back when it has reached a concrete state. In phase 2 we'll be able to
>>>> just update the goal state for the resource and the observe/converge
>>>> process will be able to automagically find the best way to that state
>>>> regardless of whether it was in the middle of transitioning to another
>>>> state or not. Or something. But that's for the future :)
>>> I am thinking about not having any locks at all. The concurrency is
>>> handled by DB transaction and notifications gives us hint on deciding
>>> when and whether to update the next set of resources. If a resource is
>>> in progress currently, and an update is issued on that resource, the
>>> graph API doesn't return it as a ready-to-be-converged resource as a
>>> previous version is in progress. When it is done, and notification is
>>> received, the graph DB API returns it as next to converge.
>> So sort of a hybrid with Michal's approach.
> Not sure what is Michal's approach.
>> I think I get what you're saying here. If we reach a particular resource 
>> in the traverse and it is still in-progress from a previous update we 
>> can ignore it and use the notification from the previous update 
>> finishing to trigger it later, provided that the notifications go 
>> through the engine because the in-progress resource can't possibly know 
>> anything about traversals that started later.
>> Or on the other hand we could just sleep for a while until it's done and 
>> then carry on.
> I would prefer notifications triggering next set of actions rather than
> keeping many eventlets sleeping.
>>>>> 4. Lot of synchronization issues which we faced in our PoC cannot be
>>>>> encountered with the framework. How do we evaluate what happens when
>>>>> synchronization issues are encountered (like stack lock kind of issues
>>>>> which we are replacing with DB transaction).
>>>> Right, yeah, this is obviously the big known limitation of the
>>>> simulator. I don't have a better answer other than to Think Very Hard
>>>> about it.
>>>> Designing software means solving for hundreds of constraints - too many
>>>> for a human to hold in their brain at the same time. The purpose of
>>>> prototyping is to fix enough of the responses to those constraints in a
>>>> concrete form to allow reasoning about the remaining ones to become
>>>> tractable. If you fix solutions for *all* of the constraints, then what
>>>> you've built is by definition not a prototype but the final product.
>>>> One technique available to us is to encapsulate the parts of the
>>>> algorithm that are subject to synchronisation issues behind abstractions
>>>> that offer stronger guarantees. Then in order to have confidence in the
>>>> design we need only satisfy ourselves that we have analysed the
>>>> guarantees correctly and that a concrete implementation offering those
>>>> same guarantees is possible. For example, the SyncPoints are shown to
>>>> work under the assumption that they are not subject to race conditions,
>>>> and the SyncPoint code is small enough that we can easily see that it
>>>> can be implemented in an atomic fashion using the same DB primitives
>>>> already proven to work by StackLock. Therefore we can have a very high
>>>> confidence (but not proof) that the overall algorithm will work when
>>>> implemented in the final product.
>>>> Having Thought Very Hard about it, I'm as confident as I can be that I'm
>>>> not relying on any synchronisation properties that can't be implemented
>>>> using select-for-update on a single database row. There will of course
>>>> be surprises at implementation time, but I hope that won't be one of
>>>> them and anticipate that any changes required to the plan will be
>>>> localised and not wide-ranging.
>>>> (This is in contrast BTW to my centralised-graph branch, linked above,
>>>> where it became very obvious that it would require some sort of external
>>>> locking - so there is reason to think that this process can reveal
>>>> architectural problems related to synchronisation where they are present.)
>>>> cheers,
>>>> Zane.
>> _______________________________________________
>> OpenStack-dev mailing list
>> OpenStack-dev at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> Zane,
> I was playing around with the PoC and found couple of issues:
> 1. The stack was failing when there were single disjoint resources or
> just one resource in template. The graph did not include this resource
> due to a minor bug in dependency_names(). I have added a test case and
> fix here:
> https://github.com/anantpatil/heat-convergence-prototype/commit/b58abd77cf596475ecf3f19ed38adf8ad3bb6b3b
> 2. The resource graph is created with keys in both forward order
> traversal and reverse order traversal and the update will finish the
> forward order and attempt the reverse order. If this is the case, then
> the update-replaced resources will be deleted before the update is
> complete and if the update fails, the old resource is not available for
> roll-back; a new resource has to be created then. I have added a test
> case at the above mentioned location.
> In our PoC, the updates (concurrent updates) won't remove a
> update-replaced resource until all the resources are updated, and
> resource clean-up phase is started. It is unacceptable to remove the old
> resource to be rolled-back to since it may have changes which the user
> doesn't want to loose; and that's why probably they use the roll-back
> flag.

Here is the link to scenario that I mentioned in (2) above. Please
ignore the one specified in above link in (1).
This scenario simulates an update replace of a resource and then
rolls-back. The test case is to test that the replaced resource is not
re-created for rollback. The old resource should be retained for
roll-back cases.

> - Anant
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

More information about the OpenStack-dev mailing list