[openstack-dev] [Heat] Convergence proof-of-concept showdown

Zane Bitter zbitter at redhat.com
Fri Dec 12 00:59:53 UTC 2014

On 11/12/14 01:14, Anant Patil wrote:
> On 04-Dec-14 10:49, Zane Bitter wrote:
>> On 01/12/14 02:02, Anant Patil wrote:
>>> On GitHub:https://github.com/anantpatil/heat-convergence-poc
>> I'm trying to review this code at the moment, and finding some stuff I
>> don't understand:
>> https://github.com/anantpatil/heat-convergence-poc/blob/master/heat/engine/stack.py#L911-L916
>> This appears to loop through all of the resources *prior* to kicking off
>> any actual updates to check if the resource will change. This is
>> impossible to do in general, since a resource may obtain a property
>> value from an attribute of another resource and there is no way to know
>> whether an update to said other resource would cause a change in the
>> attribute value.
>> In addition, no attempt to catch UpdateReplace is made. Although that
>> looks like a simple fix, I'm now worried about the level to which this
>> code has been tested.
> We were working on new branch and as we discussed on Skype, we have
> handled all these cases. Please have a look at our current branch:
> https://github.com/anantpatil/heat-convergence-poc/tree/graph-version
> When a new resource is taken for convergence, its children are loaded
> and the resource definition is re-parsed. The frozen resource definition
> will have all the "get_attr" resolved.
>> I'm also trying to wrap my head around how resources are cleaned up in
>> dependency order. If I understand correctly, you store in the
>> ResourceGraph table the dependencies between various resource names in
>> the current template (presumably there could also be some left around
>> from previous templates too?). For each resource name there may be a
>> number of rows in the Resource table, each with an incrementing version.
>> As far as I can tell though, there's nowhere that the dependency graph
>> for _previous_ templates is persisted? So if the dependency order
>> changes in the template we have no way of knowing the correct order to
>> clean up in any more? (There's not even a mechanism to associate a
>> resource version with a particular template, which might be one avenue
>> by which to recover the dependencies.)
>> I think this is an important case we need to be able to handle, so I
>> added a scenario to my test framework to exercise it and discovered that
>> my implementation was also buggy. Here's the fix:
>> https://github.com/zaneb/heat-convergence-prototype/commit/786f367210ca0acf9eb22bea78fd9d51941b0e40
> Thanks for pointing this out Zane. We too had a buggy implementation for
> handling inverted dependency. I had a hard look at our algorithm where
> we were continuously merging the edges from new template into the edges
> from previous updates. It was an optimized way of traversing the graph
> in both forward and reverse order with out missing any resources. But,
> when the dependencies are inverted,  this wouldn't work.
> We have changed our algorithm. The changes in edges are noted down in
> DB, only the delta of edges from previous template is calculated and
> kept. At any given point of time, the graph table has all the edges from
> current template and delta from previous templates. Each edge has
> template ID associated with it.

The thing is, the cleanup dependencies aren't really about the template. 
The real resources really depend on other real resources. You can't 
delete a Volume before its VolumeAttachment, not because it says so in 
the template but because it will fail if you try. The template can give 
us a rough guide in advance to what those dependencies will be, but if 
that's all we keep then we are discarding information.

There may be multiple versions of a resource corresponding to one 
template version. Even worse, the actual dependencies of a resource 
change on a smaller time scale than an entire stack update (this is the 
reason the current implementation updates the template one resource at a 
time as we go).

Given that our Resource entries in the DB are in 1:1 correspondence with 
actual resources (we create a new one whenever we need to replace the 
underlying resource), I found it makes the most conceptual and practical 
sense to store the requirements in the resource itself, and update them 
at the time they actually change in the real world (bonus: introduces no 
new locking issues and no extra DB writes). I settled on this after a 
legitimate attempt at trying other options, but they didn't work out: 

> For resource clean up, we start from the
> first template (template which was completed and updates were made on
> top of it, empty template otherwise), and move towards the current
> template in the order in which the updates were issued, and for each
> template the graph (edges if found for the template) is traversed in
> reverse order and resources are cleaned-up.

I'm pretty sure this is backwards - you'll need to clean up newer 
resources first because they may reference resources from older 
templates. Also if you have a stubborn old resource that won't delete 
you don't want that to block cleanups of anything newer.

You're also serialising some of the work unnecessarily because you've 
discarded the information about dependencies that cross template 
versions, forcing you to clean up only one template version at a time.

> The process ends up with
> current template being traversed in reverse order and resources being
> cleaned up. All the update-replaced resources from the older templates
> (older updates in concurrent updates) are cleaned up in the order in
> which they are suppose to be.
> Resources are now tied to template, they have a template_id instead of
> version. As we traverse the graph, we know which template we are working
> on, and can take the relevant action on resource.
> For rollback, another update is issued with the last completed template
> (it is designed to have an empty template as last completed template by
> default). The current template being worked on becomes predecessor for
> the new incoming template. In case of rollback, the last completed
> template becomes incoming new template, the current becomes the new
> template's predecessor and the successor of last completed template will
> have no predecessor. All these changes are available in the
> 'graph-version' branch. (The branch name is a misnomer though!)
> I think it is simpler to think about stack and concurrent updates when
> we associate resources and edges with template, and stack with current
> template and its predecessors (if any).

It doesn't seem simple to me because it's trying to reconstruct reality 
from a lossy version of history. The simplest way to think about it, in 
my opinion is this:
- When updating resources, respect their dependencies as given in the 
- When checking resources to clean up, respect their actual, current 
real-world dependencies, and check replacement resources before the 
resources that they replaced.
- Don't check a resource for clean up until it has been updated to the 
latest template.

> I also think that we should decouple Resource from Stack. This is really
> a hindrance when workers work on individual resources. The resource
> should be abstracted enough from stack for the worker to work on the
> resource alone. The worker should load the required resource plug-in and
> start converging.

I think that's a worthy goal, and it would be really nice if we could 
load a Resource completely independently of its Stack, and I know this 
has always been a design goal of yours (hence you're caching the 
resource definition in the Resource rather than getting it from the 

That said, I am convinced it's an unachievable goal, and I believe we 
should give up on it.

- We'll always need to load _some_ central thing (e.g. to find out if 
the current traversal is still the valid one), so it might as well be 
the Stack.
- Existing plugin abominations like HARestarter expect a working Stack 
object to be provided so it can go hunting for other resources.

I think the best we can do is try to make heat.engine.stack.Stack as 
lazy as possible so that it only does extra work when strictly required, 
and just accept that the stack will always be loaded from the database.

I am also strongly in favour of treating the idea of caching the 
unresolved resource definition in the Resource table as a straight 
performance optimisation that is completely separate to the convergence 
work. It's going to be inevitably ugly because there is no 
template-format-independent way to serialise a resource definition 
(while resource definition objects themselves are designed to be 
inherently template-format-independent). Once phase 1 is complete we can 
decide whether it's worth it based on measuring the actual performance 

(Note that we _already_ store the _resolved_ properties of the resource, 
which is what the observer will be comparing against for phase 2, so 
there should be no reason for the observer to need to load the stack.)

> The READEME.rst is really helpful for bringing up the minimal devstack
> and test the PoC. I also has some notes on design.
> Zane, I have few questions:
> 1. Our current implementation is based on notifications from worker so
> that the engine can take up next set of tasks. I don't see this in your
> case. I think we should be doing this. It gels well with observer
> notification mechanism. When the observer comes, it would send a
> converge notification. Both, the provisioning of stack and the
> continuous observation, happens with notifications (async message
> passing). I see that the workers in your case pick up the parent when/if
> it is done and schedules it or updates the sync point.

I'm not quite sure what you're asking here, so forgive me if I'm 
misunderstanding. What I think you're saying is that where my prototype 
propagates notifications thus:

   worker -> worker
          -> worker

(where -> is an async message)
you would prefer it to do:

   worker -> engine -> worker
                    -> worker

Is that right?

To me the distinction seems somewhat academic, given that we've decided 
that the engine and the worker will be the same process. I don't see a 
disadvantage to doing right away stuff that we know needs to be done 
right away. Obviously we should factor the code out tidily into a 
separate method where we can _also_ expose it as a notification that can 
be triggered by the continuous observer.

You mentioned above that you thought the workers should not ever load 
the Stack, and I think that's probably the reason you favour this 
approach: the 'worker' would always load just the Resource and the 
'engine' (even though they're really the same) would always load just 
the Stack, right?

However, as I mentioned above, I think we'll want/have to load the Stack 
in the worker anyway, so eliminating the extra asynchronous call 
eliminates the performance penalty for having to do so.

> 2. The dependency graph travels everywhere. IMHO, we can keep the graph
> in DB and let the workers work on a resource, and engine decide which
> one to be scheduled next by looking at the graph. There wouldn't be a
> need for a lock here, in the engine, the DB transactions should take
> care of concurrent DB updates. Our new PoC follows this model.

I'm fine with keeping the graph in the DB instead of having it flow with 
the notifications.

> 3. The request ID is passed down to check_*_complete. Would the check
> method be interrupted if new request arrives? IMHO, the check method
> should not get interrupted. It should return back when the resource has
> reached a concrete state, either failed or completed.

I agree, it should not be interrupted.

I've started to think of phase 1 and phase 2 like this:

1) Make locks more granular: stack-level lock becomes resource-level
2) Get rid of locks altogether

So in phase 1 we'll lock the resources and like you said, it will return 
back when it has reached a concrete state. In phase 2 we'll be able to 
just update the goal state for the resource and the observe/converge 
process will be able to automagically find the best way to that state 
regardless of whether it was in the middle of transitioning to another 
state or not. Or something. But that's for the future :)

> 4. Lot of synchronization issues which we faced in our PoC cannot be
> encountered with the framework. How do we evaluate what happens when
> synchronization issues are encountered (like stack lock kind of issues
> which we are replacing with DB transaction).

Right, yeah, this is obviously the big known limitation of the 
simulator. I don't have a better answer other than to Think Very Hard 
about it.

Designing software means solving for hundreds of constraints - too many 
for a human to hold in their brain at the same time. The purpose of 
prototyping is to fix enough of the responses to those constraints in a 
concrete form to allow reasoning about the remaining ones to become 
tractable. If you fix solutions for *all* of the constraints, then what 
you've built is by definition not a prototype but the final product.

One technique available to us is to encapsulate the parts of the 
algorithm that are subject to synchronisation issues behind abstractions 
that offer stronger guarantees. Then in order to have confidence in the 
design we need only satisfy ourselves that we have analysed the 
guarantees correctly and that a concrete implementation offering those 
same guarantees is possible. For example, the SyncPoints are shown to 
work under the assumption that they are not subject to race conditions, 
and the SyncPoint code is small enough that we can easily see that it 
can be implemented in an atomic fashion using the same DB primitives 
already proven to work by StackLock. Therefore we can have a very high 
confidence (but not proof) that the overall algorithm will work when 
implemented in the final product.

Having Thought Very Hard about it, I'm as confident as I can be that I'm 
not relying on any synchronisation properties that can't be implemented 
using select-for-update on a single database row. There will of course 
be surprises at implementation time, but I hope that won't be one of 
them and anticipate that any changes required to the plan will be 
localised and not wide-ranging.

(This is in contrast BTW to my centralised-graph branch, linked above, 
where it became very obvious that it would require some sort of external 
locking - so there is reason to think that this process can reveal 
architectural problems related to synchronisation where they are present.)


More information about the OpenStack-dev mailing list