[openstack-dev] [Heat] Final steps toward a Convergence design

Zane Bitter zbitter at redhat.com
Tue Jan 20 01:36:25 UTC 2015


Hi folks,
I'd like to come to agreement on the last major questions of the 
convergence design. I'm well aware that I am the current bottleneck, as 
I have been struggling to find enough time to make progress on it, but 
I think we are now actually very close.

I believe the last remaining issue to be addressed is the question of 
what to do when we want to update a resource that is still IN_PROGRESS 
as the result of a previous (now cancelled, obviously) update.

There are, of course, a couple of trivial and wrong ways to handle it:

1) Throw UpdateReplace and make a new one
  - This is obviously a terrible solution for the user

2) Poll the DB in a loop until the previous update finishes
  - This is obviously horribly inefficient

So the preferred solution here needs to involve retriggering the 
resource's task in the current update once the one from the previous 
update is complete.


I've implemented some changes in the simulator - although note that, 
unlike the code I implemented previously, this is barely tested (if at 
all), since the simulator runs the tasks serially and therefore never 
hits this case. So code review would be appreciated. I committed the 
changes on a new branch, "resumable":

https://github.com/zaneb/heat-convergence-prototype/commits/resumable

Here is a brief summary:
- The SyncPoints now:
   * are created for every resource, regardless of how many dependencies 
it has.
   * are created at the beginning of an update and deleted before 
beginning another update.
   * contain only the list of satisfied dependencies (and their RefIds 
and attribute values).
- The graph is now stored in the Stack table, rather than passed through 
the chain of trigger notifications.
- We'll use locks in the Resource table to ensure that only one action 
at a time can happen on a Resource.
- When a trigger is received for a resource that is locked (i.e. status 
is IN_PROGRESS and the engine owning it is still alive), the trigger is 
ignored.
- When processing of a resource completes but any of the sync points 
that it needs to notify cannot be found (every resource has at least 
one, since the last resource in each chain must notify the stack that 
it is complete), that indicates that the traversal we were working on 
has been cancelled. In that case we trigger a new check on the 
resource with the data for the now-current update (retrieved from the 
Stack table), provided it is ready, as indicated by its SyncPoint 
entry. (There's a rough sketch of this path just below.)
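
To make that last point concrete, here is a rough sketch of the 
completion path (the names - SyncPoint, Stack, trigger_check and so 
on - are illustrative placeholders, not the exact API in the branch):

    def resource_check_complete(resource, traversal_id, requirers):
        # 'requirers' is the set of sync points this resource has to
        # notify; every resource has at least one, since the last
        # resource in each chain notifies the stack itself.

        # Release the lock *before* looking for sync points - the
        # ordering matters for race (1) below.
        resource.release_lock()

        for requirer in requirers:
            sync_point = SyncPoint.load(requirer, traversal_id)
            if sync_point is None:
                # Our sync points are gone: this traversal has been
                # cancelled. Retrigger ourselves in the traversal
                # currently recorded in the Stack table, provided our
                # own SyncPoint entry there says we're ready.
                current = Stack.load(resource.stack_id).current_traversal
                own_sync_point = SyncPoint.load(resource.name, current)
                if own_sync_point is not None and own_sync_point.ready():
                    trigger_check(resource, current,
                                  own_sync_point.satisfied_data())
                return

            # Normal case: record our RefId and attribute values, and
            # trigger the requirer once all of its dependencies have
            # reported in.
            sync_point.satisfy(resource.name, resource.refid(),
                               resource.attributes())
            if sync_point.ready():
                trigger_check(requirer, traversal_id,
                              sync_point.satisfied_data())

The point is that responsibility for the retrigger always falls on 
whichever task finishes the old check, so the new traversal never has 
to poll or wait.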

I'm not 100% happy with the amount of extra load this puts on the 
database, but I can't see a way to do significantly better and still 
solve this locking issue. Suggestions are welcome. At least the common 
case is considerably better than the worst case.

There are two races here that we need to satisfy ourselves we have 
answers for (I think we do):
1) Since old SyncPoints are deleted before a new update begins, and we 
only look for them after unlocking the resource being processed, I 
don't believe that the previous and the new update can both fail to 
trigger the check on the resource in the new update's traversal. (If 
there are any DB experts out there, I'd be interested in their input on 
this one.)
2) When both the previous and the new update end up triggering a check 
on the resource in the new update's traversal, we'll only perform one 
because one will succeed in locking the resource and the other will just 
be ignored after it fails to acquire the lock. (This one is watertight, 
since both processes are acting on the same lock.)
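
For what it's worth, the reason (2) holds is that acquiring the lock is 
a single atomic compare-and-swap on the Resource row. Something along 
these lines (a SQLAlchemy-flavoured sketch; the table and column names 
are illustrative, not the real schema):

    from sqlalchemy import and_

    def try_lock(conn, resources, resource_id, engine_id):
        # Atomically claim the resource row: the WHERE clause only
        # matches when nobody currently holds the lock, so exactly one
        # of the two competing triggers can succeed.
        result = conn.execute(
            resources.update()
            .where(and_(resources.c.id == resource_id,
                        resources.c.engine_id.is_(None)))
            .values(engine_id=engine_id))
        # rowcount == 1 means we won and can go ahead with the check;
        # rowcount == 0 means the other trigger got there first and
        # this one is simply ignored. (Stealing the lock from a dead
        # engine is omitted here.)
        return result.rowcount == 1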


I believe that this model is very close to what Anant and his team are 
proposing. Arguably this means I've been wasting everyone's time, but a 
happier way to look at it is that two mostly independent design efforts 
converging on a similar solution is something we can take a lot of 
confidence from ;)

My next task is to start breaking this down into blueprints that folks 
can start implementing. In the meantime, it would be great if we could 
identify any remaining discrepancies between the two designs and 
completely close those last gaps.

cheers,
Zane.


