[openstack-dev] [Heat] Convergence proof-of-concept showdown

Clint Byrum clint at fewbar.com
Mon Dec 15 19:29:03 UTC 2014

Excerpts from Anant Patil's message of 2014-12-15 07:15:30 -0800:
> On 13-Dec-14 05:42, Zane Bitter wrote:
> > On 12/12/14 05:29, Murugan, Visnusaran wrote:
> >>
> >>
> >>> -----Original Message-----
> >>> From: Zane Bitter [mailto:zbitter at redhat.com]
> >>> Sent: Friday, December 12, 2014 6:37 AM
> >>> To: openstack-dev at lists.openstack.org
> >>> Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept
> >>> showdown
> >>>
> >>> On 11/12/14 08:26, Murugan, Visnusaran wrote:
> >>>>>> [Murugan, Visnusaran]
> >>>>>> In case of rollback where we have to cleanup earlier version of
> >>>>>> resources,
> >>>>> we could get the order from old template. We'd prefer not to have a
> >>>>> graph table.
> >>>>>
> >>>>> In theory you could get it by keeping old templates around. But that
> >>>>> means keeping a lot of templates, and it will be hard to keep track
> >>>>> of when you want to delete them. It also means that when starting an
> >>>>> update you'll need to load every existing previous version of the
> >>>>> template in order to calculate the dependencies. It also leaves the
> >>>>> dependencies in an ambiguous state when a resource fails, and
> >>>>> although that can be worked around it will be a giant pain to implement.
> >>>>>
> >>>>
> >>>> Agree that looking to all templates for a delete is not good. But
> >>>> baring Complexity, we feel we could achieve it by way of having an
> >>>> update and a delete stream for a stack update operation. I will
> >>>> elaborate in detail in the etherpad sometime tomorrow :)
> >>>>
> >>>>> I agree that I'd prefer not to have a graph table. After trying a
> >>>>> couple of different things I decided to store the dependencies in the
> >>>>> Resource table, where we can read or write them virtually for free
> >>>>> because it turns out that we are always reading or updating the
> >>>>> Resource itself at exactly the same time anyway.
> >>>>>
> >>>>
> >>>> Not sure how this will work in an update scenario when a resource does
> >>>> not change and its dependencies do.
> >>>
> >>> We'll always update the requirements, even when the properties don't
> >>> change.
> >>>
> >>
> >> Can you elaborate a bit on rollback.
> > 
> > I didn't do anything special to handle rollback. It's possible that we 
> > need to - obviously the difference in the UpdateReplace + rollback case 
> > is that the replaced resource is now the one we want to keep, and yet 
> > the replaced_by/replaces dependency will force the newer (replacement) 
> > resource to be checked for deletion first, which is an inversion of the 
> > usual order.
> > 
> This is where the version is so handy! For UpdateReplaced ones, there is
> an older version to go back to. This version could just be template ID,
> as I mentioned in another e-mail. All resources are at the current
> template ID if they are found in the current template, even if they is
> no need to update them. Otherwise, they need to be cleaned-up in the
> order given in the previous templates.
> I think the template ID is used as version as far as I can see in Zane's
> PoC. If the resource template key doesn't match the current template
> key, the resource is deleted. The version is misnomer here, but that
> field (template id) is used as though we had versions of resources.
> > However, I tried to think of a scenario where that would cause problems 
> > and I couldn't come up with one. Provided we know the actual, real-world 
> > dependencies of each resource I don't think the ordering of those two 
> > checks matters.
> > 
> > In fact, I currently can't think of a case where the dependency order 
> > between replacement and replaced resources matters at all. It matters in 
> > the current Heat implementation because resources are artificially 
> > segmented into the current and backup stacks, but with a holistic view 
> > of dependencies that may well not be required. I tried taking that line 
> > out of the simulator code and all the tests still passed. If anybody can 
> > think of a scenario in which it would make a difference, I would be very 
> > interested to hear it.
> > 
> > In any event though, it should be no problem to reverse the direction of 
> > that one edge in these particular circumstances if it does turn out to 
> > be a problem.
> > 
> >> We had an approach with depends_on
> >> and needed_by columns in ResourceTable. But dropped it when we figured out
> >> we had too many DB operations for Update.
> > 
> > Yeah, I initially ran into this problem too - you have a bunch of nodes 
> > that are waiting on the current node, and now you have to go look them 
> > all up in the database to see what else they're waiting on in order to 
> > tell if they're ready to be triggered.
> > 
> > It turns out the answer is to distribute the writes but centralise the 
> > reads. So at the start of the update, we read all of the Resources, 
> > obtain their dependencies and build one central graph[1]. We than make 
> > that graph available to each resource (either by passing it as a 
> > notification parameter, or storing it somewhere central in the DB that 
> > they will all have to read anyway, i.e. the Stack). But when we update a 
> > dependency we don't update the central graph, we update the individual 
> > Resource so there's no global lock required.
> > 
> > [1] 
> > https://github.com/zaneb/heat-convergence-prototype/blob/distributed-graph/converge/stack.py#L166-L168
> > 
> A centralized graph and decision making will make the implementation far
> more simpler than distributed. This looks academic, but the simplicity
> beats everything! When each worker has to decide, there needs to be
> lock, only DB transactions are not enough. In contrast, when the
> decision making is centralized, that particular critical section can be
> attempted with transaction and re-attempted if needed.

I'm concerned that we're losing sight of the whole point of convergence.

Yes, concurrency is hard, and state management is really the only thing
hard about concurrency.

What Zane is suggesting is a lock-free approach commonly called 'Read
Copy Update' or "RCU". It has high reader concurrency, but relies on
there only being one writer. It is quite simple, and has proven itself
enough to even be included in the Linux kernel:


> With the distributed approach, I see following drawbacks:
> 1. Every time a resource is done, the peer resources (siblings) are
> checked to see if they are done and the parent is propagated. This
> happens for each resource.

This is slightly inaccurate.

Every time a resource is done, resources that depend on that resource
are checked to see if they still have any unresolved dependencies.

> 2. The worker has to run through all the resources to see if the stack
> is done, to mark it as completed.

If a worker completes a resource which has no dependent resources, it
only needs to check to see if all of the other edges of the graph are
complete to mark the state as complete. There is no full traversal
unless you want to make sure nothing regressed, which is not the way any
of the specs read.

> 3. The decision to converge is made by each worker resulting in lot of
> contention. The centralized graph restricts the contention point to one
> place where we can use DB transactions. It is easier to maintain code
> where particular decisions are made at a place rather than at many
> places.

Why not let multiple workers use DB transactions? The contention happens
_only if it needs to happen to preserve transaction consistency_ instead
of _always_.

> 4. The complex part we are trying to solve is to decide on what to do
> next when a resource is done. With centralized graph, this is abstracted
> out to the DB API. The API will return the next set of nodes. A smart
> SQL query can reduce a lot of logic currently being coded in
> worker/engine.

Having seen many such "smart" SQL queries, I have to say, this is
terrifying. Complexity in database access is by far the single biggest
obstacle to scaling out.

I don't really know why you'd want logic to move into the database. It
is the one place that you must keep simple in order to scale. We can
scale out python like crazy.. but SQL is generally a single threaded,
impossible to debug component. So make the schema obvious and accesses
to it straight forward.

I think we need to land somewhere between the two approaches though.
Here is my idea for DB interaction, I realize now it's been in my head
for a while but I never shared:

CREATE TABLE resource (
  id ...,
  ! all the stuff now
  version int,
  replaced_by int,
  complete_order int,
  primary key (id, version),
  key idx_replaced_by (replaced_by));

CREATE TABLE resource_dependencies (
  id ....,
  version int,
  needed_by ...
  primary key (id, version, needed_by));

Then completion happens something like this:

SELECT @complete_order := MAX(complete_order) FROM resource WHERE stack_id = :stack_id:
SET @complete_order := @complete_order + 1
UPDATE resource SET complete_order = @complete_order, state='COMPLETE' WHERE id=:id: AND version=:version:;
! if there is a replaced_version
UPDATE resource SET replaced_by=:version: WHERE id=:id: AND version=:replaced_version:;
SELECT DISTINCT r.id FROM resource r INNER JOIN resource_dependencies rd
       ON r.id = rd.resource_id AND r.version = rd.version
WHERE r.version=:version: AND rd.needed_by=:id: AND r.state != 'COMPLETE'
for id in results:

Perhaps I've missed some revelation that makes this hard or impossible.
But I don't see a ton of database churn (one update per completion is
meh). I also don't see a lot of complexity in the query. The
complete_order can be used to handle deletes in the right order later
(note that I know that is probably the wrong way to do it and sequences
are a thing that can be used for this).

> 5. What would be the starting point for resource clean-up? The clean-up
> has to start when all the resources are updated. With no centralized
> graph, the DB has to be searched for all the resources with no
> dependencies and with older versions (or having older template keys) and
> start removing them. With centralized graph, this would be a simpler
> with a SQL queries returning what needs to be done. The search space for
> where to start with clean-up will be huge.

"Searched" may be the wrong way. With the table structure above, you can
find everything to delete with this query:

! Outright deletes
SELECT r_old.id
FROM resource r_old LEFT OUTER JOIN resource r_new
  ON r_old.id = r_new.id AND r_old.version = :cleanup_version:
WHERE r_new.id IS NULL OR r_old.replaced_by IS NOT NULL
ORDER BY DESC r.complete_order;

That should delete everything in more or less the right order. I think
for that one you can just delete the rows as they're confirmed deleted
from the plugins, no large transaction needed since we'd not expect
these rows to be updated anymore.

> 6. When engine restarts, the search space on where to start will be
> huge. With a centralized graph, the abstracted API to get next set of
> nodes makes the implementation of decision simpler.
> I am convinced enough that it is simpler to assign the responsibility to
> engine on what needs to be done next. No locks will be required, not
> even resource locks! It is simpler from implementation, understanding
> and maintenance perspective.

I thought you started saying you would need locks, but now saying you
won't. I agree no abstract locking is needed, just a consistent view of
the graph in the DB.

More information about the OpenStack-dev mailing list