[openstack-dev] [Heat] Convergence proof-of-concept showdown

Clint Byrum clint at fewbar.com
Thu Dec 18 17:35:27 UTC 2014

Excerpts from Anant Patil's message of 2014-12-16 07:36:58 -0800:
> On 16-Dec-14 00:59, Clint Byrum wrote:
> > Excerpts from Anant Patil's message of 2014-12-15 07:15:30 -0800:
> >> On 13-Dec-14 05:42, Zane Bitter wrote:
> >>> On 12/12/14 05:29, Murugan, Visnusaran wrote:
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Zane Bitter [mailto:zbitter at redhat.com]
> >>>>> Sent: Friday, December 12, 2014 6:37 AM
> >>>>> To: openstack-dev at lists.openstack.org
> >>>>> Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept
> >>>>> showdown
> >>>>>
> >>>>> On 11/12/14 08:26, Murugan, Visnusaran wrote:
> >>>>>>>> [Murugan, Visnusaran]
> >>>>>>>> In case of rollback where we have to cleanup earlier version of
> >>>>>>>> resources,
> >>>>>>> we could get the order from old template. We'd prefer not to have a
> >>>>>>> graph table.
> >>>>>>>
> >>>>>>> In theory you could get it by keeping old templates around. But that
> >>>>>>> means keeping a lot of templates, and it will be hard to keep track
> >>>>>>> of when you want to delete them. It also means that when starting an
> >>>>>>> update you'll need to load every existing previous version of the
> >>>>>>> template in order to calculate the dependencies. It also leaves the
> >>>>>>> dependencies in an ambiguous state when a resource fails, and
> >>>>>>> although that can be worked around it will be a giant pain to implement.
> >>>>>>>
> >>>>>>
> >>>>>> Agree that looking to all templates for a delete is not good. But
> >>>>>> baring Complexity, we feel we could achieve it by way of having an
> >>>>>> update and a delete stream for a stack update operation. I will
> >>>>>> elaborate in detail in the etherpad sometime tomorrow :)
> >>>>>>
> >>>>>>> I agree that I'd prefer not to have a graph table. After trying a
> >>>>>>> couple of different things I decided to store the dependencies in the
> >>>>>>> Resource table, where we can read or write them virtually for free
> >>>>>>> because it turns out that we are always reading or updating the
> >>>>>>> Resource itself at exactly the same time anyway.
> >>>>>>>
> >>>>>>
> >>>>>> Not sure how this will work in an update scenario when a resource does
> >>>>>> not change and its dependencies do.
> >>>>>
> >>>>> We'll always update the requirements, even when the properties don't
> >>>>> change.
> >>>>>
> >>>>
> >>>> Can you elaborate a bit on rollback.
> >>>
> >>> I didn't do anything special to handle rollback. It's possible that we 
> >>> need to - obviously the difference in the UpdateReplace + rollback case 
> >>> is that the replaced resource is now the one we want to keep, and yet 
> >>> the replaced_by/replaces dependency will force the newer (replacement) 
> >>> resource to be checked for deletion first, which is an inversion of the 
> >>> usual order.
> >>>
> >>
> >> This is where the version is so handy! For UpdateReplaced ones, there is
> >> an older version to go back to. This version could just be template ID,
> >> as I mentioned in another e-mail. All resources are at the current
> >> template ID if they are found in the current template, even if they is
> >> no need to update them. Otherwise, they need to be cleaned-up in the
> >> order given in the previous templates.
> >>
> >> I think the template ID is used as version as far as I can see in Zane's
> >> PoC. If the resource template key doesn't match the current template
> >> key, the resource is deleted. The version is misnomer here, but that
> >> field (template id) is used as though we had versions of resources.
> >>
> >>> However, I tried to think of a scenario where that would cause problems 
> >>> and I couldn't come up with one. Provided we know the actual, real-world 
> >>> dependencies of each resource I don't think the ordering of those two 
> >>> checks matters.
> >>>
> >>> In fact, I currently can't think of a case where the dependency order 
> >>> between replacement and replaced resources matters at all. It matters in 
> >>> the current Heat implementation because resources are artificially 
> >>> segmented into the current and backup stacks, but with a holistic view 
> >>> of dependencies that may well not be required. I tried taking that line 
> >>> out of the simulator code and all the tests still passed. If anybody can 
> >>> think of a scenario in which it would make a difference, I would be very 
> >>> interested to hear it.
> >>>
> >>> In any event though, it should be no problem to reverse the direction of 
> >>> that one edge in these particular circumstances if it does turn out to 
> >>> be a problem.
> >>>
> >>>> We had an approach with depends_on
> >>>> and needed_by columns in ResourceTable. But dropped it when we figured out
> >>>> we had too many DB operations for Update.
> >>>
> >>> Yeah, I initially ran into this problem too - you have a bunch of nodes 
> >>> that are waiting on the current node, and now you have to go look them 
> >>> all up in the database to see what else they're waiting on in order to 
> >>> tell if they're ready to be triggered.
> >>>
> >>> It turns out the answer is to distribute the writes but centralise the 
> >>> reads. So at the start of the update, we read all of the Resources, 
> >>> obtain their dependencies and build one central graph[1]. We than make 
> >>> that graph available to each resource (either by passing it as a 
> >>> notification parameter, or storing it somewhere central in the DB that 
> >>> they will all have to read anyway, i.e. the Stack). But when we update a 
> >>> dependency we don't update the central graph, we update the individual 
> >>> Resource so there's no global lock required.
> >>>
> >>> [1] 
> >>> https://github.com/zaneb/heat-convergence-prototype/blob/distributed-graph/converge/stack.py#L166-L168
> >>>
> >>
> >> A centralized graph and decision making will make the implementation far
> >> more simpler than distributed. This looks academic, but the simplicity
> >> beats everything! When each worker has to decide, there needs to be
> >> lock, only DB transactions are not enough. In contrast, when the
> >> decision making is centralized, that particular critical section can be
> >> attempted with transaction and re-attempted if needed.
> >>
> > 
> > I'm concerned that we're losing sight of the whole point of convergence.
> > 
> > Yes, concurrency is hard, and state management is really the only thing
> > hard about concurrency.
> > 
> > What Zane is suggesting is a lock-free approach commonly called 'Read
> > Copy Update' or "RCU". It has high reader concurrency, but relies on
> > there only being one writer. It is quite simple, and has proven itself
> > enough to even be included in the Linux kernel:
> > 
> > http://lwn.net/Articles/263130/
> > 
> I am afraid Zane is saying the other way: "distribute the writes but
> centralize the reads". Nevertheless, what I meant by having a centralize

Oh I missed that the writes were happening by anybody, but the reads
weren't. Thanks.

> decision making is the same thing you are pointing to. By centralize I
> mean:
> 1. Have the graph in DB, not the edges distributed in the resource
> table. Also, having the graph in DB doesn't mean we need to lock it each
> time we update traversal information or compute next set of ready
> resources/tasks.
> 2. Take the responsibility of what-to-do-next out of worker. This is the
> "writer" part, where upon receiving a notification the engine will
> update the graph in DB and compute the next set of resources to be
> converged (having all dependencies resolved). I agree that there will be
> multiple instances of engine which could execute this section code (this
> code will become the critical section), so in that sense it is not
> centralized. But this approach differs from workers making the decision,
> where, each worker reads from DB and updates the sync point. Instead, if
> the workers send "done" notification to engine, the engine can update
> the graph traversal and issue request for next set of tasks. All the
> workers and observers read from DB, send notifications to engine, while
> the engine writes to DB. This may not be strictly followed. Please also
> note that updating a graph doesn't mean that we have to lock anything.
> As the notifications arrive, the graph is updated, in a TX, and
> re-attempted if the two notifications try to update the same row and the
> TX fails. The workers are *part of* engine but not really *are the
> engine*. Given an instance of engine, and multiple workers (greenlet
> threads), I can see that it turns out to be what you are suggesting
> about RCU. Please correct me if I am wrong.

So this is definitely interesting.

RCU works extremely well if you have a high ratio of reads to writes and
a natural single writer. For instance, if you have a game where there
is a scoring system and no readers do anything that depend on writes
that they cause themselves. The views of the scores are all read and
displayed, but updates to the score can just be thrown into an async
queue and when the writer gets to them, the score increases, and the
players display the new score the next time they read the score.

In this case, you have tightly coupled reads to writes, so I can say now
after thinking about it that RCU is not a great fit. Letting each entity
do an atomic transaction using the DB's row level locks will scale out
better than a single writer pattern.

I do see one advantage to having writes separate from workers. That is
the fact that the Heat database will likely scale differently than the
endpoints the plugins in Heat will. So it may make sense to have 100
workers doing calls to nova/swift/glance/cinder/etc., but you might only
have a database that can reasonably do 10 transactions at once.

This is premature optimization IMO, but it might make sense to have a
disciplined approach to the code so that moving those bits apart later
will be possible.

> I do not insist that it has to be the way I am saying. I am merely
> brainstorming to see if the "centralized" write makes sense in this
> case. I also admit that I do not have performance data, and I am not any
> DB expert.

Indeed, thanks for your thoughts, they are much appreciated. We have
quite a few experts that may be reading this thread so if we say
something stupid (likely) they will hopefully correct us. Until then we
can discuss things logically and test our hypotheses as necessary. :)

> >> With the distributed approach, I see following drawbacks:
> >> 1. Every time a resource is done, the peer resources (siblings) are
> >> checked to see if they are done and the parent is propagated. This
> >> happens for each resource.
> > 
> > This is slightly inaccurate.
> > 
> > Every time a resource is done, resources that depend on that resource
> > are checked to see if they still have any unresolved dependencies.
> > 
> Yes. If we have a graph in DB, we should be able to easily compute this.
> If the edges are distributed and kept along with the resources in the
> resources, we might have to execute multiple queries or keep the graph
> in memory.
> >> 2. The worker has to run through all the resources to see if the stack
> >> is done, to mark it as completed.
> > 
> > If a worker completes a resource which has no dependent resources, it
> > only needs to check to see if all of the other edges of the graph are
> > complete to mark the state as complete. There is no full traversal
> > unless you want to make sure nothing regressed, which is not the way any
> > of the specs read.
> > 
> I agree. For this same reason I am favoring to have the graph in DB. I
> also noted that Zane is not against keeping the graph in DB, but only
> that store the *actual* dependencies in resources (may be physical
> resource ID?). This is fine I think, though we will be looking at graph
> table some times(create/update) and looking at these dependencies in
> resource table at other times (delete/clean-up?). The graph can
> instantly tell if it is done or not when we look at it, but for
> clean-ups and delelts we would have to rely on resource table also.

It's important to get a schema that makes sense, but if need be, we can
adapt as we learn what does and doesn't work. I'd go with the simplest
schema that makes sense at first. IMO one table for one entity is better
than two tables for one entity. 1:1 relationships always seem to be
about optimization, not simplification.

> >> 3. The decision to converge is made by each worker resulting in lot of
> >> contention. The centralized graph restricts the contention point to one
> >> place where we can use DB transactions. It is easier to maintain code
> >> where particular decisions are made at a place rather than at many
> >> places.
> > 
> > Why not let multiple workers use DB transactions? The contention happens
> > _only if it needs to happen to preserve transaction consistency_ instead
> > of _always_.
> > 
> Sure! When the workers are done with the current resource they can
> update the DB and pick-up the parent if it is ready. The DB interactions
> can happen as a TX. But then there would be no
> one-writer-multiple-reader if we follow this. All the workers write and
> read.
> >> 4. The complex part we are trying to solve is to decide on what to do
> >> next when a resource is done. With centralized graph, this is abstracted
> >> out to the DB API. The API will return the next set of nodes. A smart
> >> SQL query can reduce a lot of logic currently being coded in
> >> worker/engine.
> > 
> > Having seen many such "smart" SQL queries, I have to say, this is
> > terrifying. Complexity in database access is by far the single biggest
> > obstacle to scaling out.
> > 
> > I don't really know why you'd want logic to move into the database. It
> > is the one place that you must keep simple in order to scale. We can
> > scale out python like crazy.. but SQL is generally a single threaded,
> > impossible to debug component. So make the schema obvious and accesses
> > to it straight forward.
> > 
> > I think we need to land somewhere between the two approaches though.
> > Here is my idea for DB interaction, I realize now it's been in my head
> > for a while but I never shared:
> > 
> > CREATE TABLE resource (
> >   id ...,
> >   ! all the stuff now
> >   version int,
> >   replaced_by int,
> >   complete_order int,
> >   primary key (id, version),
> >   key idx_replaced_by (replaced_by));
> > 
> > CREATE TABLE resource_dependencies (
> >   id ....,
> >   version int,
> >   needed_by ...
> >   primary key (id, version, needed_by));
> > 
> > Then completion happens something like this:
> > 
> > SELECT @complete_order := MAX(complete_order) FROM resource WHERE stack_id = :stack_id:
> > SET @complete_order := @complete_order + 1
> > UPDATE resource SET complete_order = @complete_order, state='COMPLETE' WHERE id=:id: AND version=:version:;
> > ! if there is a replaced_version
> > UPDATE resource SET replaced_by=:version: WHERE id=:id: AND version=:replaced_version:;
> > SELECT DISTINCT r.id FROM resource r INNER JOIN resource_dependencies rd
> >        ON r.id = rd.resource_id AND r.version = rd.version
> > WHERE r.version=:version: AND rd.needed_by=:id: AND r.state != 'COMPLETE'
> > <python>
> > for id in results:
> >   convergequeue.submit(id)
> > </python>
> > 
> > Perhaps I've missed some revelation that makes this hard or impossible.
> > But I don't see a ton of database churn (one update per completion is
> > meh). I also don't see a lot of complexity in the query. The
> > complete_order can be used to handle deletes in the right order later
> > (note that I know that is probably the wrong way to do it and sequences
> > are a thing that can be used for this).
> > 
> >> 5. What would be the starting point for resource clean-up? The clean-up
> >> has to start when all the resources are updated. With no centralized
> >> graph, the DB has to be searched for all the resources with no
> >> dependencies and with older versions (or having older template keys) and
> >> start removing them. With centralized graph, this would be a simpler
> >> with a SQL queries returning what needs to be done. The search space for
> >> where to start with clean-up will be huge.
> > 
> > "Searched" may be the wrong way. With the table structure above, you can
> > find everything to delete with this query:
> > 
> > ! Outright deletes
> > SELECT r_old.id
> > FROM resource r_old LEFT OUTER JOIN resource r_new
> >   ON r_old.id = r_new.id AND r_old.version = :cleanup_version:
> > WHERE r_new.id IS NULL OR r_old.replaced_by IS NOT NULL
> > ORDER BY DESC r.complete_order;
> > 
> > That should delete everything in more or less the right order. I think
> > for that one you can just delete the rows as they're confirmed deleted
> > from the plugins, no large transaction needed since we'd not expect
> > these rows to be updated anymore.
> > 
> Well, I meant simple SQL queries where we JOIN the graph table and
> resource table to see if a resource can be taken up for convergence. It
> is possible that the graph says a resource is ready since it has all the
> dependencies stisfied, but have a previous version still in progress
> from previous update. By smart, I never meant any complex queries, but
> only the ones which solve _our problem_. The queries which you have
> suggested above is what I meant. I was not sure we were using the DB in
> the way you are suggesting, hence called for utilizing it in a better
> way.

Understood, and thanks for enduring my fear of SQL logic bombs. :)

> >> 6. When engine restarts, the search space on where to start will be
> >> huge. With a centralized graph, the abstracted API to get next set of
> >> nodes makes the implementation of decision simpler.
> >>
> >> I am convinced enough that it is simpler to assign the responsibility to
> >> engine on what needs to be done next. No locks will be required, not
> >> even resource locks! It is simpler from implementation, understanding
> >> and maintenance perspective.
> >>
> > 
> > I thought you started saying you would need locks, but now saying you
> > won't.
> No. We wanted to get rid of the stack lock we were using to avoid the
> concurrency issues, and you had suggested using DB transactions. We had
> other ideas but DB TX looked cleaner to us and we are proceeding with
> it.

Awesome. :)

More information about the OpenStack-dev mailing list