Open Stack

Tue Dec 16 15:36:58 UTC 2014

On 16-Dec-14 00:59, Clint Byrum wrote:
> Excerpts from Anant Patil's message of 2014-12-15 07:15:30 -0800:
>> On 13-Dec-14 05:42, Zane Bitter wrote:
>>> On 12/12/14 05:29, Murugan, Visnusaran wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Zane Bitter [mailto:zbitter at redhat.com]
>>>>> Sent: Friday, December 12, 2014 6:37 AM
>>>>> To: openstack-dev at lists.openstack.org
>>>>> Subject: Re: [openstack-dev] [Heat] Convergence proof-of-concept
>>>>> showdown
>>>>>
>>>>> On 11/12/14 08:26, Murugan, Visnusaran wrote:
>>>>>>>> [Murugan, Visnusaran]
>>>>>>>> In case of rollback where we have to cleanup earlier version of
>>>>>>>> resources,
>>>>>>> we could get the order from old template. We'd prefer not to have a
>>>>>>> graph table.
>>>>>>>
>>>>>>> In theory you could get it by keeping old templates around. But that
>>>>>>> means keeping a lot of templates, and it will be hard to keep track
>>>>>>> of when you want to delete them. It also means that when starting an
>>>>>>> update you'll need to load every existing previous version of the
>>>>>>> template in order to calculate the dependencies. It also leaves the
>>>>>>> dependencies in an ambiguous state when a resource fails, and
>>>>>>> although that can be worked around it will be a giant pain to implement.
>>>>>>>
>>>>>>
>>>>>> Agree that looking to all templates for a delete is not good. But
>>>>>> baring Complexity, we feel we could achieve it by way of having an
>>>>>> update and a delete stream for a stack update operation. I will
>>>>>> elaborate in detail in the etherpad sometime tomorrow :)
>>>>>>
>>>>>>> I agree that I'd prefer not to have a graph table. After trying a
>>>>>>> couple of different things I decided to store the dependencies in the
>>>>>>> Resource table, where we can read or write them virtually for free
>>>>>>> because it turns out that we are always reading or updating the
>>>>>>> Resource itself at exactly the same time anyway.
>>>>>>>
>>>>>>
>>>>>> Not sure how this will work in an update scenario when a resource does
>>>>>> not change and its dependencies do.
>>>>>
>>>>> We'll always update the requirements, even when the properties don't
>>>>> change.
>>>>>
>>>>
>>>> Can you elaborate a bit on rollback.
>>>
>>> I didn't do anything special to handle rollback. It's possible that we 
>>> need to - obviously the difference in the UpdateReplace + rollback case 
>>> is that the replaced resource is now the one we want to keep, and yet 
>>> the replaced_by/replaces dependency will force the newer (replacement) 
>>> resource to be checked for deletion first, which is an inversion of the 
>>> usual order.
>>>
>>
>> This is where the version is so handy! For UpdateReplaced ones, there is
>> an older version to go back to. This version could just be template ID,
>> as I mentioned in another e-mail. All resources are at the current
>> template ID if they are found in the current template, even if they is
>> no need to update them. Otherwise, they need to be cleaned-up in the
>> order given in the previous templates.
>>
>> I think the template ID is used as version as far as I can see in Zane's
>> PoC. If the resource template key doesn't match the current template
>> key, the resource is deleted. The version is misnomer here, but that
>> field (template id) is used as though we had versions of resources.
>>
>>> However, I tried to think of a scenario where that would cause problems 
>>> and I couldn't come up with one. Provided we know the actual, real-world 
>>> dependencies of each resource I don't think the ordering of those two 
>>> checks matters.
>>>
>>> In fact, I currently can't think of a case where the dependency order 
>>> between replacement and replaced resources matters at all. It matters in 
>>> the current Heat implementation because resources are artificially 
>>> segmented into the current and backup stacks, but with a holistic view 
>>> of dependencies that may well not be required. I tried taking that line 
>>> out of the simulator code and all the tests still passed. If anybody can 
>>> think of a scenario in which it would make a difference, I would be very 
>>> interested to hear it.
>>>
>>> In any event though, it should be no problem to reverse the direction of 
>>> that one edge in these particular circumstances if it does turn out to 
>>> be a problem.
>>>
>>>> We had an approach with depends_on
>>>> and needed_by columns in ResourceTable. But dropped it when we figured out
>>>> we had too many DB operations for Update.
>>>
>>> Yeah, I initially ran into this problem too - you have a bunch of nodes 
>>> that are waiting on the current node, and now you have to go look them 
>>> all up in the database to see what else they're waiting on in order to 
>>> tell if they're ready to be triggered.
>>>
>>> It turns out the answer is to distribute the writes but centralise the 
>>> reads. So at the start of the update, we read all of the Resources, 
>>> obtain their dependencies and build one central graph[1]. We than make 
>>> that graph available to each resource (either by passing it as a 
>>> notification parameter, or storing it somewhere central in the DB that 
>>> they will all have to read anyway, i.e. the Stack). But when we update a 
>>> dependency we don't update the central graph, we update the individual 
>>> Resource so there's no global lock required.
>>>
>>> [1] 
>>> https://github.com/zaneb/heat-convergence-prototype/blob/distributed-graph/converge/stack.py#L166-L168
>>>
>>
>> A centralized graph and decision making will make the implementation far
>> more simpler than distributed. This looks academic, but the simplicity
>> beats everything! When each worker has to decide, there needs to be
>> lock, only DB transactions are not enough. In contrast, when the
>> decision making is centralized, that particular critical section can be
>> attempted with transaction and re-attempted if needed.
>>
> 
> I'm concerned that we're losing sight of the whole point of convergence.
> 
> Yes, concurrency is hard, and state management is really the only thing
> hard about concurrency.
> 
> What Zane is suggesting is a lock-free approach commonly called 'Read
> Copy Update' or "RCU". It has high reader concurrency, but relies on
> there only being one writer. It is quite simple, and has proven itself
> enough to even be included in the Linux kernel:
> 
> http://lwn.net/Articles/263130/
> 
I am afraid Zane is saying the other way: "distribute the writes but
centralize the reads". Nevertheless, what I meant by having a centralize
decision making is the same thing you are pointing to. By centralize I
mean:
1. Have the graph in DB, not the edges distributed in the resource
table. Also, having the graph in DB doesn't mean we need to lock it each
time we update traversal information or compute next set of ready
resources/tasks.
2. Take the responsibility of what-to-do-next out of worker. This is the
"writer" part, where upon receiving a notification the engine will
update the graph in DB and compute the next set of resources to be
converged (having all dependencies resolved). I agree that there will be
multiple instances of engine which could execute this section code (this
code will become the critical section), so in that sense it is not
centralized. But this approach differs from workers making the decision,
where, each worker reads from DB and updates the sync point. Instead, if
the workers send "done" notification to engine, the engine can update
the graph traversal and issue request for next set of tasks. All the
workers and observers read from DB, send notifications to engine, while
the engine writes to DB. This may not be strictly followed. Please also
note that updating a graph doesn't mean that we have to lock anything.
As the notifications arrive, the graph is updated, in a TX, and
re-attempted if the two notifications try to update the same row and the
TX fails. The workers are *part of* engine but not really *are the
engine*. Given an instance of engine, and multiple workers (greenlet
threads), I can see that it turns out to be what you are suggesting
about RCU. Please correct me if I am wrong.

I do not insist that it has to be the way I am saying. I am merely
brainstorming to see if the "centralized" write makes sense in this
case. I also admit that I do not have performance data, and I am not any
DB expert.

>> With the distributed approach, I see following drawbacks:
>> 1. Every time a resource is done, the peer resources (siblings) are
>> checked to see if they are done and the parent is propagated. This
>> happens for each resource.
> 
> This is slightly inaccurate.
> 
> Every time a resource is done, resources that depend on that resource
> are checked to see if they still have any unresolved dependencies.
> 

Yes. If we have a graph in DB, we should be able to easily compute this.
If the edges are distributed and kept along with the resources in the
resources, we might have to execute multiple queries or keep the graph
in memory.

>> 2. The worker has to run through all the resources to see if the stack
>> is done, to mark it as completed.
> 
> If a worker completes a resource which has no dependent resources, it
> only needs to check to see if all of the other edges of the graph are
> complete to mark the state as complete. There is no full traversal
> unless you want to make sure nothing regressed, which is not the way any
> of the specs read.
> 

I agree. For this same reason I am favoring to have the graph in DB. I
also noted that Zane is not against keeping the graph in DB, but only
that store the *actual* dependencies in resources (may be physical
resource ID?). This is fine I think, though we will be looking at graph
table some times(create/update) and looking at these dependencies in
resource table at other times (delete/clean-up?). The graph can
instantly tell if it is done or not when we look at it, but for
clean-ups and delelts we would have to rely on resource table also.

>> 3. The decision to converge is made by each worker resulting in lot of
>> contention. The centralized graph restricts the contention point to one
>> place where we can use DB transactions. It is easier to maintain code
>> where particular decisions are made at a place rather than at many
>> places.
> 
> Why not let multiple workers use DB transactions? The contention happens
> _only if it needs to happen to preserve transaction consistency_ instead
> of _always_.
> 

Sure! When the workers are done with the current resource they can
update the DB and pick-up the parent if it is ready. The DB interactions
can happen as a TX. But then there would be no
one-writer-multiple-reader if we follow this. All the workers write and
read.

>> 4. The complex part we are trying to solve is to decide on what to do
>> next when a resource is done. With centralized graph, this is abstracted
>> out to the DB API. The API will return the next set of nodes. A smart
>> SQL query can reduce a lot of logic currently being coded in
>> worker/engine.
> 
> Having seen many such "smart" SQL queries, I have to say, this is
> terrifying. Complexity in database access is by far the single biggest
> obstacle to scaling out.
> 
> I don't really know why you'd want logic to move into the database. It
> is the one place that you must keep simple in order to scale. We can
> scale out python like crazy.. but SQL is generally a single threaded,
> impossible to debug component. So make the schema obvious and accesses
> to it straight forward.
> 
> I think we need to land somewhere between the two approaches though.
> Here is my idea for DB interaction, I realize now it's been in my head
> for a while but I never shared:
> 
> CREATE TABLE resource (
>   id ...,
>   ! all the stuff now
>   version int,
>   replaced_by int,
>   complete_order int,
>   primary key (id, version),
>   key idx_replaced_by (replaced_by));
> 
> CREATE TABLE resource_dependencies (
>   id ....,
>   version int,
>   needed_by ...
>   primary key (id, version, needed_by));
> 
> Then completion happens something like this:
> 
> BEGIN
> SELECT @complete_order := MAX(complete_order) FROM resource WHERE stack_id = :stack_id:
> SET @complete_order := @complete_order + 1
> UPDATE resource SET complete_order = @complete_order, state='COMPLETE' WHERE id=:id: AND version=:version:;
> ! if there is a replaced_version
> UPDATE resource SET replaced_by=:version: WHERE id=:id: AND version=:replaced_version:;
> SELECT DISTINCT r.id FROM resource r INNER JOIN resource_dependencies rd
>        ON r.id = rd.resource_id AND r.version = rd.version
> WHERE r.version=:version: AND rd.needed_by=:id: AND r.state != 'COMPLETE'
> <python>
> for id in results:
>   convergequeue.submit(id)
> </python>
> COMMIT
> 
> Perhaps I've missed some revelation that makes this hard or impossible.
> But I don't see a ton of database churn (one update per completion is
> meh). I also don't see a lot of complexity in the query. The
> complete_order can be used to handle deletes in the right order later
> (note that I know that is probably the wrong way to do it and sequences
> are a thing that can be used for this).
> 
>> 5. What would be the starting point for resource clean-up? The clean-up
>> has to start when all the resources are updated. With no centralized
>> graph, the DB has to be searched for all the resources with no
>> dependencies and with older versions (or having older template keys) and
>> start removing them. With centralized graph, this would be a simpler
>> with a SQL queries returning what needs to be done. The search space for
>> where to start with clean-up will be huge.
> 
> "Searched" may be the wrong way. With the table structure above, you can
> find everything to delete with this query:
> 
> ! Outright deletes
> SELECT r_old.id
> FROM resource r_old LEFT OUTER JOIN resource r_new
>   ON r_old.id = r_new.id AND r_old.version = :cleanup_version:
> WHERE r_new.id IS NULL OR r_old.replaced_by IS NOT NULL
> ORDER BY DESC r.complete_order;
> 
> That should delete everything in more or less the right order. I think
> for that one you can just delete the rows as they're confirmed deleted
> from the plugins, no large transaction needed since we'd not expect
> these rows to be updated anymore.
> 

Well, I meant simple SQL queries where we JOIN the graph table and
resource table to see if a resource can be taken up for convergence. It
is possible that the graph says a resource is ready since it has all the
dependencies stisfied, but have a previous version still in progress
from previous update. By smart, I never meant any complex queries, but
only the ones which solve _our problem_. The queries which you have
suggested above is what I meant. I was not sure we were using the DB in
the way you are suggesting, hence called for utilizing it in a better
way.

>> 6. When engine restarts, the search space on where to start will be
>> huge. With a centralized graph, the abstracted API to get next set of
>> nodes makes the implementation of decision simpler.
>>
>> I am convinced enough that it is simpler to assign the responsibility to
>> engine on what needs to be done next. No locks will be required, not
>> even resource locks! It is simpler from implementation, understanding
>> and maintenance perspective.
>>
> 
> I thought you started saying you would need locks, but now saying you
> won't.

No. We wanted to get rid of the stack lock we were using to avoid the
concurrency issues, and you had suggested using DB transactions. We had
other ideas but DB TX looked cleaner to us and we are proceeding with
it.

> I agree no abstract locking is needed, just a consistent view of
> the graph in the DB.
> 

+1.

> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> 

Open Stack

[openstack-dev] [Heat] Convergence proof-of-concept showdown

OpenStack

Community

Documentation

Branding & Legal