[openstack-dev] [Heat] Convergence proof-of-concept showdown
Zane Bitter
zbitter at redhat.com
Tue Dec 16 04:11:41 UTC 2014
On 15/12/14 09:32, Anant Patil wrote:
> On 12-Dec-14 06:29, Zane Bitter wrote:
>> On 11/12/14 01:14, Anant Patil wrote:
>>> On 04-Dec-14 10:49, Zane Bitter wrote:
>>>> On 01/12/14 02:02, Anant Patil wrote:
>>>>> On GitHub:https://github.com/anantpatil/heat-convergence-poc
>>>>
>>>> I'm trying to review this code at the moment, and finding some stuff I
>>>> don't understand:
>>>>
>>>> https://github.com/anantpatil/heat-convergence-poc/blob/master/heat/engine/stack.py#L911-L916
>>>>
>>>> This appears to loop through all of the resources *prior* to kicking off
>>>> any actual updates to check if the resource will change. This is
>>>> impossible to do in general, since a resource may obtain a property
>>>> value from an attribute of another resource and there is no way to know
>>>> whether an update to said other resource would cause a change in the
>>>> attribute value.
>>>>
>>>> In addition, no attempt to catch UpdateReplace is made. Although that
>>>> looks like a simple fix, I'm now worried about the level to which this
>>>> code has been tested.
>>>>
>>> We were working on new branch and as we discussed on Skype, we have
>>> handled all these cases. Please have a look at our current branch:
>>> https://github.com/anantpatil/heat-convergence-poc/tree/graph-version
>>>
>>> When a new resource is taken for convergence, its children are loaded
>>> and the resource definition is re-parsed. The frozen resource definition
>>> will have all the "get_attr" resolved.
>>>
>>>>
>>>> I'm also trying to wrap my head around how resources are cleaned up in
>>>> dependency order. If I understand correctly, you store in the
>>>> ResourceGraph table the dependencies between various resource names in
>>>> the current template (presumably there could also be some left around
>>>> from previous templates too?). For each resource name there may be a
>>>> number of rows in the Resource table, each with an incrementing version.
>>>> As far as I can tell though, there's nowhere that the dependency graph
>>>> for _previous_ templates is persisted? So if the dependency order
>>>> changes in the template we have no way of knowing the correct order to
>>>> clean up in any more? (There's not even a mechanism to associate a
>>>> resource version with a particular template, which might be one avenue
>>>> by which to recover the dependencies.)
>>>>
>>>> I think this is an important case we need to be able to handle, so I
>>>> added a scenario to my test framework to exercise it and discovered that
>>>> my implementation was also buggy. Here's the fix:
>>>> https://github.com/zaneb/heat-convergence-prototype/commit/786f367210ca0acf9eb22bea78fd9d51941b0e40
>>>>
>>>
>>> Thanks for pointing this out Zane. We too had a buggy implementation for
>>> handling inverted dependency. I had a hard look at our algorithm where
>>> we were continuously merging the edges from new template into the edges
>>> from previous updates. It was an optimized way of traversing the graph
>>> in both forward and reverse order with out missing any resources. But,
>>> when the dependencies are inverted, this wouldn't work.
>>>
>>> We have changed our algorithm. The changes in edges are noted down in
>>> DB, only the delta of edges from previous template is calculated and
>>> kept. At any given point of time, the graph table has all the edges from
>>> current template and delta from previous templates. Each edge has
>>> template ID associated with it.
>>
>> The thing is, the cleanup dependencies aren't really about the template.
>> The real resources really depend on other real resources. You can't
>> delete a Volume before its VolumeAttachment, not because it says so in
>> the template but because it will fail if you try. The template can give
>> us a rough guide in advance to what those dependencies will be, but if
>> that's all we keep then we are discarding information.
>>
>> There may be multiple versions of a resource corresponding to one
>> template version. Even worse, the actual dependencies of a resource
>> change on a smaller time scale than an entire stack update (this is the
>> reason the current implementation updates the template one resource at a
>> time as we go).
>>
>
> Absolutely! The edges from the template are kept only for the reference
> purposes. When we have a resource in new template, its template ID will
> also be marked to current template. At any point of time, realized
> resource will from current template, even if it were found in previous
> templates. The template ID "moves" for a resource if it is found.
In theory (disclaimer: I didn't implement this yet) it can change on an
even smaller timescale than that. The existing plugins are something of
a black box to us: if a failure occurs we don't necessarily know whether
the real-world dependency is on the old or new version of another resource.
>> Given that our Resource entries in the DB are in 1:1 correspondence with
>> actual resources (we create a new one whenever we need to replace the
>> underlying resource), I found it makes the most conceptual and practical
>> sense to store the requirements in the resource itself, and update them
>> at the time they actually change in the real world (bonus: introduces no
>> new locking issues and no extra DB writes). I settled on this after a
>> legitimate attempt at trying other options, but they didn't work out:
>> https://github.com/zaneb/heat-convergence-prototype/commit/a62958342e8583f74e2aca90f6239ad457ba984d
>>
>
> I am okay with the notion of graph stored in resource table.
>
>>> For resource clean up, we start from the
>>> first template (template which was completed and updates were made on
>>> top of it, empty template otherwise), and move towards the current
>>> template in the order in which the updates were issued, and for each
>>> template the graph (edges if found for the template) is traversed in
>>> reverse order and resources are cleaned-up.
>>
>> I'm pretty sure this is backwards - you'll need to clean up newer
>> resources first because they may reference resources from older
>> templates. Also if you have a stubborn old resource that won't delete
>> you don't want that to block cleanups of anything newer.
>>
>> You're also serialising some of the work unnecessarily because you've
>> discarded the information about dependencies that cross template
>> versions, forcing you to clean up only one template version at a time.
>>
>
> Since only the changed set is stored, it is going to be smaller in most
> of the cases. Within each previous template, if there are any edges,
> they are tried for clean-up in reverse order. If a resource is not
> updated, but available in new template, the template ID will be current
> template ID as I mentioned above. With the template, the process is
> concurrent, but across the templates, it will be serial. I am okay with
> this since in real world, the change in dependencies may not be as much
> as we think.
I agree the serialisation probably won't make a big difference in most
cases, although autoscaling is an obvious and important one where
serialising a lot of things unnecessarily could have a big impact on the
user.
>>> The process ends up with
>>> current template being traversed in reverse order and resources being
>>> cleaned up. All the update-replaced resources from the older templates
>>> (older updates in concurrent updates) are cleaned up in the order in
>>> which they are suppose to be.
>>>
>>> Resources are now tied to template, they have a template_id instead of
>>> version. As we traverse the graph, we know which template we are working
>>> on, and can take the relevant action on resource.
>>>
>>> For rollback, another update is issued with the last completed template
>>> (it is designed to have an empty template as last completed template by
>>> default). The current template being worked on becomes predecessor for
>>> the new incoming template. In case of rollback, the last completed
>>> template becomes incoming new template, the current becomes the new
>>> template's predecessor and the successor of last completed template will
>>> have no predecessor. All these changes are available in the
>>> 'graph-version' branch. (The branch name is a misnomer though!)
>>>
>>> I think it is simpler to think about stack and concurrent updates when
>>> we associate resources and edges with template, and stack with current
>>> template and its predecessors (if any).
>>
>> It doesn't seem simple to me because it's trying to reconstruct reality
>> from a lossy version of history. The simplest way to think about it, in
>> my opinion is this:
>> - When updating resources, respect their dependencies as given in the
>> template
>> - When checking resources to clean up, respect their actual, current
>> real-world dependencies, and check replacement resources before the
>> resources that they replaced.
>> - Don't check a resource for clean up until it has been updated to the
>> latest template.
>>
>>> I also think that we should decouple Resource from Stack. This is really
>>> a hindrance when workers work on individual resources. The resource
>>> should be abstracted enough from stack for the worker to work on the
>>> resource alone. The worker should load the required resource plug-in and
>>> start converging.
>>
>> I think that's a worthy goal, and it would be really nice if we could
>> load a Resource completely independently of its Stack, and I know this
>> has always been a design goal of yours (hence you're caching the
>> resource definition in the Resource rather than getting it from the
>> template).
>>
>> That said, I am convinced it's an unachievable goal, and I believe we
>> should give up on it.
>>
>> - We'll always need to load _some_ central thing (e.g. to find out if
>> the current traversal is still the valid one), so it might as well be
>> the Stack.
>> - Existing plugin abominations like HARestarter expect a working Stack
>> object to be provided so it can go hunting for other resources.
>>
>> I think the best we can do is try to make heat.engine.stack.Stack as
>> lazy as possible so that it only does extra work when strictly required,
>> and just accept that the stack will always be loaded from the database.
>>
>> I am also strongly in favour of treating the idea of caching the
>> unresolved resource definition in the Resource table as a straight
>> performance optimisation that is completely separate to the convergence
>> work. It's going to be inevitably ugly because there is no
>> template-format-independent way to serialise a resource definition
>> (while resource definition objects themselves are designed to be
>> inherently template-format-independent). Once phase 1 is complete we can
>> decide whether it's worth it based on measuring the actual performance
>> improvement.
>>
>
> The idea of keeping the resource definition in DB was to decouple from
> the stack as much as possible. It is not for performance optimization.
OK, that's fair, but then you said...
> When a worker job is received, the worker won't have to load the entire
> template from the stack, get the realized definitions etc.
...that's a performance optimisation.
> The worker
> simply loads the resource with definition with minimal stack (lazily
> loaded stack), and starts working on it. This work is towards making
> stack lazily loaded.
>
> Like you said, the stack has to be minimal when it is loaded. There will
> be few information needed by the worker (like resource definition and
> requirers/requirements) which can be obtained without loading the
> template and recalculating the graph again and again.
None of this changes the fundamental design, so let's get the important
stuff done first and when that's finished we can see if this makes a
meaningful (positive) difference.
>> (Note that we _already_ store the _resolved_ properties of the resource,
>> which is what the observer will be comparing against for phase 2, so
>> there should be no reason for the observer to need to load the stack.)
>>
>>> The READEME.rst is really helpful for bringing up the minimal devstack
>>> and test the PoC. I also has some notes on design.
>>>
>> [snip]
>>>
>>> Zane, I have few questions:
>>> 1. Our current implementation is based on notifications from worker so
>>> that the engine can take up next set of tasks. I don't see this in your
>>> case. I think we should be doing this. It gels well with observer
>>> notification mechanism. When the observer comes, it would send a
>>> converge notification. Both, the provisioning of stack and the
>>> continuous observation, happens with notifications (async message
>>> passing). I see that the workers in your case pick up the parent when/if
>>> it is done and schedules it or updates the sync point.
>>
>> I'm not quite sure what you're asking here, so forgive me if I'm
>> misunderstanding. What I think you're saying is that where my prototype
>> propagates notifications thus:
>>
>> worker -> worker
>> -> worker
>>
>> (where -> is an async message)
>> you would prefer it to do:
>>
>> worker -> engine -> worker
>> -> worker
>>
>> Is that right?
>>
>> To me the distinction seems somewhat academic, given that we've decided
>> that the engine and the worker will be the same process. I don't see a
>> disadvantage to doing right away stuff that we know needs to be done
>> right away. Obviously we should factor the code out tidily into a
>> separate method where we can _also_ expose it as a notification that can
>> be triggered by the continuous observer.
>>
>> You mentioned above that you thought the workers should not ever load
>> the Stack, and I think that's probably the reason you favour this
>> approach: the 'worker' would always load just the Resource and the
>> 'engine' (even though they're really the same) would always load just
>> the Stack, right?
>>
>> However, as I mentioned above, I think we'll want/have to load the Stack
>> in the worker anyway, so eliminating the extra asynchronous call
>> eliminates the performance penalty for having to do so.
>>
>
> When we started, we had the following idea:
>
>
> (provisions with
> converge API) (Observe)
> Engine ------------------> Worker --------------> Observer
> | | | |
> | | | |
> ^ (done) v ^ (converge) v
> ------------------------ ------------------------
>
> The boundary has to be clearly demarcated, as done above. While
> provisioning a stack, the Engine uses the converge facility, as does the
> observer when it sees a need to converge a resource.
>
> Even though we have these logical things in same process, by not clearly
> demarcating the responsibilities of each logical entity, we will end-up
> with issues which we face when we mix the responsibilities.
Sure, and that's why we don't write all the code in one big method, but
it doesn't follow that it has to be in a different process. Or in this
case, since they're actually the same process anyway, it doesn't imply
that they have to be called over RPC.
> As you mentioned, the worker should have a notification API exposed
> which can be used not only by the Observer, but also by the engine to
> provision (CRUD) a stack resource.
yep
> The engine decides on which resource
> to be converged next when a resource is done.
Who says it has to? Each resource knows what depends on it... let it
send the notifications directly. The engine is just there to do whatever
setup requires everything to be loaded at once and then kick things off
by starting the worker tasks for resources that don't depend on anything.
> The observer is only
> responsible for once resource and issues a converge request if it has
> to.
>
>>> 2. The dependency graph travels everywhere. IMHO, we can keep the graph
>>> in DB and let the workers work on a resource, and engine decide which
>>> one to be scheduled next by looking at the graph. There wouldn't be a
>>> need for a lock here, in the engine, the DB transactions should take
>>> care of concurrent DB updates. Our new PoC follows this model.
>>
>> I'm fine with keeping the graph in the DB instead of having it flow with
>> the notifications.
>>
>>> 3. The request ID is passed down to check_*_complete. Would the check
>>> method be interrupted if new request arrives? IMHO, the check method
>>> should not get interrupted. It should return back when the resource has
>>> reached a concrete state, either failed or completed.
>>
>> I agree, it should not be interrupted.
>>
>> I've started to think of phase 1 and phase 2 like this:
>>
>> 1) Make locks more granular: stack-level lock becomes resource-level
>> 2) Get rid of locks altogether
>>
>
> I thought that the DB transactions were enough for concurrency issues.
> The resources status is enough to know whether to trigger the update or
> not. For resources currently in progress, the engine can wait for
> notifications from then and trigger the update. With completely
> distributed graph and decision making it is hard to achieve. If the
> graph is stored in DB, the decision to trigger update is at one place
> (by looking at the graph only) and hence locks are not required.
Right, right, but it's still a lock (maybe I should have said mutex)
because only one traversal can be operating on each resource at a time.
We're just arguing semantics now. It doesn't matter whether you do
select-for-update on the Resource table to mark it as in progress or if
you do select-for-update on a separate Lock table to mark it as in
progress, only one update can be working on a given resource at a time,
and this is a big improvement in granularity because before only one
update could be working on the whole stack at a time, but not as big an
improvement as phase 2 will be.
>> So in phase 1 we'll lock the resources and like you said, it will return
>> back when it has reached a concrete state. In phase 2 we'll be able to
>> just update the goal state for the resource and the observe/converge
>> process will be able to automagically find the best way to that state
>> regardless of whether it was in the middle of transitioning to another
>> state or not. Or something. But that's for the future :)
>>
>
> I am thinking about not having any locks at all. The concurrency is
> handled by DB transaction and notifications gives us hint on deciding
> when and whether to update the next set of resources. If a resource is
> in progress currently, and an update is issued on that resource, the
> graph API doesn't return it as a ready-to-be-converged resource as a
> previous version is in progress. When it is done, and notification is
> received, the graph DB API returns it as next to converge.
So sort of a hybrid with Michal's approach.
I think I get what you're saying here. If we reach a particular resource
in the traverse and it is still in-progress from a previous update we
can ignore it and use the notification from the previous update
finishing to trigger it later, provided that the notifications go
through the engine because the in-progress resource can't possibly know
anything about traversals that started later.
Or on the other hand we could just sleep for a while until it's done and
then carry on.
>>> 4. Lot of synchronization issues which we faced in our PoC cannot be
>>> encountered with the framework. How do we evaluate what happens when
>>> synchronization issues are encountered (like stack lock kind of issues
>>> which we are replacing with DB transaction).
>>
>> Right, yeah, this is obviously the big known limitation of the
>> simulator. I don't have a better answer other than to Think Very Hard
>> about it.
>>
>> Designing software means solving for hundreds of constraints - too many
>> for a human to hold in their brain at the same time. The purpose of
>> prototyping is to fix enough of the responses to those constraints in a
>> concrete form to allow reasoning about the remaining ones to become
>> tractable. If you fix solutions for *all* of the constraints, then what
>> you've built is by definition not a prototype but the final product.
>>
>> One technique available to us is to encapsulate the parts of the
>> algorithm that are subject to synchronisation issues behind abstractions
>> that offer stronger guarantees. Then in order to have confidence in the
>> design we need only satisfy ourselves that we have analysed the
>> guarantees correctly and that a concrete implementation offering those
>> same guarantees is possible. For example, the SyncPoints are shown to
>> work under the assumption that they are not subject to race conditions,
>> and the SyncPoint code is small enough that we can easily see that it
>> can be implemented in an atomic fashion using the same DB primitives
>> already proven to work by StackLock. Therefore we can have a very high
>> confidence (but not proof) that the overall algorithm will work when
>> implemented in the final product.
>>
>> Having Thought Very Hard about it, I'm as confident as I can be that I'm
>> not relying on any synchronisation properties that can't be implemented
>> using select-for-update on a single database row. There will of course
>> be surprises at implementation time, but I hope that won't be one of
>> them and anticipate that any changes required to the plan will be
>> localised and not wide-ranging.
>>
>> (This is in contrast BTW to my centralised-graph branch, linked above,
>> where it became very obvious that it would require some sort of external
>> locking - so there is reason to think that this process can reveal
>> architectural problems related to synchronisation where they are present.)
>>
>> cheers,
>> Zane.
More information about the OpenStack-dev
mailing list