[openstack-dev] [Heat] Convergence proof-of-concept showdown

Anant Patil anant.techie at gmail.com
Tue Dec 2 18:08:04 UTC 2014


On Tue, Dec 2, 2014 at 1:49 AM, Clint Byrum <clint at fewbar.com> wrote:

> Excerpts from Anant Patil's message of 2014-11-30 23:02:29 -0800:
> > On 27-Nov-14 18:03, Murugan, Visnusaran wrote:
> > > Hi Zane,
> > >
> > >
> > >
> > > At this stage our implementation (as mentioned in the wiki
> > > <https://wiki.openstack.org/wiki/Heat/ConvergenceDesign>) achieves your
> > > design goals.
> > >
> > >
> > >
> > > 1.       In case of a parallel update, our implementation adjusts the
> > > graph according to the new template and waits for dispatched resource
> > > tasks to complete.
> > >
> > > 2.       Reason for basing our PoC on Heat code:
> > >
> > > a.       To solve contention when all dependent resources process the
> > > parent resource in parallel.
> > >
> > > b.      To avoid porting issues from the PoC to the Heat base (just to be
> > > aware of potential issues asap).
> > >
> > > 3.       Resource timeout would be helpful, but I guess it's resource
> > > specific and has to come from the template, with default values from plugins.
> > >
> > > 4.       We see resource notification aggregation and processing the next
> > > level of resources without contention and with minimal DB usage as the
> > > problem area. We are working on the following approaches in *parallel*:
> > >
> > > a.       Use a queue per stack to serialize notifications.
> > >
> > > b.      Get the parent ProcessLog (ResourceID, EngineID) and initiate
> > > convergence upon the first child notification. Subsequent children that
> > > fail to get the parent resource lock will directly send a message to the
> > > waiting parent task (topic=stack_id.parent_resource_id).
> > >
> > > Based on performance/feedback we can select either approach, or a
> > > combined version.
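
To make approach (b) concrete, here is a tiny in-memory sketch; every name in
it is made up for illustration, and a real implementation would use the
ProcessLog table plus an RPC topic per waiting parent task rather than dicts:

    process_log = {}        # parent_resource_id -> engine_id holding the lock
    parent_messages = {}    # "stack_id.parent_resource_id" -> pending events

    def try_acquire_resource_lock(resource_id, engine_id):
        # setdefault stands in for an atomic DB insert of (ResourceID, EngineID).
        return process_log.setdefault(resource_id, engine_id) == engine_id

    def on_child_complete(stack_id, parent_resource_id, engine_id):
        if try_acquire_resource_lock(parent_resource_id, engine_id):
            # The first child to notify wins the lock and converges the parent.
            print('%s converging %s' % (engine_id, parent_resource_id))
        else:
            # Later children message the task already waiting on the parent.
            topic = '%s.%s' % (stack_id, parent_resource_id)
            parent_messages.setdefault(topic, []).append('child_completed')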
> > >
> > >
> > >
> > > Advantages:
> > >
> > > 1.       Failed Resource tasks can be re-initiated after ProcessLog
> > > table lookup.
> > >
> > > 2.       One worker == one resource.
> > >
> > > 3.       Supports concurrent updates
> > >
> > > 4.       Delete == update with empty stack
> > >
> > > 5.       Rollback == update to previous known good/completed stack.
> > >
> > >
> > >
> > > Disadvantages:
> > >
> > > 1.       Still holds stackLock (WIP to remove with ProcessLog)
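
For reference, a guess at what a minimal ProcessLog table might look like in
SQLAlchemy; only ResourceID and EngineID are mentioned above, the remaining
columns are assumptions:

    from sqlalchemy import Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class ProcessLog(Base):
        __tablename__ = 'process_log'
        id = Column(Integer, primary_key=True)
        stack_id = Column(String(36), nullable=False)
        resource_id = Column(String(36), nullable=False, unique=True)
        engine_id = Column(String(36))   # engine currently working the resource
        status = Column(String(16))      # e.g. IN_PROGRESS / COMPLETE / FAILED

    # A failed resource task could then be re-initiated by looking up rows
    # whose engine_id belongs to a dead engine.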
> > >
> > >
> > >
> > > Completely understand your concern on reviewing our code, since the
> > > commits are numerous and there are changes of course in places. Our start
> > > commit is [c1b3eb22f7ab6ea60b095f88982247dd249139bf], though this might
> > > not help :)
> > >
> > >
> > >
> > > Your Thoughts.
> > >
> > >
> > >
> > > Happy Thanksgiving.
> > >
> > > Vishnu.
> > >
> > >
> > >
> > > *From:*Angus Salkeld [mailto:asalkeld at mirantis.com]
> > > *Sent:* Thursday, November 27, 2014 9:46 AM
> > > *To:* OpenStack Development Mailing List (not for usage questions)
> > > *Subject:* Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
> > >
> > >
> > >
> > > On Thu, Nov 27, 2014 at 12:20 PM, Zane Bitter <zbitter at redhat.com
> > > <mailto:zbitter at redhat.com>> wrote:
> > >
> > >     A bunch of us have spent the last few weeks working independently on
> > >     proof of concept designs for the convergence architecture. I think
> > >     those efforts have now reached a sufficient level of maturity that
> > >     we should start working together on synthesising them into a plan
> > >     that everyone can forge ahead with. As a starting point I'm going to
> > >     summarise my take on the three efforts; hopefully the authors of the
> > >     other two will weigh in to give us their perspective.
> > >
> > >
> > >     Zane's Proposal
> > >     ===============
> > >
> > >
> > >     https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph
> > >
> > >     I implemented this as a simulator of the algorithm rather than using
> > >     the Heat codebase itself in order to be able to iterate rapidly on
> > >     the design, and indeed I have changed my mind many, many times in
> > >     the process of implementing it. Its notable departure from a
> > >     realistic simulation is that it runs only one operation at a time -
> > >     essentially giving up the ability to detect race conditions in
> > >     exchange for a completely deterministic test framework. You just
> > >     have to imagine where the locks need to be. Incidentally, the test
> > >     framework is designed so that it can easily be ported to the actual
> > >     Heat code base as functional tests so that the same scenarios could
> > >     be used without modification, allowing us to have confidence that
> > >     the eventual implementation is a faithful replication of the
> > >     simulation (which can be rapidly experimented on, adjusted and
> > >     tested when we inevitably run into implementation issues).
> > >
> > >     This is a complete implementation of Phase 1 (i.e. using existing
> > >     resource plugins), including update-during-update, resource
> > >     clean-up, replace on update and rollback; with tests.
> > >
> > >     Some of the design goals which were successfully incorporated:
> > >     - Minimise changes to Heat (it's essentially a distributed version
> > >     of the existing algorithm), and in particular to the database
> > >     - Work with the existing plugin API
> > >     - Limit total DB access for Resource/Stack to O(n) in the number of
> > >     resources
> > >     - Limit overall DB access to O(m) in the number of edges
> > >     - Limit lock contention to only those operations actually contending
> > >     (i.e. no global locks)
> > >     - Each worker task deals with only one resource
> > >     - Only read resource attributes once
> > >
> > >
> > >     Open questions:
> > >     - What do we do when we encounter a resource that is in progress
> > >     from a previous update while doing a subsequent update? Obviously we
> > >     don't want to interrupt it, as it will likely be left in an unknown
> > >     state. Making a replacement is one obvious answer, but in many cases
> > >     there could be serious down-sides to that. How long should we wait
> > >     before trying it? What if it's still in progress because the engine
> > >     processing the resource already died?
> > >
> > >
> > >
> > > Also, how do we implement resource level timeouts in general?
> > >
> > >
> > >
> > >
> > >     Michał's Proposal
> > >     =================
> > >
> > >     https://github.com/inc0/heat-convergence-prototype/tree/iterative
> > >
> > >     Note that a version modified by me to use the same test scenario
> > >     format (but not the same scenarios) is here:
> > >
> > >
> > >     https://github.com/zaneb/heat-convergence-prototype/tree/iterative-adapted
> > >
> > >     This is based on my simulation framework after a fashion, but with
> > >     everything implemented synchronously and a lot of handwaving about
> > >     how the actual implementation could be distributed. The central
> > >     premise is that at each step of the algorithm, the entire graph is
> > >     examined for tasks that can be performed next, and those are then
> > >     started. Once all are complete (it's synchronous, remember), the
> > >     next step is run. Keen observers will be asking how we know when it
> > >     is time to run the next step in a distributed version of this
> > >     algorithm, where it will be run and what to do about resources that
> > >     are in an intermediate state at that time. All of these questions
> > >     remain unanswered.
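
The stepping described above boils down to roughly the following synchronous
loop (illustrative only, not Michał's actual code):

    def run(graph, converge):
        """graph maps each resource to the set of resources it depends on."""
        done = set()
        while len(done) < len(graph):
            # Examine the entire graph for tasks that can be performed next.
            ready = [r for r, deps in graph.items()
                     if r not in done and deps <= done]
            if not ready:
                raise RuntimeError('cycle or unresolvable dependency')
            for r in ready:        # run the whole step...
                converge(r)
            done.update(ready)     # ...and only then move on to the next step

    # e.g. B and C depend on A, D depends on B and C:
    run({'A': set(), 'B': {'A'}, 'C': {'A'}, 'D': {'B', 'C'}}, print)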
> > >
> > >
> > >
> > > Yes, I was struggling to figure out how it could manage an IN_PROGRESS
> > > state as it's stateless. So you end up treading on the other action's toes.
> > >
> > > Assuming we use the resource's state (IN_PROGRESS) you could get around
> > > that. Then you kick off a converge whenever an action completes (if there
> > > is nothing new to be done, then do nothing).
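
Roughly, that event-driven variant could look like this sketch (state names
and helpers are assumptions, not code from either PoC):

    PENDING, IN_PROGRESS, COMPLETE = 'PENDING', 'IN_PROGRESS', 'COMPLETE'

    def on_action_complete(graph, states, start_converge):
        """Called whenever any resource action finishes."""
        for res, deps in graph.items():
            if states[res] != PENDING:
                continue                    # done already, or someone is on it
            if all(states[d] == COMPLETE for d in deps):
                states[res] = IN_PROGRESS   # avoid treading on this action's toes
                start_converge(res)
        # If nothing was ready we simply did nothing and wait for the next event.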
> > >
> > >
> > >
> > >
> > >     A non-exhaustive list of concerns I have:
> > >     - Replace on update is not implemented yet
> > >     - AFAIK rollback is not implemented yet
> > >     - The simulation doesn't actually implement the proposed architecture
> > >     - This approach is punishingly heavy on the database - O(n^2) or worse
> > >
> > >
> > >
> > > Yes, re-reading the state of all resources whenever we run a new converge
> > > is worrying, but I think Michal had some ideas to minimize this.
> > >
> > >
> > >
> > >     - A lot of phase 2 is mixed in with phase 1 here, making it
> > >     difficult to evaluate which changes need to be made first and
> > >     whether this approach works with existing plugins
> > >     - The code is not really based on how Heat works at the moment, so
> > >     there would be either a major redesign required or lots of radical
> > >     changes in Heat or both
> > >
> > >     I think there's a fair chance that given another 3-4 weeks to work
> > >     on this, all of these issues and others could probably be resolved.
> > >     The question for me at this point is not so much "if" but "why".
> > >
> > >     Michał believes that this approach will make Phase 2 easier to
> > >     implement, which is a valid reason to consider it. However, I'm not
> > >     aware of any particular issues that my approach would cause in
> > >     implementing phase 2 (note that I have barely looked into it at all
> > >     though). In fact, I very much want Phase 2 to be entirely
> > >     encapsulated by the Resource class, so that the plugin type (legacy
> > >     vs. convergence-enabled) is transparent to the rest of the system.
> > >     Only in this way can we be sure that we'll be able to maintain
> > >     support for legacy plugins. So a phase 1 that mixes in aspects of
> > >     phase 2 is actually a bad thing in my view.
> > >
> > >     I really appreciate the effort that has gone into this already, but
> > >     in the absence of specific problems with building phase 2 on top of
> > >     another approach that are solved by this one, I'm ready to call this
> > >     a distraction.
> > >
> > >
> > >
> > > In its defence, I like the simplicity of it. The concepts and code are
> > > easy to understand - though part of this is that it doesn't implement all
> > > the stuff on your list yet.
> > >
> > >
> > >
> > >
> > >
> > >     Anant & Friends' Proposal
> > >     =========================
> > >
> > >     First off, I have found this very difficult to review properly since
> > >     the code is not separate from the huge mass of Heat code and nor is
> > >     the commit history in the form that patch submissions would take
> > >     (but rather includes backtracking and iteration on the design). As a
> > >     result, most of the information here has been gleaned from
> > >     discussions about the code rather than direct review. I have
> > >     repeatedly suggested that this proof of concept work should be done
> > >     using the simulator framework instead, unfortunately so far to no
> > >     avail.
> > >
> > >     The last we heard on the mailing list about this, resource clean-up
> > >     had not yet been implemented. That was a major concern because that
> > >     is the more difficult half of the algorithm. Since then there have
> > >     been a lot more commits, but it's not yet clear whether resource
> > >     clean-up, update-during-update, replace-on-update and rollback have
> > >     been implemented, though it is clear that at least some progress has
> > >     been made on most or all of them. Perhaps someone can give us an
> > >     update.
> > >
> > >
> > > https://github.com/anantpatil/heat-convergence-poc
> > >
> > >
> > >
> > >     AIUI this code also mixes phase 2 with phase 1, which is a concern.
> > >     For me the highest priority for phase 1 is to be sure that it works
> > >     with existing plugins. Not only because we need to continue to
> > >     support them, but because converting all of our existing
> > >     'integration-y' unit tests to functional tests that operate in a
> > >     distributed system is virtually impossible in the time frame we have
> > >     available. So the existing test code needs to stick around, and the
> > >     existing stack create/update/delete mechanisms need to remain in
> > >     place until such time as we have equivalent functional test coverage
> > >     to begin eliminating existing unit tests. (We'll also, of course,
> > >     need to have unit tests for the individual elements of the new
> > >     distributed workflow, functional tests to confirm that the
> > >     distributed workflow works in principle as a whole - the scenarios
> > >     from the simulator can help with _part_ of this - and, not least, an
> > >     algorithm that is as similar as possible to the current one so that
> > >     our existing tests remain at least somewhat representative and don't
> > >     require too many major changes themselves.)
> > >
> > >     Speaking of tests, I gathered that this branch included tests, but I
> > >     don't know to what extent there are automated end-to-end functional
> > >     tests of the algorithm?
> > >
> > >     From what I can gather, the approach seems broadly similar to the
> > >     one I eventually settled on also. The major difference appears to be
> > >     in how we merge two or more streams of execution (i.e. when one
> > >     resource depends on two or more others). In my approach, the
> > >     dependencies are stored in the resources and each joining of streams
> > >     creates a database row to track it, which is easily locked with
> > >     contention on the lock extending only to those resources which are
> > >     direct dependencies of the one waiting. In this approach, both the
> > >     dependencies and the progress through the graph are stored in a
> > >     database table, necessitating (a) reading of the entire table (as it
> > >     relates to the current stack) on every resource operation, and (b)
> > >     locking of the entire table (which is hard) when marking a resource
> > >     operation complete.
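
A sketch of that per-merge-point row (schema and names are guesses, not Zane's
actual prototype code); only the engines finishing direct dependencies of the
waiting resource ever contend on this one row:

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class SyncPoint(Base):
        __tablename__ = 'sync_point'
        id = Column(Integer, primary_key=True)
        resource_id = Column(String(36), unique=True)  # the waiting resource
        remaining = Column(Integer)                    # unsatisfied direct deps

    engine = create_engine('sqlite://')
    Base.metadata.create_all(engine)
    Session = sessionmaker(bind=engine)

    def dependency_done(waiting_resource_id):
        """Called by whichever engine just finished one direct dependency."""
        session = Session()
        row = (session.query(SyncPoint)
               .filter_by(resource_id=waiting_resource_id)
               .with_for_update().one())    # lock only this one merge point
        row.remaining -= 1
        ready = (row.remaining == 0)
        session.commit()
        return ready                        # the last notifier triggers the resource

    # e.g. for a resource 'D' with two direct dependencies:
    s = Session(); s.add(SyncPoint(resource_id='D', remaining=2)); s.commit()
    assert dependency_done('D') is False
    assert dependency_done('D') is True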
> > >
> > >     I chatted to Anant about this today and he mentioned that they had
> > >     solved the locking problem by dispatching updates to a queue that is
> > >     read by a single engine per stack.
> > >
> > >     My approach also has the neat side-effects of pushing the data
> > >     required to resolve get_resource and get_att (without having to
> > >     reload the resources again and query them) as well as to update
> > >     dependencies (e.g. because of a replacement or deletion) along with
> > >     the flow of triggers. I don't know if anything similar is at work here.
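
For illustration, that trigger payload idea might look something like this
(field names are assumptions):

    def make_trigger(resource):
        """Build the message sent to dependent resources when this one is done."""
        return {
            'name': resource['name'],
            'physical_id': resource['physical_id'],     # feeds {get_resource: name}
            'attributes': resource['attributes'],       # feeds {get_attr: [name, x]}
            'replaced_by': resource.get('replaced_by'), # propagate replacements
        }

    def resolve_get_attr(triggers, res_name, attr):
        # The waiting resource answers get_attr from accumulated trigger
        # payloads, without loading the dependency from the database again.
        return triggers[res_name]['attributes'][attr]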
> > >
> > >     It's entirely possible that the best design might combine elements
> > >     of both approaches.
> > >
> > >     The same open questions I detailed under my proposal also apply to
> > >     this one, if I understand correctly.
> > >
> > >
> > >     I'm certain that I won't have represented everyone's work fairly
> > >     here, so I encourage folks to dive in and correct any errors about
> > >     theirs and ask any questions you might have about mine. (In case you
> > >     have been living under a rock, note that I'll be out of the office
> > >     for the rest of the week due to Thanksgiving so don't expect
> > >     immediate replies.)
> > >
> > >     I also think this would be a great time for the wider Heat community
> > >     to dive in and start asking questions and suggesting ideas. We need
> > >     to, ahem, converge on a shared understanding of the design so we can
> > >     all get to work delivering it for Kilo.
> > >
> > >
> > >
> > > Agree, we need to get moving on this.
> > >
> > > -Angus
> > >
> > >
> > >
> > >     cheers,
> > >     Zane.
> > >
> >
> > Thanks Zane for your e-mail and for summarizing everyone's work.
> >
> > The design goals mentioned above look more like performance goals and
> > constraints to me. I understand that it is unacceptable to have a poorly
> > performing engine or broken Resource plug-ins. The convergence spec clearly
> > mentions that the existing Resource plugins should not be changed.
> >
> > IMHO, and my team's HO, the design goals of convergence would be:
> > 1. Stability: No transient failures, either in OpenStack/external
> > services or in the resources themselves, should fail the stack. Therefore,
> > we need to have Observers check for divergence and converge a resource
> > if needed, to bring it back to a stable state.
> > 2. Resiliency: Heat engines should be able to take up tasks in case of
> > failures/restarts.
> > 3. Backward compatibility: "We don't break the user space." No existing
> > stacks should break.
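
A toy sketch of the observe-and-converge idea in goal 1; how the observed
state is obtained and what converge() does are deliberately left abstract:

    import time

    def observer_loop(desired, observe, converge, interval=30):
        """Periodically compare observed state against the desired state."""
        while True:
            for name, want in desired.items():
                got = observe(name)
                if got != want:
                    # Divergence (e.g. a transient backend failure): converge
                    # the single resource instead of failing the whole stack.
                    converge(name, want)
            time.sleep(interval)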
> >
> > We started the PoC with these goals in mind; any performance
> > optimization would be a plus point for us. Note that I am setting aside the
> > performance goal for now; it should be next in the pipeline. The
> > questions we should ask ourselves are: are we storing enough data
> > (the state of the stack) in the DB to enable resiliency? Are we distributing
> > the load evenly to all Heat engines? Does our notification mechanism
> > provide us some form of guarantee or acknowledgement?
> >
> > In retrospect, we had to struggle a lot to understand the existing
> > Heat engine. We couldn't have done justice by just creating another
> > project on GitHub without any concrete understanding of the existing
> > state of affairs. We are not on the same page as the Heat core members;
> > we are novices and the cores are experts.
> >
> > I am glad that we experimented with the Heat engine directly. The
> > current Heat engine is not resilient and the messaging also lacks
> > reliability. We (my team, and I guess the cores also) understand that async
> > message passing would be the way to go, as synchronous RPC calls simply
> > wouldn't scale. But with async message passing there has to be some
> > mechanism for ACKing back, which I think is lacking in the current
> > infrastructure.
> >
> > How could we provide stable user-defined stacks if the underlying Heat
> > core lacks stability? Convergence is all about stable stacks. To make the
> > current Heat core stable we need to have, at the least:
> > 1. Some mechanism to ACK back messages over AMQP, or some other solid
> > mechanism of message passing.
> > 2. Some mechanism for fault tolerance in the Heat engine using external
> > tools/infrastructure like Celery/ZooKeeper. Without an external
> > infrastructure/tool we will end up bloating the Heat engine with a lot of
> > boiler-plate code to achieve this. We had recommended Celery in our
> > previous e-mail (from Vishnu).
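
To make point 2 concrete, a minimal Celery sketch (the broker URL, task body
and exception type are placeholders): with acks_late the broker only ACKs the
message after the task finishes, so work from a crashed worker is redelivered
to another one.

    from celery import Celery

    app = Celery('heat_converge', broker='amqp://guest@localhost//')

    class TransientError(Exception):
        """Placeholder for whatever transient failures we want to retry."""

    def do_converge(stack_id, resource_id):
        print('converging %s in stack %s' % (resource_id, stack_id))  # placeholder

    @app.task(bind=True, acks_late=True, max_retries=3)
    def converge_resource(self, stack_id, resource_id):
        try:
            do_converge(stack_id, resource_id)
        except TransientError as exc:
            # Retry rather than failing the stack on a transient error.
            raise self.retry(exc=exc, countdown=10)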
> >
> > It was due to our experiments with the Heat engine for this PoC that we
> > could come up with the above recommendations.
> >
> > State of our PoC
> > ---------------
> >
> > On GitHub: https://github.com/anantpatil/heat-convergence-poc
> >
> > Our current implementation of the PoC locks the stack after each
> > notification to mark the graph as traversed and produce the next level of
> > resources for convergence. We are facing challenges in
> > removing/minimizing these locks. We also have two different schools of
> > thought for solving this lock issue, as mentioned above in Vishnu's
> > e-mail. I will describe these in detail in the Wiki. There will be different
> > branches in our GitHub repository for these two approaches.
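
Roughly, the current locked flow looks like this (an in-memory illustration
only; the real code holds the stack lock rather than a threading.Lock):

    import threading

    stack_locks = {}   # stack_id -> lock (a stand-in for the DB stack lock)

    def on_resource_done(stack_id, resource_id, graph, traversed, dispatched,
                         dispatch):
        lock = stack_locks.setdefault(stack_id, threading.Lock())
        with lock:                        # serialize notifications per stack
            traversed.add(resource_id)    # mark the graph node as traversed
            ready = [r for r, deps in graph.items()
                     if r not in traversed | dispatched and deps <= traversed]
            dispatched.update(ready)
        for r in ready:                   # produce the next level for convergence
            dispatch(r)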
> >
>
> It would be helpful if you explained why you need to _lock_ the stack.
> MVCC in the database should be enough here. Basically you need to:
>
> begin transaction
> update traversal information
> select resolvable nodes
> {in code not sql -- send converge commands into async queue}
> commit
>
> Any failure inside this transaction should roll back the transaction and
> retry it. It is OK to have duplicate converge commands for a resource.
>
> This should be the single point of synchronization between workers that
> are resolving resources. Or perhaps this is the lock you meant? Either
> way, this isn't avoidable if you want to make sure everything is attempted
> at least once without having to continuously poll and re-poll the stack
> to look for unresolved resources. That is an option, but not one that I
> think is going to be as simple as the transactional method.
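
For what it's worth, that transactional step could be shaped roughly like this
in SQLAlchemy (the traversal/resolvable helpers are passed in because their
real shape depends on the PoC schema):

    from sqlalchemy.exc import DBAPIError

    def take_step(Session, stack_id, mark_traversed, select_resolvable, dispatch,
                  max_attempts=5):
        """One traversal step: update traversal info, pick resolvable nodes,
        queue converge commands, commit -- retrying the whole step on failure."""
        for _ in range(max_attempts):
            session = Session()
            try:
                mark_traversed(session, stack_id)             # update traversal info
                nodes = select_resolvable(session, stack_id)  # select resolvable nodes
                for node in nodes:
                    dispatch(node)      # send converge commands into the async queue
                session.commit()
                return nodes
            except DBAPIError:
                session.rollback()      # duplicate converge commands are OK
            finally:
                session.close()
        raise RuntimeError('traversal step kept failing after %d attempts'
                           % max_attempts)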
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev at lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>


More information about the OpenStack-dev mailing list