<div dir="ltr"><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Dec 2, 2014 at 1:49 AM, Clint Byrum <span dir="ltr"><<a href="mailto:clint@fewbar.com" target="_blank">clint@fewbar.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Excerpts from Anant Patil's message of 2014-11-30 23:02:29 -0800:<br>
<div><div class="h5">> On 27-Nov-14 18:03, Murugan, Visnusaran wrote:<br>
> > Hi Zane,<br>
> ><br>
> ><br>
> ><br>
> > At this stage our implementation (as mentioned in the wiki<br>
> > <<a href="https://wiki.openstack.org/wiki/Heat/ConvergenceDesign" target="_blank">https://wiki.openstack.org/wiki/Heat/ConvergenceDesign</a>>) achieves your<br>
> > design goals.<br>
> ><br>
> ><br>
> ><br>
> > 1. In the case of a parallel update, our implementation adjusts the<br>
> > graph according to the new template and waits for dispatched resource<br>
> > tasks to complete.<br>
> ><br>
> > 2. Reasons for basing our PoC on the Heat code:<br>
> ><br>
> > a. To solve the contention that arises when all dependent resources<br>
> > process the parent resource in parallel.<br>
> ><br>
> > b. To avoid porting issues from the PoC to the Heat code base (just<br>
> > to be aware of potential issues ASAP).<br>
> ><br>
> > 3. Resource timeouts would be helpful, but I guess they are resource<br>
> > specific and have to come from the template, with default values from<br>
> > plugins.<br>
> ><br>
> > 4. We see aggregating resource notifications and processing the next<br>
> > level of resources without contention and with minimal DB usage as the<br>
> > problem area. We are working on the following approaches in *parallel*.<br>
> ><br>
> > a. Use a queue per stack to serialize notifications.<br>
> ><br>
> > b. Get the parent ProcessLog (ResourceID, EngineID) and initiate<br>
> > convergence upon the first child notification. Subsequent children that<br>
> > fail to get the parent resource lock will directly send a message to the<br>
> > waiting parent task (topic=stack_id.parent_resource_id).<br>
> ><br>
> > Based on performance/feedback we can select either one, or a combination<br>
> > of the two.<br>
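> ><br>
> > A rough sketch of option (b), with purely illustrative helper names<br>
> > (none of these are existing Heat APIs):<br>
> ><br>
> > def notify_parent(context, stack_id, parent_id, child_id, engine_id):<br>
> >     # First child to acquire the parent's ProcessLog (ResourceID,<br>
> >     # EngineID) lock kicks off the parent's convergence.<br>
> >     if try_acquire_process_log(context, parent_id, engine_id):<br>
> >         start_converge(context, stack_id, parent_id)<br>
> >     else:<br>
> >         # Parent task is already running/waiting elsewhere; message<br>
> >         # it directly on its per-resource topic.<br>
> >         topic = "%s.%s" % (stack_id, parent_id)<br>
> >         rpc_cast(topic, 'child_complete', child_id=child_id)<br>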
> ><br>
> ><br>
> ><br>
> > Advantages:<br>
> ><br>
> > 1. Failed Resource tasks can be re-initiated after ProcessLog<br>
> > table lookup.<br>
> ><br>
> > 2. One worker == one resource.<br>
> ><br>
> > 3. Supports concurrent updates<br>
> ><br>
> > 4. Delete == update with empty stack<br>
> ><br>
> > 5. Rollback == update to the previous known good/completed stack.<br>
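> ><br>
> > For example, (4) and (5) reduce to something like this (hypothetical<br>
> > function names, just to illustrate the equivalence):<br>
> ><br>
> > def delete_stack(stack):<br>
> >     return update_stack(stack, empty_template())<br>
> ><br>
> > def rollback_stack(stack):<br>
> >     return update_stack(stack, stack.last_completed_template)<br>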
> ><br>
> ><br>
> ><br>
> > Disadvantages:<br>
> ><br>
> > 1. Still holds stackLock (WIP to remove with ProcessLog)<br>
> ><br>
> ><br>
> ><br>
> > We completely understand your concern about reviewing our code, since the<br>
> > commits are numerous and there are changes of course in places. Our<br>
> > starting commit is [c1b3eb22f7ab6ea60b095f88982247dd249139bf], though<br>
> > this might not help :)<br>
> ><br>
> ><br>
> ><br>
> > Your Thoughts.<br>
> ><br>
> ><br>
> ><br>
> > Happy Thanksgiving.<br>
> ><br>
> > Vishnu.<br>
> ><br>
> ><br>
> ><br>
> > *From:*Angus Salkeld [mailto:<a href="mailto:asalkeld@mirantis.com">asalkeld@mirantis.com</a>]<br>
> > *Sent:* Thursday, November 27, 2014 9:46 AM<br>
> > *To:* OpenStack Development Mailing List (not for usage questions)<br>
> > *Subject:* Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown<br>
> ><br>
> ><br>
> ><br>
> > On Thu, Nov 27, 2014 at 12:20 PM, Zane Bitter <<a href="mailto:zbitter@redhat.com">zbitter@redhat.com</a><br>
> > <mailto:<a href="mailto:zbitter@redhat.com">zbitter@redhat.com</a>>> wrote:<br>
> ><br>
> > A bunch of us have spent the last few weeks working independently on<br>
> > proof of concept designs for the convergence architecture. I think<br>
> > those efforts have now reached a sufficient level of maturity that<br>
> > we should start working together on synthesising them into a plan<br>
> > that everyone can forge ahead with. As a starting point I'm going to<br>
> > summarise my take on the three efforts; hopefully the authors of the<br>
> > other two will weigh in to give us their perspective.<br>
> ><br>
> ><br>
> > Zane's Proposal<br>
> > ===============<br>
> ><br>
> > <a href="https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph" target="_blank">https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph</a><br>
> ><br>
> > I implemented this as a simulator of the algorithm rather than using<br>
> > the Heat codebase itself in order to be able to iterate rapidly on<br>
> > the design, and indeed I have changed my mind many, many times in<br>
> > the process of implementing it. Its notable departure from a<br>
> > realistic simulation is that it runs only one operation at a time -<br>
> > essentially giving up the ability to detect race conditions in<br>
> > exchange for a completely deterministic test framework. You just<br>
> > have to imagine where the locks need to be. Incidentally, the test<br>
> > framework is designed so that it can easily be ported to the actual<br>
> > Heat code base as functional tests so that the same scenarios could<br>
> > be used without modification, allowing us to have confidence that<br>
> > the eventual implementation is a faithful replication of the<br>
> > simulation (which can be rapidly experimented on, adjusted and<br>
> > tested when we inevitably run into implementation issues).<br>
> ><br>
> > This is a complete implementation of Phase 1 (i.e. using existing<br>
> > resource plugins), including update-during-update, resource<br>
> > clean-up, replace on update and rollback; with tests.<br>
> ><br>
> > Some of the design goals which were successfully incorporated:<br>
> > - Minimise changes to Heat (it's essentially a distributed version<br>
> > of the existing algorithm), and in particular to the database<br>
> > - Work with the existing plugin API<br>
> > - Limit total DB access for Resource/Stack to O(n) in the number of<br>
> > resources<br>
> > - Limit overall DB access to O(m) in the number of edges<br>
> > - Limit lock contention to only those operations actually contending<br>
> > (i.e. no global locks)<br>
> > - Each worker task deals with only one resource<br>
> > - Only read resource attributes once<br>
> ><br>
> ><br>
> > Open questions:<br>
> > - What do we do when we encounter a resource that is in progress<br>
> > from a previous update while doing a subsequent update? Obviously we<br>
> > don't want to interrupt it, as it will likely be left in an unknown<br>
> > state. Making a replacement is one obvious answer, but in many cases<br>
> > there could be serious down-sides to that. How long should we wait<br>
> > before trying it? What if it's still in progress because the engine<br>
> > processing the resource already died?<br>
> ><br>
> ><br>
> ><br>
> > Also, how do we implement resource level timeouts in general?<br>
> ><br>
> ><br>
> ><br>
> ><br>
> > Michał's Proposal<br>
> > =================<br>
> ><br>
> > <a href="https://github.com/inc0/heat-convergence-prototype/tree/iterative" target="_blank">https://github.com/inc0/heat-convergence-prototype/tree/iterative</a><br>
> ><br>
> > Note that a version modified by me to use the same test scenario<br>
> > format (but not the same scenarios) is here:<br>
> ><br>
> > <a href="https://github.com/zaneb/heat-convergence-prototype/tree/iterative-adapted" target="_blank">https://github.com/zaneb/heat-convergence-prototype/tree/iterative-adapted</a><br>
> ><br>
> > This is based on my simulation framework after a fashion, but with<br>
> > everything implemented synchronously and a lot of handwaving about<br>
> > how the actual implementation could be distributed. The central<br>
> > premise is that at each step of the algorithm, the entire graph is<br>
> > examined for tasks that can be performed next, and those are then<br>
> > started. Once all are complete (it's synchronous, remember), the<br>
> > next step is run. Keen observers will be asking how we know when it<br>
> > is time to run the next step in a distributed version of this<br>
> > algorithm, where it will be run and what to do about resources that<br>
> > are in an intermediate state at that time. All of these questions<br>
> > remain unanswered.<br>
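> ><br>
> > As a sketch, the synchronous core amounts to something like this<br>
> > (illustrative only; the names don't map onto the actual prototype):<br>
> ><br>
> > def converge(graph, done):<br>
> >     # graph maps each node to the set of nodes it depends on;<br>
> >     # done is the set of nodes already completed<br>
> >     while len(done) < len(graph):<br>
> >         ready = [n for n, deps in graph.items()<br>
> >                  if n not in done and deps <= done]<br>
> >         if not ready:<br>
> >             break  # nothing runnable (cycle or failed dependency)<br>
> >         for node in ready:<br>
> >             run_task(node)   # synchronous: blocks until complete<br>
> >         done.update(ready)   # only then is the next step examined<br>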
> ><br>
> ><br>
> ><br>
> > Yes, I was struggling to figure out how it could manage an IN_PROGRESS<br>
> > state as it's stateless. So you end up treading on the other action's toes.<br>
> ><br>
> > Assuming we use the resource's state (IN_PROGRESS) you could get around<br>
> > that. Then you kick off a converge whenever an action completes (if<br>
> > there is nothing new to be done then do nothing).<br>
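> ><br>
> > i.e. something like (names made up):<br>
> ><br>
> > def on_action_complete(stack, resource):<br>
> >     resource.set_state(COMPLETE)       # clear IN_PROGRESS<br>
> >     if stack.has_unconverged_work():   # anything newly unblocked?<br>
> >         schedule_converge(stack)       # otherwise, do nothing<br>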
> ><br>
> ><br>
> ><br>
> ><br>
> > A non-exhaustive list of concerns I have:<br>
> > - Replace on update is not implemented yet<br>
> > - AFAIK rollback is not implemented yet<br>
> > - The simulation doesn't actually implement the proposed architecture<br>
> > - This approach is punishingly heavy on the database - O(n^2) or worse<br>
> ><br>
> ><br>
> ><br>
> > Yes, re-reading the state of all resources whenever a new converge is<br>
> > run is worrying, but I think Michal had some ideas to minimize this.<br>
> ><br>
> ><br>
> ><br>
> > - A lot of phase 2 is mixed in with phase 1 here, making it<br>
> > difficult to evaluate which changes need to be made first and<br>
> > whether this approach works with existing plugins<br>
> > - The code is not really based on how Heat works at the moment, so<br>
> > there would be either a major redesign required or lots of radical<br>
> > changes in Heat or both<br>
> ><br>
> > I think there's a fair chance that given another 3-4 weeks to work<br>
> > on this, all of these issues and others could probably be resolved.<br>
> > The question for me at this point is not so much "if" but "why".<br>
> ><br>
> > Michał believes that this approach will make Phase 2 easier to<br>
> > implement, which is a valid reason to consider it. However, I'm not<br>
> > aware of any particular issues that my approach would cause in<br>
> > implementing phase 2 (note that I have barely looked into it at all<br>
> > though). In fact, I very much want Phase 2 to be entirely<br>
> > encapsulated by the Resource class, so that the plugin type (legacy<br>
> > vs. convergence-enabled) is transparent to the rest of the system.<br>
> > Only in this way can we be sure that we'll be able to maintain<br>
> > support for legacy plugins. So a phase 1 that mixes in aspects of<br>
> > phase 2 is actually a bad thing in my view.<br>
> ><br>
> > I really appreciate the effort that has gone into this already, but<br>
> > in the absence of specific problems with building phase 2 on top of<br>
> > another approach that are solved by this one, I'm ready to call this<br>
> > a distraction.<br>
> ><br>
> ><br>
> ><br>
> > In its defence, I like the simplicity of it. The concepts and code are<br>
> > easy to understand - though part of this is that it doesn't implement all<br>
> > the stuff on your list yet.<br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> > Anant & Friends' Proposal<br>
> > =========================<br>
> ><br>
> > First off, I have found this very difficult to review properly since<br>
> > the code is not separate from the huge mass of Heat code and nor is<br>
> > the commit history in the form that patch submissions would take<br>
> > (but rather includes backtracking and iteration on the design). As a<br>
> > result, most of the information here has been gleaned from<br>
> > discussions about the code rather than direct review. I have<br>
> > repeatedly suggested that this proof of concept work should be done<br>
> > using the simulator framework instead, unfortunately so far to no avail.<br>
> ><br>
> > The last we heard on the mailing list about this, resource clean-up<br>
> > had not yet been implemented. That was a major concern because that<br>
> > is the more difficult half of the algorithm. Since then there have<br>
> > been a lot more commits, but it's not yet clear whether resource<br>
> > clean-up, update-during-update, replace-on-update and rollback have<br>
> > been implemented, though it is clear that at least some progress has<br>
> > been made on most or all of them. Perhaps someone can give us an update.<br>
> ><br>
> ><br>
> > <a href="https://github.com/anantpatil/heat-convergence-poc" target="_blank">https://github.com/anantpatil/heat-convergence-poc</a><br>
> ><br>
> ><br>
> ><br>
> > AIUI this code also mixes phase 2 with phase 1, which is a concern.<br>
> > For me the highest priority for phase 1 is to be sure that it works<br>
> > with existing plugins. Not only because we need to continue to<br>
> > support them, but because converting all of our existing<br>
> > 'integration-y' unit tests to functional tests that operate in a<br>
> > distributed system is virtually impossible in the time frame we have<br>
> > available. So the existing test code needs to stick around, and the<br>
> > existing stack create/update/delete mechanisms need to remain in<br>
> > place until such time as we have equivalent functional test coverage<br>
> > to begin eliminating existing unit tests. (We'll also, of course,<br>
> > need to have unit tests for the individual elements of the new<br>
> > distributed workflow, functional tests to confirm that the<br>
> > distributed workflow works in principle as a whole - the scenarios<br>
> > from the simulator can help with _part_ of this - and, not least, an<br>
> > algorithm that is as similar as possible to the current one so that<br>
> > our existing tests remain at least somewhat representative and don't<br>
> > require too many major changes themselves.)<br>
> ><br>
> > Speaking of tests, I gathered that this branch included tests, but I<br>
> > don't know to what extent there are automated end-to-end functional<br>
> > tests of the algorithm?<br>
> ><br>
> > From what I can gather, the approach seems broadly similar to the<br>
> > one I eventually settled on also. The major difference appears to be<br>
> > in how we merge two or more streams of execution (i.e. when one<br>
> > resource depends on two or more others). In my approach, the<br>
> > dependencies are stored in the resources and each joining of streams<br>
> > creates a database row to track it, which is easily locked with<br>
> > contention on the lock extending only to those resources which are<br>
> > direct dependencies of the one waiting. In this approach, both the<br>
> > dependencies and the progress through the graph are stored in a<br>
> > database table, necessitating (a) reading of the entire table (as it<br>
> > relates to the current stack) on every resource operation, and (b)<br>
> > locking of the entire table (which is hard) when marking a resource<br>
> > operation complete.<br>
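> ><br>
> > To make the contrast concrete, the "row per join" idea is roughly this<br>
> > (hypothetical schema and DB helpers, not actual code from the prototype):<br>
> ><br>
> > def notify_join(db, join_id, satisfied_dep):<br>
> >     # Lock only the single row tracking this join of streams, so<br>
> >     # contention extends only to the direct dependencies of the<br>
> >     # waiting resource.<br>
> >     with db.transaction():<br>
> >         row = db.get_for_update(JoinPoint, join_id)<br>
> >         row.satisfied.add(satisfied_dep)<br>
> >         if row.satisfied >= row.required:<br>
> >             trigger_resource(row.waiting_resource)<br>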
> ><br>
> > I chatted to Anant about this today and he mentioned that they had<br>
> > solved the locking problem by dispatching updates to a queue that is<br>
> > read by a single engine per stack.<br>
> ><br>
> > My approach also has the neat side-effects of pushing the data<br>
> > required to resolve get_resource and get_att (without having to<br>
> > reload the resources again and query them) as well as to update<br>
> > dependencies (e.g. because of a replacement or deletion) along with<br>
> > the flow of triggers. I don't know if anything similar is at work here.<br>
> ><br>
> > It's entirely possible that the best design might combine elements<br>
> > of both approaches.<br>
> ><br>
> > The same open questions I detailed under my proposal also apply to<br>
> > this one, if I understand correctly.<br>
> ><br>
> ><br>
> > I'm certain that I won't have represented everyone's work fairly<br>
> > here, so I encourage folks to dive in and correct any errors about<br>
> > theirs and ask any questions you might have about mine. (In case you<br>
> > have been living under a rock, note that I'll be out of the office<br>
> > for the rest of the week due to Thanksgiving so don't expect<br>
> > immediate replies.)<br>
> ><br>
> > I also think this would be a great time for the wider Heat community<br>
> > to dive in and start asking questions and suggesting ideas. We need<br>
> > to, ahem, converge on a shared understanding of the design so we can<br>
> > all get to work delivering it for Kilo.<br>
> ><br>
> ><br>
> ><br>
> > Agree, we need to get moving on this.<br>
> ><br>
> > -Angus<br>
> ><br>
> ><br>
> ><br>
> > cheers,<br>
> > Zane.<br>
> ><br>
> ><br>
><br>
> Thanks Zane for your e-mail and for summarizing everyone's work.<br>
><br>
> The design goals mentioned above look more like performance goals and<br>
> constraints to me. I understand that it is unacceptable to have a poorly<br>
> performing engine or broken Resource plug-ins. The convergence spec<br>
> clearly mentions that the existing Resource plugins should not be changed.<br>
><br>
> IMHO, and my team's HO, the design goals of convergence would be:<br>
> 1. Stability: No transient failures, either in OpenStack/external<br>
> services or in the resources themselves, should fail the stack. Therefore,<br>
> we need Observers to check for divergence and converge a resource if<br>
> needed, to bring it back to a stable state (see the sketch below).<br>
> 2. Resiliency: Heat engines should be able to take up tasks in case of<br>
> failures/restarts.<br>
> 3. Backward compatibility: "We don't break the user space." No existing<br>
> stacks should break.<br>
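><br>
> As an illustration of (1), an Observer would boil down to something like<br>
> this (all names here are made up, not existing Heat code):<br>
><br>
> def observe(resource):<br>
>     actual = check_real_world_state(resource)   # poll the backing service<br>
>     if actual != resource.desired_state:        # divergence detected<br>
>         schedule_converge(resource)             # bring it back in line<br>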
><br>
> We started the PoC with these goals in mind; any performance<br>
> optimization would be a plus point for us. Note that I am not neglecting<br>
> the performance goal, just that it should be next in the pipeline. The<br>
> questions we should ask ourselves are: are we storing enough data (the<br>
> state of the stack) in the DB to enable resiliency? Are we distributing<br>
> the load evenly across all Heat engines? Does our notification mechanism<br>
> provide us some form of guarantee or acknowledgement?<br>
><br>
> In retrospect, we had to struggle a lot to understand the existing<br>
> Heat engine. We couldn't have done it justice by just creating another<br>
> project on GitHub without any concrete understanding of the existing<br>
> state of affairs. We are not on the same page as the Heat core members;<br>
> we are novices and the cores are experts.<br>
><br>
> I am glad that we experimented with the Heat engine directly. The<br>
> current Heat engine is not resilient and the messaging also lacks<br>
> reliability. We (my team, and I guess the cores as well) understand that<br>
> async message passing would be the way to go, as synchronous RPC calls<br>
> simply wouldn't scale. But with async message passing there has to be<br>
> some mechanism of ACKing back, which I think is lacking in the current<br>
> infrastructure.<br>
><br>
> How could we provide stable user-defined stacks if the underlying Heat<br>
> core lacks stability? Convergence is all about stable stacks. To make the<br>
> current Heat core stable we need, at the least:<br>
> 1. Some mechanism to ACK back messages over AMQP, or some other solid<br>
> mechanism of message passing.<br>
> 2. Some mechanism for fault tolerance in the Heat engine, using external<br>
> tools/infrastructure like Celery/ZooKeeper. Without an external<br>
> infrastructure/tool we will end up bloating the Heat engine with a lot of<br>
> boilerplate code to achieve this. We had recommended Celery in our<br>
> previous e-mail (from Vishnu).<br>
><br>
> It was due to our experiments with the Heat engine for this PoC that we<br>
> could come up with the above recommendations.<br>
><br>
> State of our PoC<br>
> ----------------<br>
><br>
> On GitHub: <a href="https://github.com/anantpatil/heat-convergence-poc" target="_blank">https://github.com/anantpatil/heat-convergence-poc</a><br>
><br>
> Our current implementation of the PoC locks the stack after each<br>
> notification to mark the graph as traversed and produce the next level of<br>
> resources for convergence. We are facing challenges in<br>
> removing/minimizing these locks. We also have two different schools of<br>
> thought for solving this lock issue, as mentioned above in Vishnu's<br>
> e-mail. I will describe these in detail in the Wiki. There will be<br>
> different branches in our GitHub repo for these two approaches.<br>
><br>
<br>
</div></div>It would be helpful if you explained why you need to _lock_ the stack.<br>
MVCC in the database should be enough here. Basically you need to:<br>
<br>
begin transaction<br>
update traversal information<br>
select resolvable nodes<br>
{in code not sql -- send converge commands into async queue}<br>
commit<br>
<br>
Any failure inside this transaction should roll back the transaction and<br>
retry it. It is OK to have duplicate converge commands for a resource.<br>
<br>
This should be the single point of synchronization between workers that<br>
are resolving resources. Or perhaps this is the lock you meant? Either<br>
way, this isn't avoidable if you want to make sure everything is attempted<br>
at least once without having to continuously poll and re-poll the stack<br>
to look for unresolved resources. That is an option, but not one that I<br>
think is going to be as simple as the transactional method.<br>
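<br>
As a minimal sketch of that pattern, assuming a SQLAlchemy-style session<br>
whose begin() context manager commits on success and rolls back on error<br>
(the other helpers and the exception type are placeholders):<br>
<br>
def mark_done_and_dispatch(session, stack_id, resource_id):<br>
    while True:<br>
        try:<br>
            with session.begin():<br>
                # update traversal information<br>
                mark_traversed(session, stack_id, resource_id)<br>
                # select resolvable nodes (dependencies all complete)<br>
                ready = select_resolvable(session, stack_id)<br>
                # in code, not SQL: send converge commands into the<br>
                # async queue; duplicates are harmless since converge<br>
                # is expected to be idempotent<br>
                for node in ready:<br>
                    cast_converge(stack_id, node)<br>
            return  # commit happened when the 'with' block exited<br>
        except RetryableDBError:  # placeholder for deadlock/conflict errors<br>
            continue  # transaction was rolled back; retry the whole thing<br>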
<div class="HOEnZb"><div class="h5"><br>
_______________________________________________<br>
OpenStack-dev mailing list<br>
<a href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a><br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
</div></div></blockquote></div><br></div>