<div dir="ltr"><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Dec 2, 2014 at 1:49 AM, Clint Byrum <span dir="ltr"><<a href="mailto:clint@fewbar.com" target="_blank">clint@fewbar.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Excerpts from Anant Patil's message of 2014-11-30 23:02:29 -0800:<br>
<div><div class="h5">> On 27-Nov-14 18:03, Murugan, Visnusaran wrote:<br>
> > Hi Zane,<br>
> ><br>
> ><br>
> ><br>
> > At this stage our implementation (as mentioned in the wiki<br>
> > <<a href="https://wiki.openstack.org/wiki/Heat/ConvergenceDesign" target="_blank">https://wiki.openstack.org/wiki/Heat/ConvergenceDesign</a>>) achieves your<br>
> > design goals.<br>
> ><br>
> ><br>
> ><br>
> > 1. In the case of a parallel update, our implementation adjusts the<br>
> > graph according to the new template and waits for dispatched resource<br>
> > tasks to complete.<br>
> ><br>
> > 2. Reasons for basing our PoC on the Heat code:<br>
> ><br>
> > a. To solve the contention that arises when all dependent resources<br>
> > process the parent resource in parallel.<br>
> ><br>
> > b. To avoid porting issues from the PoC to the Heat code base (just<br>
> > to be aware of potential issues ASAP).<br>
> ><br>
> > 3. Resource timeouts would be helpful, but I guess they are resource<br>
> > specific and have to come from the template, with default values from<br>
> > plugins.<br>
> ><br>
> > 4. We see aggregating resource notifications and processing the next<br>
> > level of resources without contention and with minimal DB usage as the<br>
> > problem area. We are working on the following approaches in *parallel*.<br>
> ><br>
> > a. Use a queue per stack to serialize notifications.<br>
> ><br>
> > b. Get the parent ProcessLog (ResourceID, EngineID) and initiate<br>
> > convergence upon the first child notification. Subsequent children that<br>
> > fail to get the parent resource lock will directly send a message to the<br>
> > waiting parent task (topic=stack_id.parent_resource_id).<br>
> ><br>
> > Based on performance/feedback we can select either one, or a combination<br>
> > of the two.<br>
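> ><br>
> > A rough sketch of option (b), with purely illustrative helper names<br>
> > (none of these are existing Heat APIs):<br>
> ><br>
> > def notify_parent(context, stack_id, parent_id, child_id, engine_id):<br>
> >     # First child to acquire the parent's ProcessLog (ResourceID,<br>
> >     # EngineID) lock kicks off the parent's convergence.<br>
> >     if try_acquire_process_log(context, parent_id, engine_id):<br>
> >         start_converge(context, stack_id, parent_id)<br>
> >     else:<br>
> >         # Parent task is already running/waiting elsewhere; message<br>
> >         # it directly on its per-resource topic.<br>
> >         topic = "%s.%s" % (stack_id, parent_id)<br>
> >         rpc_cast(topic, 'child_complete', child_id=child_id)<br>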
> ><br>
> ><br>
> ><br>
> > Advantages:<br>
> ><br>
> > 1. Failed Resource tasks can be re-initiated after ProcessLog<br>
> > table lookup.<br>
> ><br>
> > 2. One worker == one resource.<br>
> ><br>
> > 3. Supports concurrent updates<br>
> ><br>
> > 4. Delete == update with empty stack<br>
> ><br>
> > 5. Rollback == update to the previous known good/completed stack.<br>
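> ><br>
> > For example, (4) and (5) reduce to something like this (hypothetical<br>
> > function names, just to illustrate the equivalence):<br>
> ><br>
> > def delete_stack(stack):<br>
> >     return update_stack(stack, empty_template())<br>
> ><br>
> > def rollback_stack(stack):<br>
> >     return update_stack(stack, stack.last_completed_template)<br>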
> ><br>
> ><br>
> ><br>
> > Disadvantages:<br>
> ><br>
> > 1. Still holds stackLock (WIP to remove with ProcessLog)<br>
> ><br>
> ><br>
> ><br>
> > We completely understand your concern about reviewing our code, since the<br>
> > commits are numerous and there are changes of course in places. Our<br>
> > starting commit is [c1b3eb22f7ab6ea60b095f88982247dd249139bf], though<br>
> > this might not help :)<br>
> ><br>
> ><br>
> ><br>
> > Your Thoughts.<br>
> ><br>
> ><br>
> ><br>
> > Happy Thanksgiving.<br>
> ><br>
> > Vishnu.<br>
> ><br>
> ><br>
> ><br>
> > *From:*Angus Salkeld [mailto:<a href="mailto:asalkeld@mirantis.com">asalkeld@mirantis.com</a>]<br>
> > *Sent:* Thursday, November 27, 2014 9:46 AM<br>
> > *To:* OpenStack Development Mailing List (not for usage questions)<br>
> > *Subject:* Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown<br>
> ><br>
> ><br>
> ><br>
> > On Thu, Nov 27, 2014 at 12:20 PM, Zane Bitter <<a href="mailto:zbitter@redhat.com">zbitter@redhat.com</a><br>
> > <mailto:<a href="mailto:zbitter@redhat.com">zbitter@redhat.com</a>>> wrote:<br>
> ><br>
> > A bunch of us have spent the last few weeks working independently on<br>
> > proof of concept designs for the convergence architecture. I think<br>
> > those efforts have now reached a sufficient level of maturity that<br>
> > we should start working together on synthesising them into a plan<br>
> > that everyone can forge ahead with. As a starting point I'm going to<br>
> > summarise my take on the three efforts; hopefully the authors of the<br>
> > other two will weigh in to give us their perspective.<br>
> ><br>
> ><br>
> > Zane's Proposal<br>
> > ===============<br>
> ><br>
> > <a href="https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph" target="_blank">https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph</a><br>
> ><br>
> > I implemented this as a simulator of the algorithm rather than using<br>
> > the Heat codebase itself in order to be able to iterate rapidly on<br>
> > the design, and indeed I have changed my mind many, many times in<br>
> > the process of implementing it. Its notable departure from a<br>
> > realistic simulation is that it runs only one operation at a time -<br>
> > essentially giving up the ability to detect race conditions in<br>
> > exchange for a completely deterministic test framework. You just<br>
> > have to imagine where the locks need to be. Incidentally, the test<br>
> > framework is designed so that it can easily be ported to the actual<br>
> > Heat code base as functional tests so that the same scenarios could<br>
> > be used without modification, allowing us to have confidence that<br>
> > the eventual implementation is a faithful replication of the<br>
> > simulation (which can be rapidly experimented on, adjusted and<br>
> > tested when we inevitably run into implementation issues).<br>
> ><br>
> > This is a complete implementation of Phase 1 (i.e. using existing<br>
> > resource plugins), including update-during-update, resource<br>
> > clean-up, replace on update and rollback; with tests.<br>
> ><br>
> > Some of the design goals which were successfully incorporated:<br>
> > - Minimise changes to Heat (it's essentially a distributed version<br>
> > of the existing algorithm), and in particular to the database<br>
> > - Work with the existing plugin API<br>
> > - Limit total DB access for Resource/Stack to O(n) in the number of<br>
> > resources<br>
> > - Limit overall DB access to O(m) in the number of edges<br>
> > - Limit lock contention to only those operations actually contending<br>
> > (i.e. no global locks)<br>
> > - Each worker task deals with only one resource<br>
> > - Only read resource attributes once<br>
> ><br>
> ><br>
> > Open questions:<br>
> > - What do we do when we encounter a resource that is in progress<br>
> > from a previous update while doing a subsequent update? Obviously we<br>
> > don't want to interrupt it, as it will likely be left in an unknown<br>
> > state. Making a replacement is one obvious answer, but in many cases<br>
> > there could be serious down-sides to that. How long should we wait<br>
> > before trying it? What if it's still in progress because the engine<br>
> > processing the resource already died?<br>
> ><br>
> ><br>
> ><br>
> > Also, how do we implement resource level timeouts in general?<br>
> ><br>
> ><br>
> ><br>
> ><br>
> > Michał's Proposal<br>
> > =================<br>
> ><br>
> > <a href="https://github.com/inc0/heat-convergence-prototype/tree/iterative" target="_blank">https://github.com/inc0/heat-convergence-prototype/tree/iterative</a><br>
> ><br>
> > Note that a version modified by me to use the same test scenario<br>
> > format (but not the same scenarios) is here:<br>
> ><br>
> > <a href="https://github.com/zaneb/heat-convergence-prototype/tree/iterative-adapted" target="_blank">https://github.com/zaneb/heat-convergence-prototype/tree/iterative-adapted</a><br>
> ><br>
> > This is based on my simulation framework after a fashion, but with<br>
> > everything implemented synchronously and a lot of handwaving about<br>
> > how the actual implementation could be distributed. The central<br>
> > premise is that at each step of the algorithm, the entire graph is<br>
> > examined for tasks that can be performed next, and those are then<br>
> > started. Once all are complete (it's synchronous, remember), the<br>
> > next step is run. Keen observers will be asking how we know when it<br>
> > is time to run the next step in a distributed version of this<br>
> > algorithm, where it will be run and what to do about resources that<br>
> > are in an intermediate state at that time. All of these questions<br>
> > remain unanswered.<br>
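> ><br>
> > As a sketch, the synchronous core amounts to something like this<br>
> > (illustrative only; the names don't map onto the actual prototype):<br>
> ><br>
> > def converge(graph, done):<br>
> >     # graph maps each node to the set of nodes it depends on;<br>
> >     # done is the set of nodes already completed<br>
> >     while len(done) < len(graph):<br>
> >         ready = [n for n, deps in graph.items()<br>
> >                  if n not in done and deps <= done]<br>
> >         if not ready:<br>
> >             break  # nothing runnable (cycle or failed dependency)<br>
> >         for node in ready:<br>
> >             run_task(node)   # synchronous: blocks until complete<br>
> >         done.update(ready)   # only then is the next step examined<br>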
> ><br>
> ><br>
> ><br>
> > Yes, I was struggling to figure out how it could manage an IN_PROGRESS<br>
> > state as it's stateless. So you end up treading on the other action's toes.<br>
> ><br>
> > Assuming we use the resource's state (IN_PROGRESS) you could get around<br>
> > that. Then you kick off a converge whenever an action completes (if<br>
> > there is nothing new to be done then do nothing).<br>
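> ><br>
> > i.e. something like (names made up):<br>
> ><br>
> > def on_action_complete(stack, resource):<br>
> >     resource.set_state(COMPLETE)       # clear IN_PROGRESS<br>
> >     if stack.has_unconverged_work():   # anything newly unblocked?<br>
> >         schedule_converge(stack)       # otherwise, do nothing<br>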
> ><br>
> ><br>
> ><br>
> ><br>
> > A non-exhaustive list of concerns I have:<br>
> > - Replace on update is not implemented yet<br>
> > - AFAIK rollback is not implemented yet<br>
> > - The simulation doesn't actually implement the proposed architecture<br>
> > - This approach is punishingly heavy on the database - O(n^2) or worse<br>
> ><br>
> ><br>
> ><br>
> > Yes, re-reading the state of all resources whenever a new converge is<br>
> > run is worrying, but I think Michal had some ideas to minimize this.<br>
> ><br>
> ><br>
> ><br>
> > - A lot of phase 2 is mixed in with phase 1 here, making it<br>
> > difficult to evaluate which changes need to be made first and<br>
> > whether this approach works with existing plugins<br>
> > - The code is not really based on how Heat works at the moment, so<br>
> > there would be either a major redesign required or lots of radical<br>
> > changes in Heat or both<br>
> ><br>
> > I think there's a fair chance that given another 3-4 weeks to work<br>
> > on this, all of these issues and others could probably be resolved.<br>
> > The question for me at this point is not so much "if" but "why".<br>
> ><br>
> > Michał believes that this approach will make Phase 2 easier to<br>
> > implement, which is a valid reason to consider it. However, I'm not<br>
> > aware of any particular issues that my approach would cause in<br>
> > implementing phase 2 (note that I have barely looked into it at all<br>
> > though). In fact, I very much want Phase 2 to be entirely<br>
> > encapsulated by the Resource class, so that the plugin type (legacy<br>
> > vs. convergence-enabled) is transparent to the rest of the system.<br>
> > Only in this way can we be sure that we'll be able to maintain<br>
> > support for legacy plugins. So a phase 1 that mixes in aspects of<br>
> > phase 2 is actually a bad thing in my view.<br>
> ><br>
> > I really appreciate the effort that has gone into this already, but<br>
> > in the absence of specific problems with building phase 2 on top of<br>
> > another approach that are solved by this one, I'm ready to call this<br>
> > a distraction.<br>
> ><br>
> ><br>
> ><br>
> > In its defence, I like the simplicity of it. The concepts and code are<br>
> > easy to understand - though part of this is that it doesn't implement all<br>
> > the stuff on your list yet.<br>
> ><br>
> ><br>
> ><br>
> ><br>
> ><br>
> > Anant & Friends' Proposal<br>
> > =========================<br>
> ><br>
> > First off, I have found this very difficult to review properly since<br>
> > the code is not separate from the huge mass of Heat code and nor is<br>
> > the commit history in the form that patch submissions would take<br>
> > (but rather includes backtracking and iteration on the design). As a<br>
> > result, most of the information here has been gleaned from<br>
> > discussions about the code rather than direct review. I have<br>
> > repeatedly suggested that this proof of concept work should be done<br>
> > using the simulator framework instead, unfortunately so far to no avail.<br>
> ><br>
> > The last we heard on the mailing list about this, resource clean-up<br>
> > had not yet been implemented. That was a major concern because that<br>
> > is the more difficult half of the algorithm. Since then there have<br>
> > been a lot more commits, but it's not yet clear whether resource<br>
> > clean-up, update-during-update, replace-on-update and rollback have<br>
> > been implemented, though it is clear that at least some progress has<br>
> > been made on most or all of them. Perhaps someone can give us an update.<br>
> ><br>
> ><br>
> > <a href="https://github.com/anantpatil/heat-convergence-poc" target="_blank">https://github.com/anantpatil/heat-convergence-poc</a><br>
> ><br>
> ><br>
> ><br>
> > AIUI this code also mixes phase 2 with phase 1, which is a concern.<br>
> > For me the highest priority for phase 1 is to be sure that it works<br>
> > with existing plugins. Not only because we need to continue to<br>
> > support them, but because converting all of our existing<br>
> > 'integration-y' unit tests to functional tests that operate in a<br>
> > distributed system is virtually impossible in the time frame we have<br>
> > available. So the existing test code needs to stick around, and the<br>
> > existing stack create/update/delete mechanisms need to remain in<br>
> > place until such time as we have equivalent functional test coverage<br>
> > to begin eliminating existing unit tests. (We'll also, of course,<br>
> > need to have unit tests for the individual elements of the new<br>
> > distributed workflow, functional tests to confirm that the<br>
> > distributed workflow works in principle as a whole - the scenarios<br>
> > from the simulator can help with _part_ of this - and, not least, an<br>
> > algorithm that is as similar as possible to the current one so that<br>
> > our existing tests remain at least somewhat representative and don't<br>
> > require too many major changes themselves.)<br>
> ><br>
> > Speaking of tests, I gathered that this branch included tests, but I<br>
> > don't know to what extent there are automated end-to-end functional<br>
> > tests of the algorithm?<br>
> ><br>
> > From what I can gather, the approach seems broadly similar to the<br>
> > one I eventually settled on also. The major difference appears to be<br>
> > in how we merge two or more streams of execution (i.e. when one<br>
> > resource depends on two or more others). In my approach, the<br>
> > dependencies are stored in the resources and each joining of streams<br>
> > creates a database row to track it, which is easily locked with<br>
> > contention on the lock extending only to those resources which are<br>
> > direct dependencies of the one waiting. In this approach, both the<br>
> > dependencies and the progress through the graph are stored in a<br>
> > database table, necessitating (a) reading of the entire table (as it<br>
> > relates to the current stack) on every resource operation, and (b)<br>
> > locking of the entire table (which is hard) when marking a resource<br>
> > operation complete.<br>
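> ><br>
> > To make the contrast concrete, the "row per join" idea is roughly this<br>
> > (hypothetical schema and DB helpers, not actual code from the prototype):<br>
> ><br>
> > def notify_join(db, join_id, satisfied_dep):<br>
> >     # Lock only the single row tracking this join of streams, so<br>
> >     # contention extends only to the direct dependencies of the<br>
> >     # waiting resource.<br>
> >     with db.transaction():<br>
> >         row = db.get_for_update(JoinPoint, join_id)<br>
> >         row.satisfied.add(satisfied_dep)<br>
> >         if row.satisfied >= row.required:<br>
> >             trigger_resource(row.waiting_resource)<br>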
> ><br>
> > I chatted to Anant about this today and he mentioned that they had<br>
> > solved the locking problem by dispatching updates to a queue that is<br>
> > read by a single engine per stack.<br>
> ><br>
> > My approach also has the neat side-effects of pushing the data<br>
> > required to resolve get_resource and get_att (without having to<br>
> > reload the resources again and query them) as well as to update<br>
> > dependencies (e.g. because of a replacement or deletion) along with<br>
> > the flow of triggers. I don't know if anything similar is at work here.<br>
> ><br>
> > It's entirely possible that the best design might combine elements<br>
> > of both approaches.<br>
> ><br>
> > The same open questions I detailed under my proposal also apply to<br>
> > this one, if I understand correctly.<br>
> ><br>
> ><br>
> > I'm certain that I won't have represented everyone's work fairly<br>
> > here, so I encourage folks to dive in and correct any errors about<br>
> > theirs and ask any questions you might have about mine. (In case you<br>
> > have been living under a rock, note that I'll be out of the office<br>
> > for the rest of the week due to Thanksgiving so don't expect<br>
> > immediate replies.)<br>
> ><br>
> > I also think this would be a great time for the wider Heat community<br>
> > to dive in and start asking questions and suggesting ideas. We need<br>
> > to, ahem, converge on a shared understanding of the design so we can<br>
> > all get to work delivering it for Kilo.<br>
> ><br>
> ><br>
> ><br>
> > Agree, we need to get moving on this.<br>
> ><br>
> > -Angus<br>
> ><br>
> ><br>
> ><br>
> > cheers,<br>
> > Zane.<br>
> ><br>
> ><br>
><br>
> Thanks Zane for your e-mail and for summarizing everyone's work.<br>
><br>
> The design goals mentioned above look more like performance goals and<br>
> constraints to me. I understand that it is unacceptable to have a poorly<br>
> performing engine or broken Resource plug-ins. The convergence spec<br>
> clearly mentions that the existing Resource plugins should not be changed.<br>
><br>
> IMHO, and my team's HO, the design goals of convergence would be:<br>
> 1. Stability: No transient failures, either in OpenStack/external<br>
> services or in the resources themselves, should fail the stack. Therefore,<br>
> we need Observers to check for divergence and converge a resource if<br>
> needed, to bring it back to a stable state (see the sketch below).<br>
> 2. Resiliency: Heat engines should be able to take up tasks in case of<br>
> failures/restarts.<br>
> 3. Backward compatibility: "We don't break the user space." No existing<br>
> stacks should break.<br>
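><br>
> As an illustration of (1), an Observer would boil down to something like<br>
> this (all names here are made up, not existing Heat code):<br>
><br>
> def observe(resource):<br>
>     actual = check_real_world_state(resource)   # poll the backing service<br>
>     if actual != resource.desired_state:        # divergence detected<br>
>         schedule_converge(resource)             # bring it back in line<br>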
><br>
> We started the PoC with these goals in mind; any performance<br>
> optimization would be a plus point for us. Note that I am not neglecting<br>
> the performance goal, just that it should be next in the pipeline. The<br>
> questions we should ask ourselves are: are we storing enough data (the<br>
> state of the stack) in the DB to enable resiliency? Are we distributing<br>
> the load evenly across all Heat engines? Does our notification mechanism<br>
> provide us some form of guarantee or acknowledgement?<br>
><br>
> In retrospect, we had to struggle a lot to understand the existing<br>
> Heat engine. We couldn't have done it justice by just creating another<br>
> project on GitHub without any concrete understanding of the existing<br>
> state of affairs. We are not on the same page as the Heat core members;<br>
> we are novices and the cores are experts.<br>
><br>
> I am glad that we experimented with the Heat engine directly. The<br>
> current Heat engine is not resilient and the messaging also lacks<br>
> reliability. We (my team, and I guess the cores as well) understand that<br>
> async message passing would be the way to go, as synchronous RPC calls<br>
> simply wouldn't scale. But with async message passing there has to be<br>
> some mechanism of ACKing back, which I think is lacking in the current<br>
> infrastructure.<br>
><br>
> How could we provide stable user-defined stacks if the underlying Heat<br>
> core lacks stability? Convergence is all about stable stacks. To make the<br>
> current Heat core stable we need, at the least:<br>
> 1. Some mechanism to ACK back messages over AMQP, or some other solid<br>
> mechanism of message passing.<br>
> 2. Some mechanism for fault tolerance in the Heat engine, using external<br>
> tools/infrastructure like Celery/ZooKeeper. Without an external<br>
> infrastructure/tool we will end up bloating the Heat engine with a lot of<br>
> boilerplate code to achieve this. We had recommended Celery in our<br>
> previous e-mail (from Vishnu).<br>
><br>
> It was due to our experiments with the Heat engine for this PoC that we<br>
> could come up with the above recommendations.<br>
><br>
> State of our PoC<br>
> ----------------<br>
><br>
> On GitHub: <a href="https://github.com/anantpatil/heat-convergence-poc" target="_blank">https://github.com/anantpatil/heat-convergence-poc</a><br>
><br>
> Our current implementation of the PoC locks the stack after each<br>
> notification to mark the graph as traversed and produce the next level of<br>
> resources for convergence. We are facing challenges in<br>
> removing/minimizing these locks. We also have two different schools of<br>
> thought for solving this lock issue, as mentioned above in Vishnu's<br>
> e-mail. I will describe these in detail in the Wiki. There will be<br>
> different branches in our GitHub repo for these two approaches.<br>
><br>
<br>
</div></div>It would be helpful if you explained why you need to _lock_ the stack.<br>
MVCC in the database should be enough here. Basically you need to:<br>
<br>
begin transaction<br>
update traversal information<br>
select resolvable nodes<br>
{in code not sql -- send converge commands into async queue}<br>
commit<br>
<br>
Any failure inside this transaction should roll back the transaction and<br>
retry it. It is OK to have duplicate converge commands for a resource.<br>
<br>
This should be the single point of synchronization between workers that<br>
are resolving resources. Or perhaps this is the lock you meant? Either<br>
way, this isn't avoidable if you want to make sure everything is attempted<br>
at least once without having to continuously poll and re-poll the stack<br>
to look for unresolved resources. That is an option, but not one that I<br>
think is going to be as simple as the transactional method.<br>
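<br>
As a minimal sketch of that pattern, assuming a SQLAlchemy-style session<br>
whose begin() context manager commits on success and rolls back on error<br>
(the other helpers and the exception type are placeholders):<br>
<br>
def mark_done_and_dispatch(session, stack_id, resource_id):<br>
    while True:<br>
        try:<br>
            with session.begin():<br>
                # update traversal information<br>
                mark_traversed(session, stack_id, resource_id)<br>
                # select resolvable nodes (dependencies all complete)<br>
                ready = select_resolvable(session, stack_id)<br>
                # in code, not SQL: send converge commands into the<br>
                # async queue; duplicates are harmless since converge<br>
                # is expected to be idempotent<br>
                for node in ready:<br>
                    cast_converge(stack_id, node)<br>
            return  # commit happened when the 'with' block exited<br>
        except RetryableDBError:  # placeholder for deadlock/conflict errors<br>
            continue  # transaction was rolled back; retry the whole thing<br>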
<div class="HOEnZb"><div class="h5"><br>
_______________________________________________<br>
OpenStack-dev mailing list<br>
<a href="mailto:OpenStack-dev@lists.openstack.org">OpenStack-dev@lists.openstack.org</a><br>
<a href="http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev" target="_blank">http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev</a><br>
</div></div></blockquote></div><br></div>