<div dir="ltr"><div>When the stack fails, it is marked as FAILED and all the sync points</div><div>that are needed to trigger the next set of resources are deleted. The</div><div>resources at same level in the graph, like here, they are suppose to</div><div>timeout or fail for an exception. Many DB hits means that the cache</div><div>data we were maintaining is not being used in the way we intended.</div><div><br></div><div>I don't see if really need 1; if it works with legacy w/o putting any such</div><div>constraints, it should work with convergence as well.</div><div><br></div><div>--</div><div>Anant</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Sep 2, 2015 at 5:23 AM, Angus Salkeld <span dir="ltr"><<a href="mailto:asalkeld@mirantis.com" target="_blank">asalkeld@mirantis.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div><div class="h5"><div dir="ltr">On Tue, Sep 1, 2015 at 10:45 PM Steven Hardy <<a href="mailto:shardy@redhat.com" target="_blank">shardy@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Fri, Aug 28, 2015 at 01:35:52AM +0000, Angus Salkeld wrote:<br>
> Hi<br>
> I have been running some rally tests against convergence and our existing<br>
> implementation to compare.<br>
> So far I have done the following:<br>
> 1. defined a template with a resource<br>
> group: <a href="https://github.com/asalkeld/convergence-rally/blob/master/templates/resource_group_test_resource.yaml.template" rel="noreferrer" target="_blank">https://github.com/asalkeld/convergence-rally/blob/master/templates/resource_group_test_resource.yaml.template</a> (a rough sketch of its shape follows this list)<br>
> 2. the inner resource looks like<br>
> this: <a href="https://github.com/asalkeld/convergence-rally/blob/master/templates/server_with_volume.yaml.template" rel="noreferrer" target="_blank">https://github.com/asalkeld/convergence-rally/blob/master/templates/server_with_volume.yaml.template</a> (it<br>
> uses TestResource to attempt to be a reasonable simulation of a<br>
> server+volume+floatingip)<br>
> 3. defined a rally<br>
> job: <a href="https://github.com/asalkeld/convergence-rally/blob/master/increasing_resources.yaml" rel="noreferrer" target="_blank">https://github.com/asalkeld/convergence-rally/blob/master/increasing_resources.yaml</a> that<br>
> creates X resources then updates to X*2 then deletes.<br>
> 4. I then ran the above with/without convergence and with 2,4,8<br>
> heat-engines<br>
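For context, a template of that shape looks roughly like this; it is only a hand-written sketch (the parameter name and default count here are made up), and the exact files are at the links above:

    heat_template_version: 2014-10-16

    parameters:
      count:
        type: number
        default: 40

    resources:
      group:
        type: OS::Heat::ResourceGroup
        properties:
          count: {get_param: count}
          resource_def:
            # the real test nests the server_with_volume template here
            type: OS::Heat::TestResource
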
> Here are the results compared:<br>
> <a href="https://docs.google.com/spreadsheets/d/12kRtPsmZBl_y78aw684PTBg3op1ftUYsAEqXBtT800A/edit?usp=sharing" rel="noreferrer" target="_blank">https://docs.google.com/spreadsheets/d/12kRtPsmZBl_y78aw684PTBg3op1ftUYsAEqXBtT800A/edit?usp=sharing</a><br>
> Some notes on the results so far:<br>
> * convergence with only 2 engines does suffer from RPC overload (it<br>
> gets message timeouts on larger templates). I wonder if this is the<br>
> problem in our convergence gate...<br>
> * convergence does very well with a reasonable number of engines<br>
> running.<br>
> * delete is slightly slower on convergence<br>
> Still to test:<br>
> * the above, but measure memory usage<br>
> * many small templates (run concurrently)<br>
<br>
So, I tried running my many-small-templates here with convergence enabled:<br>
<br>
<a href="https://bugs.launchpad.net/heat/+bug/1489548" rel="noreferrer" target="_blank">https://bugs.launchpad.net/heat/+bug/1489548</a><br>
<br>
In heat.conf I set:<br>
<br>
max_resources_per_stack = -1<br>
convergence_engine = true<br>
<br>
Most other settings (particularly RPC and DB settings) are defaults.<br>
<br>
Without convergence (but with max_resources_per_stack disabled) I see the<br>
time to create a ResourceGroup of 400 nested stacks (each containing one<br>
RandomString resource) is about 2.5 minutes (core i7 laptop w/SSD, 4 heat<br>
workers, i.e. the default for a 4-core machine).<br>
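As an aside, that worker count should correspond to num_engine_workers in heat.conf (assuming the usual option name on this branch); shown only as a pointer:

    # heat.conf [DEFAULT]; mirrors the 4-worker default mentioned above
    num_engine_workers = 4
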
<br>
With convergence enabled, I see these errors from sqlalchemy:<br>
<br>
File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 652, in<br>
_checkout\n fairy = _ConnectionRecord.checkout(pool)\n', u' File<br>
"/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 444, in<br>
checkout\n rec = pool._do_get()\n', u' File<br>
"/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 980, in<br>
_do_get\n (self.size(), self.overflow(), self._timeout))\n',<br>
u'TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection<br>
timed out, timeout 30\n'].<br>
<br>
I assume this means we're loading the DB much more in the convergence case<br>
and overflowing the QueuePool?<br></blockquote><div><br></div></div></div><div>Yeah, looks like it.</div><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
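One knob worth checking, purely as a suggestion and assuming the standard oslo.db options are what heat.conf exposes here, is the connection pool sizing in the [database] section; the numbers in the error above (size 5, overflow 10, timeout 30) are the defaults in play:

    [database]
    # larger pool/overflow/timeout than the defaults reported in the error;
    # raising them may only mask the extra load, but it can confirm whether
    # pool exhaustion is the bottleneck
    max_pool_size = 20
    max_overflow = 40
    pool_timeout = 60
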
<br>
This seems to happen when the RPC call from the ResourceGroup tries to<br>
create some of the 400 nested stacks.<br>
<br>
Interestingly after this error, the parent stack moves to CREATE_FAILED,<br>
but the engine remains (very) busy, to the point of being only partially<br>
responsive, so it looks like maybe the cancel-on-fail isn't working (I'm<br>
assuming it isn't error_wait_time because the parent stack has been marked<br>
FAILED and I'm pretty sure it's been more than 240s).<br>
<br>
I'll dig a bit deeper when I get time, but for now you might like to try<br>
the stress test too. It's a bit of a synthetic test, but it turns out to<br>
be a reasonable proxy for some performance issues we observed when creating<br>
large-ish TripleO deployments (which also create a large number of nested<br>
stacks concurrently).<br></blockquote><div><br></div></span><div>Thanks a lot for testing, Steve! I'll file two bugs for what you have raised:</div><div>1. limit the number of resource actions running in parallel (maybe based on the number of cores; see the sketch below)</div><div>2. the cancel-on-fail error</div><span class="HOEnZb"><font color="#888888"><div><br></div><div>-Angus</div></font></span><span class=""><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
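For (1), a minimal sketch of the general idea (not Heat's actual code; check_resource here is just a made-up stand-in) would be to bound the number of in-flight resource actions with a fixed-size green thread pool, e.g. sized to the CPU count:

    # Hypothetical sketch: cap the number of concurrent resource actions.
    import multiprocessing

    import eventlet
    eventlet.monkey_patch()


    def check_resource(resource_id):
        # stand-in for the per-resource convergence work (RPC + DB calls)
        eventlet.sleep(0.1)
        return resource_id


    def converge(resource_ids, limit=None):
        # GreenPool blocks further spawns once 'limit' greenthreads are
        # active, so at most 'limit' resource actions run in parallel
        pool = eventlet.GreenPool(size=limit or multiprocessing.cpu_count())
        return list(pool.imap(check_resource, resource_ids))


    print(converge(range(400)))
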
<br>
Steve<br>
<br>
</blockquote></span></div></div>
<br></blockquote></div><br></div>