[openstack-dev] [Heat] convergence rally test results (so far)

Angus Salkeld asalkeld at mirantis.com
Tue Sep 1 23:53:04 UTC 2015


On Tue, Sep 1, 2015 at 10:45 PM Steven Hardy <shardy at redhat.com> wrote:

> On Fri, Aug 28, 2015 at 01:35:52AM +0000, Angus Salkeld wrote:
> >    Hi
> >    I have been running some rally tests against convergence and our
> >    existing implementation to compare.
> >    So far I have done the following:
> >     1. defined a template with a resource group:
> >        https://github.com/asalkeld/convergence-rally/blob/master/templates/resource_group_test_resource.yaml.template
> >     2. the inner resource looks like this:
> >        https://github.com/asalkeld/convergence-rally/blob/master/templates/server_with_volume.yaml.template
> >        (it uses TestResource to attempt to be a reasonable simulation of a
> >        server+volume+floatingip)
> >     3. defined a rally job:
> >        https://github.com/asalkeld/convergence-rally/blob/master/increasing_resources.yaml
> >        that creates X resources, then updates to X*2, then deletes.
> >     4. I then ran the above with/without convergence and with 2,4,8
> >        heat-engines
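
For anyone reading along without opening the links, the inner template from
step 2 is roughly the following shape. This is a simplified sketch with
made-up wait times, not the exact linked file:

heat_template_version: 2014-10-16
resources:
  server:
    type: OS::Heat::TestResource
    properties:
      value: fake-server
      wait_secs: 10
  volume:
    type: OS::Heat::TestResource
    depends_on: server
    properties:
      value: fake-volume
      wait_secs: 5
  floating_ip:
    type: OS::Heat::TestResource
    depends_on: server
    properties:
      value: fake-floating-ip
      wait_secs: 2
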
> >    Here are the results compared:
> >    https://docs.google.com/spreadsheets/d/12kRtPsmZBl_y78aw684PTBg3op1ftUYsAEqXBtT800A/edit?usp=sharing
> >    Some notes on the results so far:
> >      * convergence with only 2 engines does suffer from RPC overload (it
> >        gets message timeouts on larger templates). I wonder if this is the
> >        problem in our convergence gate...
> >      * convergence does very well with a reasonable number of engines
> >        running.
> >      * delete is slightly slower on convergence
> >    Still to test:
> >      * the above, but measure memory usage
> >      * many small templates (run concurrently)
>
> So, I tried running my many-small-templates test from here with convergence enabled:
>
> https://bugs.launchpad.net/heat/+bug/1489548
>
> In heat.conf I set:
>
> max_resources_per_stack = -1
> convergence_engine = true
>
> Most other settings (particularly RPC and DB settings) are defaults.
>
> Without convergence (but with max_resources_per_stack disabled) I see the
> time to create a ResourceGroup of 400 nested stacks (each containing one
> RandomString resource) is about 2.5 minutes (core i7 laptop w/SSD, 4 heat
> workers, i.e. the default for a 4-core machine).
>
> With convergence enabled, I see these errors from sqlalchemy:
>
> File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 652, in
> _checkout\n    fairy = _ConnectionRecord.checkout(pool)\n', u'  File
> "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 444, in
> checkout\n    rec = pool._do_get()\n', u'  File
> "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 980, in
> _do_get\n    (self.size(), self.overflow(), self._timeout))\n',
> u'TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection
> timed out, timeout 30\n'].
>
> I assume this means we're loading the DB much more in the convergence case
> and overflowing the QueuePool?
>

Yeah, looks like it.
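
That error matches the stock SQLAlchemy pool defaults (pool size 5, max
overflow 10, 30s timeout). As a quick way to confirm it really is just
connection pool exhaustion (a workaround, not a fix), you could bump the pool
in heat.conf, something like the below; the option names assume the standard
oslo.db [database] settings:

[database]
max_pool_size = 30
max_overflow = 60
pool_timeout = 60

Longer term the number of concurrent resource actions probably needs capping,
which is the first bug I mention below.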


>
> This seems to happen when the RPC call from the ResourceGroup tries to
> create some of the 400 nested stacks.
>
> Interestingly, after this error the parent stack moves to CREATE_FAILED,
> but the engine remains (very) busy, to the point of being only partially
> responsive, so it looks like maybe the cancel-on-fail isn't working (I'm
> assuming it isn't error_wait_time, because the parent stack has been marked
> FAILED and I'm pretty sure it's been more than 240s).
>
> I'll dig a bit deeper when I get time, but for now you might like to try
> the stress test too.  It's a bit of a synthetic test, but it turns out to
> be a reasonable proxy for some performance issues we observed when creating
> large-ish TripleO deployments (which also create a large number of nested
> stacks concurrently).
>

Thanks a lot for testing, Steve! I'll file two bugs for what you have raised:
1. limit the number of resource actions running in parallel, maybe based on
   the number of cores (see the rough sketch below)
2. the cancel-on-fail error
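
To make (1) a bit more concrete, the kind of thing I have in mind is below.
This is just an illustrative Python sketch with made-up names, not actual
Heat code or the eventual fix:

import multiprocessing
import threading

# Cap the number of resource actions one engine runs at once, so a big
# ResourceGroup can't flood the DB connection pool / RPC all at once.
_action_slots = threading.BoundedSemaphore(multiprocessing.cpu_count())

def run_resource_action(action):
    # Wait for a free slot, then run the create/update/delete callable.
    with _action_slots:
        return action()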

-Angus


>
> Steve
>

