[openstack-dev] [Heat] convergence rally test results (so far)

Steven Hardy shardy at redhat.com
Tue Sep 1 12:41:48 UTC 2015


On Fri, Aug 28, 2015 at 01:35:52AM +0000, Angus Salkeld wrote:
>    Hi
>    I have been running some rally tests against convergence and our existing
>    implementation to compare.
>    So far I have done the following:
>     1. defined a template with a resource
>        group: https://github.com/asalkeld/convergence-rally/blob/master/templates/resource_group_test_resource.yaml.template
>     2. the inner resource looks like
>        this: https://github.com/asalkeld/convergence-rally/blob/master/templates/server_with_volume.yaml.template (it
>        uses TestResource to attempt to be a reasonable simulation of a
>        server+volume+floatingip)
>     3. defined a rally
>        job: https://github.com/asalkeld/convergence-rally/blob/master/increasing_resources.yaml that
>        creates X resources then updates to X*2 then deletes.
>     4. I then ran the above with/without convergence and with 2,4,8
>        heat-engines
>    Here are the results compared:
>    https://docs.google.com/spreadsheets/d/12kRtPsmZBl_y78aw684PTBg3op1ftUYsAEqXBtT800A/edit?usp=sharing
>    Some notes on the results so far:
>      * convergence with only 2 engines does suffer from RPC overload (it
>        gets message timeouts on larger templates). I wonder if this is the
>        problem in our convergence gate...
>      * convergence does very well with a reasonable number of engines
>        running.
>      * delete is slightly slower on convergence
>    Still to test:
>      * the above, but measure memory usage
>      * many small templates (run concurrently)
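
(For anyone who hasn't opened the links above: an inner template built
around OS::Heat::TestResource is typically of roughly the following shape.
This is only an illustrative sketch of the approach; the real template is at
the github link in point 2, and the resource names and wait times below are
made up.)

heat_template_version: 2015-04-30
# Sketch: use OS::Heat::TestResource to stand in for a server, a volume and
# a floating IP by sleeping on create, rather than touching real services.
# Names and wait_secs values here are illustrative only.
resources:
  fake_server:
    type: OS::Heat::TestResource
    properties:
      value: server
      wait_secs: 10
  fake_volume:
    type: OS::Heat::TestResource
    depends_on: fake_server
    properties:
      value: volume
      wait_secs: 5
  fake_floating_ip:
    type: OS::Heat::TestResource
    depends_on: fake_server
    properties:
      value: floating_ip
      wait_secs: 2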

So, I tried running my many-small-templates stress test from here with
convergence enabled:

https://bugs.launchpad.net/heat/+bug/1489548

In heat.conf I set:

max_resources_per_stack = -1
convergence_engine = true

Most other settings (particularly RPC and DB settings) are defaults.

Without convergence (but with max_resources_per_stack disabled) I see the
time to create a ResourceGroup of 400 nested stacks (each containing one
RandomString resource) is about 2.5 minutes (Core i7 laptop w/SSD, 4 heat
workers, i.e. the default for a 4-core machine).
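
For reference, the stress template behind that bug is roughly of the
following shape (an illustrative sketch reconstructed from the description
above, not the exact reproducer attached to the bug; random.yaml is an
assumed file name for the nested template):

heat_template_version: 2015-04-30
# Outer template: a ResourceGroup of 400 nested stacks, each wrapping a
# single OS::Heat::RandomString.  Sketch only; the nested template
# (random.yaml here) just needs to be supplied alongside the parent template.
resources:
  group:
    type: OS::Heat::ResourceGroup
    properties:
      count: 400
      resource_def:
        type: random.yaml

# random.yaml (the nested template) would be something like:
#
#   heat_template_version: 2015-04-30
#   resources:
#     random:
#       type: OS::Heat::RandomString
#       properties:
#         length: 8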

With convergence enabled, I see these errors from sqlalchemy:

File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 652, in
_checkout\n    fairy = _ConnectionRecord.checkout(pool)\n', u'  File
"/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 444, in
checkout\n    rec = pool._do_get()\n', u'  File
"/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 980, in
_do_get\n    (self.size(), self.overflow(), self._timeout))\n',
u'TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection
timed out, timeout 30\n'].

I assume this means we're loading the DB much more in the convergence case
and overflowing the QueuePool?
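
If it really is just pool exhaustion, one quick experiment would be raising
the oslo.db pool settings in heat.conf and seeing whether the timeouts go
away (the values below are arbitrary, just for testing the theory):

[database]
# SQLAlchemy's defaults are pool_size=5 / max_overflow=10, which matches the
# "size 5 overflow 10" in the traceback above.
max_pool_size = 20
max_overflow = 40
pool_timeout = 60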

This seems to happen when the RPC call from the ResourceGroup tries to
create some of the 400 nested stacks.

Interestingly, after this error the parent stack moves to CREATE_FAILED,
but the engine remains (very) busy, to the point of being only partially
responsive, so it looks like maybe the cancel-on-fail isn't working (I'm
assuming it isn't error_wait_time, because the parent stack has been marked
FAILED and I'm pretty sure it's been more than 240s).

I'll dig a bit deeper when I get time, but for now you might like to try
the stress test too.  It's a bit of a synthetic test, but it turns out to
be a reasonable proxy for some performance issues we observed when creating
large-ish TripleO deployments (which also create a large number of nested
stacks concurrently).

Steve


