[openstack-dev] [Heat] convergence rally test results (so far)

Fox, Kevin M Kevin.Fox at pnnl.gov
Wed Sep 2 00:13:29 UTC 2015


You can default it to the number of cores, but please make it configurable. Some ops cram lots of services onto one node, and one service shouldn't get to monopolize all the cores.
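Something along these lines would do it (a rough sketch only; the option name is a placeholder, not an existing heat.conf option):

    import multiprocessing

    from oslo_config import cfg

    opts = [
        cfg.IntOpt('max_parallel_resource_actions',
                   default=multiprocessing.cpu_count(),
                   help='Maximum number of resource actions a single '
                        'heat-engine will run concurrently. Defaults to '
                        'the number of CPU cores; operators co-hosting '
                        'many services on one node can lower it.'),
    ]

    # registers the option under [DEFAULT]
    cfg.CONF.register_opts(opts)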

Thanks,
Kevin
________________________________
From: Angus Salkeld [asalkeld at mirantis.com]
Sent: Tuesday, September 01, 2015 4:53 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Heat] convergence rally test results (so far)

On Tue, Sep 1, 2015 at 10:45 PM Steven Hardy <shardy at redhat.com> wrote:
On Fri, Aug 28, 2015 at 01:35:52AM +0000, Angus Salkeld wrote:
>    Hi
>    I have been running some rally tests against convergence and our existing
>    implementation to compare.
>    So far I have done the following:
>     1. defined a template with a resource
>        group: https://github.com/asalkeld/convergence-rally/blob/master/templates/resource_group_test_resource.yaml.template
>     2. the inner resource looks like
>        this: https://github.com/asalkeld/convergence-rally/blob/master/templates/server_with_volume.yaml.template (it
>        uses TestResource to attempt to be a reasonable simulation of a
>        server+volume+floatingip)
>     3. defined a rally
>        job: https://github.com/asalkeld/convergence-rally/blob/master/increasing_resources.yaml that
>        creates X resources then updates to X*2 then deletes.
>     4. I then ran the above with/without convergence and with 2,4,8
>        heat-engines
>    Here are the results compared:
>    https://docs.google.com/spreadsheets/d/12kRtPsmZBl_y78aw684PTBg3op1ftUYsAEqXBtT800A/edit?usp=sharing
>    Some notes on the results so far:
>      * convergence with only 2 engines does suffer from RPC overload (it
>        gets message timeouts on larger templates). I wonder if this is the
>        problem in our convergence gate...
>      * convergence does very well with a reasonable number of engines
>        running.
>      * delete is slightly slower on convergence
>    Still to test:
>      * the above, but measure memory usage
>      * many small templates (run concurrently)

So, I tried running my many-small-templates test here with convergence enabled:

https://bugs.launchpad.net/heat/+bug/1489548

In heat.conf I set:

max_resources_per_stack = -1
convergence_engine = true

Most other settings (particularly RPC and DB settings) are defaults.

Without convergence (but with max_resources_per_stack disabled) I see the
time to create a ResourceGroup of 400 nested stacks (each containing one
RandomString resource) is about 2.5 minutes (core i7 laptop w/SSD, 4 heat
workers, i.e. the default for a 4-core machine).
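
(For reference, the test template is roughly this shape; the nested file
name below is just a placeholder:)

    # parent template
    heat_template_version: 2013-05-23
    resources:
      the_group:
        type: OS::Heat::ResourceGroup
        properties:
          count: 400
          resource_def:
            type: random.yaml

    # random.yaml (the nested template, one per group member)
    heat_template_version: 2013-05-23
    resources:
      random:
        type: OS::Heat::RandomString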

With convergence enabled, I see these errors from sqlalchemy:

File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 652, in
_checkout\n    fairy = _ConnectionRecord.checkout(pool)\n', u'  File
"/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 444, in
checkout\n    rec = pool._do_get()\n', u'  File
"/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 980, in
_do_get\n    (self.size(), self.overflow(), self._timeout))\n',
u'TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection
timed out, timeout 30\n'].

I assume this means we're loading the DB much more in the convergence case
and overflowing the QueuePool?

Yeah, looks like it.
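Those limits in the error (pool size 5, overflow 10) are the oslo.db/SQLAlchemy
defaults; as a short-term workaround you could try raising them in heat.conf,
something like (values are just an example):

    [database]
    max_pool_size = 20
    max_overflow = 40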


This seems to happen when the RPC call from the ResourceGroup tries to
create some of the 400 nested stacks.

Interestingly, after this error the parent stack moves to CREATE_FAILED,
but the engine remains (very) busy, to the point of being only partially
responsive, so it looks like maybe the cancel-on-fail isn't working (I'm
assuming it isn't error_wait_time, because the parent stack has been marked
FAILED and I'm pretty sure it's been more than 240s).

I'll dig a bit deeper when I get time, but for now you might like to try
the stress test too.  It's a bit of a synthetic test, but it turns out to
be a reasonable proxy for some performance issues we observed when creating
large-ish TripleO deployments (which also create a large number of nested
stacks concurrently).

Thanks a lot for testing, Steve! I'll file two bugs for what you've raised:
1. limit the number of resource actions run in parallel (maybe based on the number of cores)
2. the cancel-on-fail error
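
For (1), roughly what I have in mind (a sketch only, not real Heat code; the
names are placeholders, and the limit would come from a config option as
Kevin suggests above):

    import multiprocessing

    from eventlet import semaphore

    # would be read from a config option; CPU count as the default
    MAX_PARALLEL_RESOURCE_ACTIONS = multiprocessing.cpu_count()

    _action_limiter = semaphore.Semaphore(MAX_PARALLEL_RESOURCE_ACTIONS)


    def run_resource_action(action, *args, **kwargs):
        # each greenthread takes a slot before doing the DB/RPC work for a
        # resource, so an engine never works on more than
        # MAX_PARALLEL_RESOURCE_ACTIONS resources at once
        with _action_limiter:
            return action(*args, **kwargs)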

-Angus


Steve

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev