[openstack-dev] [Heat] convergence rally test results (so far)

Anant Patil anant.techie at gmail.com
Wed Sep 2 04:12:57 UTC 2015


When the stack fails, it is marked as FAILED and all the sync points
that are needed to trigger the next set of resources are deleted. The
resources at the same level in the graph, like the ones here, are
supposed to time out or fail with an exception. The many DB hits mean
that the cached data we were maintaining is not being used in the way
we intended.
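
To make that concrete, here is a very rough sketch of the sync-point idea
(hypothetical names and in-memory structures, not Heat's actual code, where
the real sync points live in the DB):

# Each resource waits on a sync point holding its unfinished requirements;
# completing a resource satisfies the sync points of its dependents, and a
# stack failure deletes the remaining sync points so nothing downstream is
# ever triggered, leaving in-flight siblings to time out or fail.

graph = {"volume": set(), "server": {"volume"}, "floating_ip": {"server"}}
sync_points = {res: set(requires) for res, requires in graph.items()}


def resource_done(res):
    for dependent, pending in sync_points.items():
        if res in pending:
            pending.discard(res)
            if not pending:
                print("triggering", dependent)


def stack_failed():
    # stack is marked FAILED; deleting the sync points stops the traversal
    sync_points.clear()


resource_done("volume")   # -> triggering server
stack_failed()            # server / floating_ip are now left to time out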

I don't see why we really need 1; if it works with the legacy engine without
any such constraints, it should work with convergence as well.

--
Anant

On Wed, Sep 2, 2015 at 5:23 AM, Angus Salkeld <asalkeld at mirantis.com> wrote:

> On Tue, Sep 1, 2015 at 10:45 PM Steven Hardy <shardy at redhat.com> wrote:
>
>> On Fri, Aug 28, 2015 at 01:35:52AM +0000, Angus Salkeld wrote:
>> >    Hi
>> >    I have been running some rally tests against convergence and our
>> existing
>> >    implementation to compare.
>> >    So far I have done the following:
>> >     1. defined a template with a resource group:
>> >        https://github.com/asalkeld/convergence-rally/blob/master/templates/resource_group_test_resource.yaml.template
>> >     2. the inner resource looks like this:
>> >        https://github.com/asalkeld/convergence-rally/blob/master/templates/server_with_volume.yaml.template
>> >        (it uses TestResource to attempt to be a reasonable simulation of a
>> >        server+volume+floatingip)
>> >     3. defined a rally job:
>> >        https://github.com/asalkeld/convergence-rally/blob/master/increasing_resources.yaml
>> >        that creates X resources then updates to X*2 then deletes.
>> >     4. I then ran the above with/without convergence and with 2, 4, 8
>> >        heat-engines
>> >    Here are the results compared:
>> >
>> https://docs.google.com/spreadsheets/d/12kRtPsmZBl_y78aw684PTBg3op1ftUYsAEqXBtT800A/edit?usp=sharing
>> >    Some notes on the results so far:
>> >      * convergence with only 2 engines does suffer from RPC overload (it
>> >        gets message timeouts on larger templates). I wonder if this is the
>> >        problem in our convergence gate...
>> >      * convergence does very well with a reasonable number of engines
>> >        running.
>> >      * delete is slightly slower on convergence
>> >    Still to test:
>> >      * the above, but measure memory usage
>> >      * many small templates (run concurrently)
>>
>> So, I tried running my many-small-templates test from here with convergence enabled:
>>
>> https://bugs.launchpad.net/heat/+bug/1489548
>>
>> In heat.conf I set:
>>
>> max_resources_per_stack = -1
>> convergence_engine = true
>>
>> Most other settings (particularly RPC and DB settings) are defaults.
>>
>> Without convergence (but with max_resources_per_stack disabled) I see the
>> time to create a ResourceGroup of 400 nested stacks (each containing one
>> RandomString resource) is about 2.5 minutes (core i7 laptop w/SSD, 4 heat
>> workers, i.e. the default for a 4-core machine).
>>
>> With convergence enabled, I see these errors from sqlalchemy:
>>
>> File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 652, in
>> _checkout\n    fairy = _ConnectionRecord.checkout(pool)\n', u'  File
>> "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 444, in
>> checkout\n    rec = pool._do_get()\n', u'  File
>> "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 980, in
>> _do_get\n    (self.size(), self.overflow(), self._timeout))\n',
>> u'TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection
>> timed out, timeout 30\n'].
>>
>> I assume this means we're loading the DB much more in the convergence case
>> and overflowing the QueuePool?
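
For reference, the "size 5 overflow 10 ... timeout 30" in that error are just
SQLAlchemy's QueuePool defaults; a minimal snippet showing the knobs involved
(the URL is a stand-in, and IIRC the matching heat.conf options are
max_pool_size / max_overflow / pool_timeout in the [database] section):

# Where "size 5 overflow 10 ... timeout 30" comes from: SQLAlchemy's default
# QueuePool settings. Bumping them only hides the extra DB load convergence
# is generating, but it is a quick way to confirm the diagnosis.
from sqlalchemy import create_engine
from sqlalchemy.pool import QueuePool

engine = create_engine(
    "sqlite://",            # stand-in URL; Heat would use its MySQL DSN
    poolclass=QueuePool,    # forced here only so the sqlite example matches
    pool_size=5,            # connections kept open in the pool (default 5)
    max_overflow=10,        # extra connections allowed under load (default 10)
    pool_timeout=30,        # seconds to wait for a connection (default 30)
)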
>>
>
> Yeah, looks like it.
>
>
>>
>> This seems to happen when the RPC call from the ResourceGroup tries to
>> create some of the 400 nested stacks.
>>
>> Interestingly, after this error the parent stack moves to CREATE_FAILED,
>> but the engine remains (very) busy, to the point of being only partially
>> responsive, so it looks like maybe the cancel-on-fail isn't working (I'm
>> assuming it isn't error_wait_time, because the parent stack has been marked
>> FAILED and I'm pretty sure it's been more than 240s).
>>
>> I'll dig a bit deeper when I get time, but for now you might like to try
>> the stress test too.  It's a bit of a synthetic test, but it turns out to
>> be a reasonable proxy for some performance issues we observed when
>> creating
>> large-ish TripleO deployments (which also create a large number of nested
>> stacks concurrently).
>>
>
> Thanks a lot for testing, Steve! I'll make 2 bugs for what you have raised:
> 1. limit the number of resource actions in parallel (maybe based on the
>    number of cores; rough sketch below)
> 2. the cancel-on-fail error
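
For 1, a rough sketch of the kind of cap that could work (purely illustrative,
not Heat code; a real version would need an eventlet-friendly semaphore rather
than a plain threading one, and the limit itself is a guess):

# Hypothetical cap on concurrent resource actions per engine, sized from the
# CPU count: extra work blocks until a slot frees up instead of piling more
# load onto the DB and RPC layer.
import multiprocessing
import threading

MAX_PARALLEL_ACTIONS = multiprocessing.cpu_count() * 4   # tunable guess
_action_slots = threading.BoundedSemaphore(MAX_PARALLEL_ACTIONS)

def run_resource_action(action, *args, **kwargs):
    with _action_slots:      # blocks when too many actions are in flight
        return action(*args, **kwargs)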
>
> -Angus
>
>
>>
>> Steve
>>