[openstack-dev] [neutron][all] switch from mysqldb to another eventlet aware mysql client

Ihar Hrachyshka ihrachys at redhat.com
Tue Jul 15 22:30:37 UTC 2014


On 14/07/14 22:48, Vishvananda Ishaya wrote:
> 
> On Jul 13, 2014, at 9:29 AM, Ihar Hrachyshka <ihrachys at redhat.com> wrote:
> 
>> On 12/07/14 03:17, Mike Bayer wrote:
>>> 
>>> On 7/11/14, 7:26 PM, Carl Baldwin wrote:
>>>> 
>>>> On Jul 11, 2014 5:32 PM, "Vishvananda Ishaya" <vishvananda at gmail.com> wrote:
>>>>> 
>>>>> I have tried using pymysql in place of mysqldb and in real world
>>>>> concurrency tests against cinder and nova it performs slower. I was
>>>>> inspired by the mention of mysql-connector so I just tried that
>>>>> option instead. Mysql-connector seems to be slightly slower as
>>>>> well, which leads me to believe that the blocking inside of
>>>> 
>>>> Do you have some numbers?  "Seems to be slightly slower" doesn't
>>>> really stand up as an argument against the numbers that have been
>>>> posted in this thread.
> 
> Numbers are highly dependent on a number of other factors, but I
> was seeing 100 concurrent list commands against cinder going from
> an average of 400 ms to an average of around 600 ms with both
> mysql-connector and pymysql.

I've run my tests against neutron only, so it's possible that cinder
behaves differently.

But those numbers alone don't tell us much when considering the
switch. Do you have numbers for the mysqldb case to compare against?

> 
> It is also worth mentioning that my test of 100 concurrent creates
> from the same project in cinder leads to average response times
> over 3 seconds. Note that creates return before the request is sent
> to the node for processing, so this is just the api creating the db
> record and sticking a message on the queue. A huge part of the
> slowdown is in quota reservation processing which does a row lock
> on the project id.

Again, are those 3 seconds better or worse than what we get with mysqldb?

> 
> Before we are sure that an eventlet friendly backend “gets rid of
> all deadlocks”, I will mention that trying this test against
> connector leads to some requests timing out at our load balancer (5
> minute timeout), so we may actually be introducing deadlocks where
> the retry_on_deadlock operator is used.

Deadlocks != timeouts. I'm attempting to fix eventlet-triggered db
deadlocks, not every possible deadlock one might envision, and not
timeouts.

> 
> Consider the above anecdotal for the moment, since I can’t verify
> for sure that switching the sql driver didn’t introduce some other
> race or unrelated problem.
> 
> Let me just caution that we can’t recommend replacing our mysql
> backend without real performance and load testing.

I agree. I'm not claiming the tests are complete, but here is what
I've been working on for the last two days.

There is a nice OpenStack project called Rally that is designed to
make benchmarking OpenStack projects easy. It has four scenarios
implemented for neutron: networks, ports, routers, and subnets. Each
scenario combines create and list commands.

I've run each test with the following runner settings: times = 100,
concurrency = 10, meaning each scenario is executed 100 times with no
more than 10 iterations running in parallel. Then I repeated the same
with times = 100, concurrency = 20 (also setting max_pool_size to 20
so that sqlalchemy can utilize that level of parallelism), and with
times = 1000, concurrency = 100 (same note on sqlalchemy parallelism).
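
To make the setup concrete, here is roughly what one of those task
definitions looks like, written out as a Python dict (the real Rally
task file is the JSON equivalent; scenario args and context are
omitted, and the scenario name follows Rally's neutron benchmark
naming as far as I remember):

    # Sketch of a single Rally task definition; the actual task file
    # is the JSON equivalent of this dict, with args/context omitted.
    task = {
        "NeutronNetworks.create_and_list_networks": [
            {
                "runner": {
                    "type": "constant",   # fixed number of iterations
                    "times": 100,         # total scenario iterations
                    "concurrency": 10,    # iterations run in parallel
                },
            },
        ],
    }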

You can find detailed html files with nice graphs here [1]. A brief
description of the results is below. The percentages are the relative
change in average scenario time when switching from mysqldb to
mysql-connector, so negative numbers mean less time spent, i.e. faster:

1. create_and_list_networks scenario: for 10 parallel workers the
boost is -12.5% of the original time, for 20 workers it's -6.3%, and
for 100 workers the average time spent per scenario went up by +9.4%
(this is the only scenario that showed a slight reduction in
performance; I'll rerun the test tomorrow to see whether some
discrepancy during execution influenced the result).

2. create_and_list_ports scenario: for 10 parallel workers boost is
-25.8%, for 20 workers it's -9.4%, and for 100 workers it's -12.6%.

3. create_and_list_routers scenario: for 10 parallel workers boost is
-46.6% (almost half of the original time), for 20 workers it's -51.7%
(more than half), and for 100 workers it's -41.5%.

4. create_and_list_subnets scenario: for 10 parallel workers boost is
-26.4%, for 20 workers it's -51.1% (more than half the time of the
average scenario saved), and for 100 workers it's -31.7%.
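
To spell out how I read those percentages, a tiny sketch of the
arithmetic with made-up numbers (the real averages are in the html
reports):

    # Relative change in average scenario time; negative means the
    # mysql-connector run was faster than the mysqldb baseline.
    # These numbers are made up purely for illustration.
    avg_mysqldb = 2.0       # seconds per scenario with mysqldb
    avg_connector = 1.75    # seconds per scenario with mysql-connector
    change = (avg_connector - avg_mysqldb) / avg_mysqldb * 100
    print("%.1f%%" % change)  # prints -12.5%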

I've tried to check how it scales up to 200 parallel workers, but I
was hit by the local open file limit and the mysql max_connections
setting. I will retry my tests tomorrow with those limits raised to
see how it handles that load.
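
(As a side note for anyone reproducing this: the open file limit can
be raised with ulimit before starting the services, or from inside a
Python process via the resource module, while max_connections has to
be raised in the MySQL server configuration. A minimal sketch of the
former, purely as an illustration:)

    import resource

    # Bump the soft limit on open file descriptors up to the hard
    # limit; the hard limit itself and mysql's max_connections have to
    # be raised outside the process (ulimit/limits.conf and my.cnf).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))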

Tomorrow I will also try to test the new library with multiple API workers.

Other than that, what are your suggestions on what to check/test?

FYI: [1] contains the following directories:

mysqlconnector/
mysqldb/

Each of them contains the following directories:
10-10/ - 10 parallel workers, max_pool_size = 10 (default)
20-100/ - 20 parallel workers, max_pool_size = 100
100-100/ - 100 parallel workers, max_pool_size = 100
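
In case it helps with reproducing the comparison: the two sets of runs
differ only in which DB driver sqlalchemy loads, and that is selected
by the dialect prefix of the database connection URL. A minimal sketch
(host and credentials below are made up):

    from sqlalchemy import create_engine

    # Baseline: MySQLdb (mysql-python), which plain "mysql://" URLs
    # default to.
    mysqldb_engine = create_engine(
        "mysql+mysqldb://neutron:secret@127.0.0.1/neutron")

    # Candidate: MySQL Connector/Python, the eventlet-friendly driver
    # under test here.
    connector_engine = create_engine(
        "mysql+mysqlconnector://neutron:secret@127.0.0.1/neutron")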

Happy analysis!

[1]: http://people.redhat.com/~ihrachys/

/Ihar

> 
> Vish
> 
>>>> 
>>>>> sqlalchemy is not the main bottleneck across projects.
>>>>> 
>>>>> Vish
>>>>> 
>>>>> P.S. The performance in all cases was abysmal, so performance work
>>>>> definitely needs to be done, but just the guess that replacing our
>>>>> mysql library is going to solve all of our performance problems
>>>>> appears to be incorrect at first blush.
>>>> 
>>>> The motivation is still mostly deadlock relief but more performance
>>>> work should be done.  I agree with you there.  I'm still hopeful for
>>>> some improvement from this.
>>> 
>>> 
>>> To identify performance that's alleviated by async you have to 
>>> establish up front that IO blocking is the issue, which would 
>>> entail having code that's blazing fast until you start running
>>> it against concurrent connections, at which point you can
>>> identify via profiling that IO operations are being serialized.
>>> This is a very specific issue.
>>> 
>>> In contrast, to identify why some arbitrary openstack app is
>>> slow, my bet is that async is often not the big issue.   Every
>>> day I look at openstack code and talk to people working on
>>> things,  I see many performance issues that have nothing to do
>>> with concurrency, and as I detailed in my wiki page at 
>>> https://wiki.openstack.org/wiki/OpenStack_and_SQLAlchemy there
>>> is a long road to cleaning up all the excessive queries,
>>> hundreds of unnecessary rows and columns being pulled over the
>>> network, unindexed lookups, subquery joins, hammering of
>>> Python-intensive operations (often due to the nature of OS apps
>>> as lots and lots of tiny API calls) that can be cached.
>>> There's a clear path to tons better performance documented
>>> there and most of it is not about async  - which means that
>>> successful async isn't going to solve all those issues.
>>> 
>> 
>> Of course there is a long road to decent performance, and switching
>> a library won't magically fix all our issues. But if it fixes
>> deadlocks and gives a 30% to 150% performance boost for different
>> operations, and since the switch itself is almost painless, it is
>> something worth doing.