[openstack-dev] [Nova] A multi-cell instance-list performance test

Jim Rollenhagen jim at jimrollenhagen.com
Thu Aug 16 11:48:30 UTC 2018


The data as a table, for anyone reading this in plain text (and the
archives):
https://imgur.com/2NC45n9.png

// jim

On Thu, Aug 16, 2018 at 3:53 AM, Zhenyu Zheng <zhengzhenyulixi at gmail.com>
wrote:

> Hi, Nova
>
> As the Cells v2 architecture is maturing, and CERN is using it and it seems
> to work well, *Huawei* is also considering using it in our Public Cloud
> deployments.
> Since we still have concerns about the performance of multi-cell listing,
> *Yikun Jiang* and I recently ran a performance test of ``instance list``
> across a multi-cell deployment, and we would like to share our test results
> and findings.
>
> First, a note on our testing environment: we (Yikun and I) are doing this
> as a proof-of-concept test (to show the ratio between the time spent
> querying data from the DBs, sorting, and so on), so we ran it on our own
> machine, which has 16 CPUs and 80 GB RAM; as it is an old machine, the disk
> might be slow. So we will not judge the absolute time figures themselves,
> but the overall logic and the ratios between the different steps. We used a
> devstack deployment on this single machine.
>
> Next, our test plan: we set up 10 cells (cell1~cell10) and generated 10000
> instance records in those cells (at 20 instances per host that corresponds
> to roughly 500 hosts, which seems a reasonable size for a cell); cell0 is
> kept empty, since the number of errored instances should be very small and
> it doesn't really matter here.
> We tested the time consumption of listing instances across 1, 2, 5 and 10
> cells (cell0 is always queried, so it is actually 2, 3, 6 and 11 cells)
> with limits of 100, 200, 500 and 1000, as the default maximum limit is
> 1000. To get more general results, we ran the listing with the default sort
> key and direction, sorted by instance uuid, and sorted by uuid & name.
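>
> A minimal sketch of the kind of request we are timing (``GET /servers/detail``
> with a limit plus explicit sort keys/directions); the endpoint, token and
> microversion below are illustrative assumptions, not part of the test code:
>
>     import requests
>
>     NOVA_URL = 'http://127.0.0.1/compute/v2.1'   # devstack-style endpoint (assumed)
>     TOKEN = '<keystone token>'                   # obtained separately
>
>     def list_servers(limit, sort_keys=None):
>         # list instances with an explicit limit and sort keys/directions
>         params = [('limit', limit)]
>         for key in (sort_keys or []):
>             params.append(('sort_key', key))
>             params.append(('sort_dir', 'asc'))
>         resp = requests.get(
>             NOVA_URL + '/servers/detail',
>             params=params,
>             headers={'X-Auth-Token': TOKEN,
>                      'X-OpenStack-Nova-API-Version': '2.60'})
>         resp.raise_for_status()
>         return resp.json()['servers']
>
>     # e.g. the "uuid & name sort" case with the default maximum limit:
>     # servers = list_servers(1000, sort_keys=['uuid', 'display_name'])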
>
> This is what we got (all times are in seconds):
>
> (Total = total cost, Gather = data gather cost, Merge = merge sort cost,
> View = construct view)
>
>              |          Default sort           |            uuid sort            |          uuid+name sort
> Cells  Limit |   Total  Gather   Merge    View |   Total  Gather   Merge    View |   Total  Gather   Merge    View
> -----  ----- | ------- -------  ------  ------ | ------- -------  ------  ------ | ------- -------  ------  ------
>    10    100 |  2.3313  2.1306  0.1145  0.0672 |  2.3693  2.1343  0.1148  0.1016 |  2.3284  2.1264  0.1145  0.0679
>          200 |  3.5979  3.2137  0.2287  0.1265 |  3.5316  3.1509  0.2265  0.1255 |   3.481   3.054  0.2697  0.1284
>          500 |  7.1952  6.2597  0.5704  0.3029 |  7.5057  6.4761  0.6263   0.341 |  7.4885  6.4623  0.6239  0.3404
>         1000 | 13.5745 11.7012  1.1511  0.5966 | 13.8408 11.9007  1.2268  0.5939 | 13.8813  11.913  1.2301  0.6187
>     5    100 |  1.3142  1.1003  0.1163  0.0706 |  1.2458  1.0498  0.1163  0.0665 |  1.2528  1.0579  0.1161   0.066
>          200 |  2.0151  1.6063  0.2645  0.1255 |  1.9866  1.5386  0.2668  0.1615 |  2.0352  1.6246  0.2646  0.1262
>          500 |  4.2109  3.1358  0.7033  0.3343 |  4.1605  3.0893  0.6951  0.3384 |  4.1972  3.2461  0.6104  0.3028
>         1000 |   7.841  5.8881  1.2027  0.6802 |  7.7135  5.9121  1.1363  0.5969 |  7.8377  5.9385  1.1936  0.6376
>     2    100 |  0.6736  0.4727  0.1113  0.0822 |   0.605  0.4192  0.1105  0.0656 |   0.688  0.4613  0.1126  0.0682
>          200 |  1.1226  0.7229  0.2577  0.1255 |  1.0268  0.6671  0.2255  0.1254 |  1.2805  0.8171  0.2222  0.1258
>          500 |  2.2358  1.3506  0.5595  0.3026 |  2.3307  1.2748  0.6581  0.3362 |   2.741  1.6023   0.633  0.3365
>         1000 |  4.2079  2.3367  1.2053  0.5986 |  4.2384  2.4071  1.2017   0.633 |  4.3437  2.4136   1.217  0.6394
>     1    100 |  0.4857  0.2869  0.1097   0.069 |  0.4205   0.233  0.1131  0.0672 |  0.6372  0.3305   0.196  0.0681
>          200 |  0.6835  0.3236  0.2212  0.1256 |  0.7777  0.3754   0.261    0.13 |  0.9245  0.4527   0.227   0.129
>          500 |  1.5848  0.6415  0.6251  0.3043 |  1.6472  0.6554  0.6292  0.3053 |  1.9455  0.8201  0.5918  0.3447
>         1000 |  3.1692  1.2124  1.2246  0.6762 |  3.0836  1.2286  1.2055   0.643 |  3.0991  1.2248  1.2615  0.6028
>
> Our conclusions from the data are:
> 1. The time consumed by the *MERGE SORT* step correlates strongly with the
> *LIMIT*, and seems *not* to be affected by the *number of cells* (see the
> sketch after this list);
> 2. The major part of the overall time is actually the data gathering step,
> so we took a closer look at that.
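>
> This matches how the cross-cell merge behaves: each cell returns an already
> sorted page of at most ``limit`` records, and the API host only has to pop
> ``limit`` items off a heap-based merge of those per-cell streams. A minimal
> Python 3 sketch of that idea (not Nova's exact code, which wraps records and
> compares on the real sort keys/directions):
>
>     import heapq
>     import itertools
>     from operator import itemgetter
>
>     def merge_cell_results(per_cell_results, sort_keys=('uuid',), limit=1000):
>         # per_cell_results: one list of instance dicts per cell, each list
>         # already sorted by sort_keys (the per-cell DB query guarantees that)
>         key = itemgetter(*sort_keys)
>         merged = heapq.merge(*per_cell_results, key=key)
>         # only the first `limit` records are ever pulled off the heap, so
>         # the cost tracks the limit much more than the number of cells
>         return list(itertools.islice(merged, limit))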
>
> We added some audit logging to the code, and from the log we can see:
>
> 02:24:53.376705 db begin, nova_cell0
>
> 02:24:53.425836 db end, nova_cell0: 0.0487968921661
>
> 02:24:53.426622 db begin, nova_cell1
>
> 02:24:54.451235 db end, nova_cell1: 1.02400803566
>
> 02:24:54.451991 db begin, nova_cell2
>
> 02:24:55.715769 db end, nova_cell2: 1.26333093643
>
> 02:24:55.716575 db begin, nova_cell3
>
> 02:24:56.963428 db end, nova_cell3: 1.24626398087
>
> 02:24:56.964202 db begin, nova_cell4
>
> 02:24:57.980187 db end, nova_cell4: 1.01546406746
>
> 02:24:57.980970 db begin, nova_cell5
>
> 02:24:59.279139 db end, nova_cell5: 1.29762792587
>
> 02:24:59.279904 db begin, nova_cell6
>
> 02:25:00.311717 db end, nova_cell6: 1.03130197525
>
> 02:25:00.312427 db begin, nova_cell7
>
> 02:25:01.654819 db end, nova_cell7: 1.34187483788
>
> 02:25:01.655643 db begin, nova_cell8
>
> 02:25:02.689731 db end, nova_cell8: 1.03352093697
>
> 02:25:02.690502 db begin, nova_cell9
>
> 02:25:04.076885 db end, nova_cell9: 1.38588285446
>
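> The 'db begin' / 'db end' lines above come from timing code we added around
> each per-cell query, roughly like this simplified sketch (not the exact
> patch we used):
>
>     import logging
>     import time
>
>     LOG = logging.getLogger(__name__)
>
>     def timed_cell_query(cell_name, query_fn, *args, **kwargs):
>         # emits the "db begin"/"db end" audit lines shown above
>         LOG.info('db begin, %s', cell_name)
>         start = time.time()
>         try:
>             return query_fn(*args, **kwargs)
>         finally:
>             LOG.info('db end, %s: %s', cell_name, time.time() - start)
>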
>
> Yes, the DB queries were performed serially. After some investigation, it
> seems that we were not able to apply eventlet.monkey_patch() when running
> under uWSGI, so Yikun made this fix:
>
> https://review.openstack.org/#/c/592285/
>
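> The idea behind the fix, as a minimal sketch (the module layout and names
> here are illustrative; see the review above for the real change), is to make
> sure eventlet monkey patching happens before the WSGI application and the DB
> layer are imported when running under uWSGI, so the per-cell queries can
> yield to each other:
>
>     import eventlet
>
>     # must run before anything imports the DB layer or other socket users
>     eventlet.monkey_patch()
>
>     # only then build the actual WSGI application (illustrative entry point)
>     from nova.api.openstack import wsgi_app
>     application = wsgi_app.init_application('osapi_compute')
>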
>
> After making this change, we tested again (the 10-cell, limit 1000, default
> sort case) and got this kind of data:
>
>
>                          total   collect     sort     view
> before monkey_patch    13.5745   11.7012   1.1511   0.5966
> after monkey_patch     12.8367   10.5471   1.5642   0.6041
>
> The performance improved a little, and from the log we can see:
>
> Aug 16 02:14:46.383081 begin detail api
>
> Aug 16 02:14:46.406766 begin cell gather begin
>
> Aug 16 02:14:46.419346 db begin, nova_cell0
>
> Aug 16 02:14:46.425065 db begin, nova_cell1
>
> Aug 16 02:14:46.430151 db begin, nova_cell2
>
> Aug 16 02:14:46.435012 db begin, nova_cell3
>
> Aug 16 02:14:46.440634 db begin, nova_cell4
>
> Aug 16 02:14:46.446191 db begin, nova_cell5
>
> Aug 16 02:14:46.450749 db begin, nova_cell6
>
> Aug 16 02:14:46.455461 db begin, nova_cell7
>
> Aug 16 02:14:46.459959 db begin, nova_cell8
>
> Aug 16 02:14:46.466066 db begin, nova_cell9
>
> Aug 16 02:14:46.470550 db begin, nova_cell10
>
> Aug 16 02:14:46.731882 db end, nova_cell0: 0.311906099319
>
> Aug 16 02:14:52.667791 db end, nova_cell5: 6.22100400925
>
> Aug 16 02:14:54.065655 db end, nova_cell1: 7.63998198509
>
> Aug 16 02:14:54.939856 db end, nova_cell3: 8.50425100327
>
> Aug 16 02:14:55.309017 db end, nova_cell6: 8.85762405396
>
> Aug 16 02:14:55.309623 db end, nova_cell8: 8.84928393364
>
> Aug 16 02:14:55.310240 db end, nova_cell2: 8.87976694107
>
> Aug 16 02:14:56.057487 db end, nova_cell10: 9.58636116982
>
> Aug 16 02:14:56.058001 db end, nova_cell4: 9.61698698997
>
> Aug 16 02:14:56.058547 db end, nova_cell9: 9.59216403961
>
> Aug 16 02:14:56.954209 db end, nova_cell7: 10.4981210232
>
> Aug 16 02:14:56.954665 end cell gather end: 10.5480799675
>
> Aug 16 02:14:56.955010 begin heapq.merge
>
> Aug 16 02:14:58.527040 end heapq.merge: 1.57150006294
>
>
> So now the queries are issued in parallel, but the total time still looks
> almost as if they had run serially.
>
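> For context, the per-cell gather is a scatter-gather over eventlet
> greenthreads (Nova has a helper along these lines,
> nova.context.scatter_gather_cells); a rough sketch of the pattern, with
> placeholder names rather than Nova's real code:
>
>     import eventlet
>     eventlet.monkey_patch()
>
>     def gather_all_cells(cell_names, query_one_cell):
>         # query_one_cell(cell) stands in for "open a DB session for that
>         # cell and run the instance list query"
>         pool = eventlet.GreenPool()
>         results = {}
>
>         def _query(cell):
>             results[cell] = query_one_cell(cell)
>
>         for cell in cell_names:
>             pool.spawn_n(_query, cell)
>         # greenthreads only overlap while a query is blocked on I/O; the
>         # Python-side work still shares one interpreter, which may be why
>         # the wall-clock gain is smaller than the parallel starts suggest
>         pool.waitall()
>         return results
>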
>
> We tried adjusting database configs such as max_thread_pool and use_tpool,
> and we also tried using a separate DB for some of the cells, but the results
> showed no big difference.
>
>
> So, the above is what we have for now; feel free to ping us if you have any
> questions or suggestions.
>
>
> BR,
>
>
> Zhenyu Zheng
>
>
>