[openstack-dev] [Nova] A multi-cell instance-list performance test

Zhenyu Zheng zhengzhenyulixi at gmail.com
Thu Aug 16 07:53:41 UTC 2018


Hi, Nova

As the Cells v2 architecture is getting mature, and CERN has been using it
with good results, Huawei is also considering using it in our Public Cloud
deployments. Since we still had concerns about the performance of
multi-cell listing, Yikun Jiang and I recently ran a performance test for
``instance list`` across a multi-cell deployment, and we would like to
share our test results and findings.

First, a note on our testing environment: since we (Yikun and I) are doing
this as a proof of concept (to show the ratio between the time spent
querying data from the DBs, sorting, etc.), we ran it on our own machine,
which has 16 CPUs and 80 GB of RAM; it is an old box, so the disk may be
slow. We will therefore not judge the absolute time consumption, but rather
the overall logic and the ratios between the different steps. Everything
runs in a devstack deployment on this single machine.

Next, our test plan: we set up 10 cells (cell1~cell10) and generated 10000
instance records in those cells (at 20 instances per host that is about 500
hosts, which seems a reasonable size for a cell). cell0 is kept empty, as
the number of errored instances is usually very small and does not really
matter here. We then measured the time to list instances across 1, 2, 5 and
10 cells (cell0 is always queried, so it is actually 2, 3, 6 and 11 cells)
with limits of 100, 200, 500 and 1000, 1000 being the default maximum
limit. To get more general results, we ran the listing with the default
sort key and direction, sorted by uuid, and sorted by uuid & name; the
requests are illustrated below.
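
For reference, these listings use the normal paging/sorting parameters of
the servers API; the requests we issued looked roughly like the following
(illustrative only, the exact parameters may differ slightly from what we
scripted):

  GET /servers/detail?limit=1000
  GET /servers/detail?limit=1000&sort_key=uuid&sort_dir=asc
  GET /servers/detail?limit=1000&sort_key=uuid&sort_key=display_name&sort_dir=asc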

This is what we got (all times are in seconds):

Default sort:

Cells  Limit  Total Cost  Data Gather  Merge Sort  Construct View
10     100    2.3313      2.1306       0.1145      0.0672
10     200    3.5979      3.2137       0.2287      0.1265
10     500    7.1952      6.2597       0.5704      0.3029
10     1000   13.5745     11.7012      1.1511      0.5966
5      100    1.3142      1.1003       0.1163      0.0706
5      200    2.0151      1.6063       0.2645      0.1255
5      500    4.2109      3.1358       0.7033      0.3343
5      1000   7.841       5.8881       1.2027      0.6802
2      100    0.6736      0.4727       0.1113      0.0822
2      200    1.1226      0.7229       0.2577      0.1255
2      500    2.2358      1.3506       0.5595      0.3026
2      1000   4.2079      2.3367       1.2053      0.5986
1      100    0.4857      0.2869       0.1097      0.069
1      200    0.6835      0.3236       0.2212      0.1256
1      500    1.5848      0.6415       0.6251      0.3043
1      1000   3.1692      1.2124       1.2246      0.6762

Sort by uuid:

Cells  Limit  Total Cost  Data Gather  Merge Sort  Construct View
10     100    2.3693      2.1343       0.1148      0.1016
10     200    3.5316      3.1509       0.2265      0.1255
10     500    7.5057      6.4761       0.6263      0.341
10     1000   13.8408     11.9007      1.2268      0.5939
5      100    1.2458      1.0498       0.1163      0.0665
5      200    1.9866      1.5386       0.2668      0.1615
5      500    4.1605      3.0893       0.6951      0.3384
5      1000   7.7135      5.9121       1.1363      0.5969
2      100    0.605       0.4192       0.1105      0.0656
2      200    1.0268      0.6671       0.2255      0.1254
2      500    2.3307      1.2748       0.6581      0.3362
2      1000   4.2384      2.4071       1.2017      0.633
1      100    0.4205      0.233        0.1131      0.0672
1      200    0.7777      0.3754       0.261       0.13
1      500    1.6472      0.6554       0.6292      0.3053
1      1000   3.0836      1.2286       1.2055      0.643

Sort by uuid + name:

Cells  Limit  Total Cost  Data Gather  Merge Sort  Construct View
10     100    2.3284      2.1264       0.1145      0.0679
10     200    3.481       3.054        0.2697      0.1284
10     500    7.4885      6.4623       0.6239      0.3404
10     1000   13.8813     11.913       1.2301      0.6187
5      100    1.2528      1.0579       0.1161      0.066
5      200    2.0352      1.6246       0.2646      0.1262
5      500    4.1972      3.2461       0.6104      0.3028
5      1000   7.8377      5.9385       1.1936      0.6376
2      100    0.688       0.4613       0.1126      0.0682
2      200    1.2805      0.8171       0.2222      0.1258
2      500    2.741       1.6023       0.633       0.3365
2      1000   4.3437      2.4136       1.217       0.6394
1      100    0.6372      0.3305       0.196       0.0681
1      200    0.9245      0.4527       0.227       0.129
1      500    1.9455      0.8201       0.5918      0.3447
1      1000   3.0991      1.2248       1.2615      0.6028

Our conclusions from the data are:
1. The time spent in the MERGE SORT step correlates strongly with the LIMIT
and seems not to be affected by the number of cells;
2. The major part of the whole process is actually the data gathering step,
so we took a closer look at that.
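
As background on point 1: the merge step boils down to a lazy k-way merge
of the per-cell result lists, so only about ``limit`` records are ever
compared. A minimal sketch of the idea (our own illustration, not the
actual Nova multi-cell listing code; it assumes plain dict records and
Python 3.5+ for the ``key=`` argument of heapq.merge):

  import heapq
  import itertools
  import operator

  def merge_cell_results(per_cell_results, sort_keys, limit):
      # Each per-cell list is already sorted by sort_keys by that cell's
      # DB query and is capped at ``limit`` rows.
      keyfunc = operator.itemgetter(*sort_keys)
      # heapq.merge is lazy: it only compares the current heads of the
      # per-cell lists, so islice() stops the work after ``limit`` records.
      merged = heapq.merge(*per_cell_results, key=keyfunc)
      return list(itertools.islice(merged, limit))

  # e.g. merge_cell_results(results_per_cell, ['uuid'], 1000)

The cost of such a merge grows with the limit (and only logarithmically
with the number of cells), which matches what we see in the table.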

We added some audit logging to the code, and from the log we can see:

02:24:53.376705 db begin, nova_cell0

02:24:53.425836 db end, nova_cell0: 0.0487968921661

02:24:53.426622 db begin, nova_cell1

02:24:54.451235 db end, nova_cell1: 1.02400803566

02:24:54.451991 db begin, nova_cell2

02:24:55.715769 db end, nova_cell2: 1.26333093643

02:24:55.716575 db begin, nova_cell3

02:24:56.963428 db end, nova_cell3: 1.24626398087

02:24:56.964202 db begin, nova_cell4

02:24:57.980187 db end, nova_cell4: 1.01546406746

02:24:57.980970 db begin, nova_cell5

02:24:59.279139 db end, nova_cell5: 1.29762792587

02:24:59.279904 db begin, nova_cell6

02:25:00.311717 db end, nova_cell6: 1.03130197525

02:25:00.312427 db begin, nova_cell7

02:25:01.654819 db end, nova_cell7: 1.34187483788

02:25:01.655643 db begin, nova_cell8

02:25:02.689731 db end, nova_cell8: 1.03352093697

02:25:02.690502 db begin, nova_cell9

02:25:04.076885 db end, nova_cell9: 1.38588285446


Yes, the per-cell DB queries run serially. After some investigation, it
turned out that we are unable to perform eventlet.monkey_patch() when
running in uWSGI mode, so Yikun made this fix:

https://review.openstack.org/#/c/592285/
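
For context, once monkey patching takes effect the per-cell queries can be
overlapped with green threads, roughly along these lines (a minimal sketch
of the scatter-gather pattern, not the code from the review;
``query_one_cell`` is a hypothetical helper standing in for "open a session
to this cell's DB and fetch up to ``limit`` sorted rows"):

  import eventlet
  # Without this, the DB driver's blocking socket calls serialize the whole
  # process; the review linked above addresses patching under uWSGI.
  eventlet.monkey_patch()

  def gather_from_cells(cell_names, query_one_cell, limit):
      pool = eventlet.GreenPool()
      results = {}

      def _query(cell):
          results[cell] = query_one_cell(cell, limit)

      for cell in cell_names:
          pool.spawn_n(_query, cell)
      pool.waitall()  # wait for every per-cell query to finish
      return results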


After applying this change we tested again; for the 10-cell, limit 1000,
default sort case from the table above, we now get (times in seconds):

                      Total     Collect   Sort     View
before monkey_patch   13.5745   11.7012   1.1511   0.5966
after monkey_patch    12.8367   10.5471   1.5642   0.6041

The performance improved a little, and from the log we can see:

Aug 16 02:14:46.383081 begin detail api

Aug 16 02:14:46.406766 begin cell gather begin

Aug 16 02:14:46.419346 db begin, nova_cell0

Aug 16 02:14:46.425065 db begin, nova_cell1

Aug 16 02:14:46.430151 db begin, nova_cell2

Aug 16 02:14:46.435012 db begin, nova_cell3

Aug 16 02:14:46.440634 db begin, nova_cell4

Aug 16 02:14:46.446191 db begin, nova_cell5

Aug 16 02:14:46.450749 db begin, nova_cell6

Aug 16 02:14:46.455461 db begin, nova_cell7

Aug 16 02:14:46.459959 db begin, nova_cell8

Aug 16 02:14:46.466066 db begin, nova_cell9

Aug 16 02:14:46.470550 db begin, ova_cell10

Aug 16 02:14:46.731882 db end, nova_cell0: 0.311906099319

Aug 16 02:14:52.667791 db end, nova_cell5: 6.22100400925

Aug 16 02:14:54.065655 db end, nova_cell1: 7.63998198509

Aug 16 02:14:54.939856 db end, nova_cell3: 8.50425100327

Aug 16 02:14:55.309017 db end, nova_cell6: 8.85762405396

Aug 16 02:14:55.309623 db end, nova_cell8: 8.84928393364

Aug 16 02:14:55.310240 db end, nova_cell2: 8.87976694107

Aug 16 02:14:56.057487 db end, ova_cell10: 9.58636116982

Aug 16 02:14:56.058001 db end, nova_cell4: 9.61698698997

Aug 16 02:14:56.058547 db end, nova_cell9: 9.59216403961

Aug 16 02:14:56.954209 db end, nova_cell7: 10.4981210232

Aug 16 02:14:56.954665 end cell gather end: 10.5480799675

Aug 16 02:14:56.955010 begin heaq.merge

Aug 16 02:14:58.527040 end heaq.merge: 1.57150006294


So the queries now start in parallel, but each individual query takes much
longer than before (roughly 8-10 seconds instead of 1-1.4 seconds), so the
gathering still behaves almost serially overall and the total time barely
improves.


We tried adjusting database settings such as max_thread_pool and use_tpool,
and we also tried using a separate DB for some of the cells, but the
results showed no big difference.


So, the above is what we have so far; feel free to ping us if you have any
questions or suggestions.


BR,


Zhenyu Zheng