[openstack-dev] [neutron] CI jobs take pretty long, can we improve that?

Assaf Muller assaf at redhat.com
Tue Mar 22 14:46:41 UTC 2016


On Tue, Mar 22, 2016 at 9:31 AM, Kevin Benton <kevin at benton.pub> wrote:
> Thanks for doing this. I dug into the test_volume_boot_pattern test to see
> what was going on.
>
> On the first boot, Nova called Neutron to create a port at 23:29:44 and it
> took 441ms to return the port to Nova.[1]
> Nova then plugged the interface for that port into OVS a little over 6
> seconds later at 23:29:50.[2]
> The Neutron agent attempted to process this on the iteration at 23:29:52
> [3]; however, it didn't get the ofport populated from the OVSDB monitor... a
> bug![4] The Neutron agent did catch it on the next iteration two seconds
> later on a retry and notified the Neutron server at 23:29:54.[5]
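>
> To make that retry concrete, here is a rough Python sketch of the
> agent-side behavior described above (illustrative only -- not the actual
> ovs-agent code or the patch for the bug; _bind_port is a made-up helper):
>
>     # Hypothetical sketch: requeue ports whose ofport the OVSDB monitor
>     # has not populated yet, so the next polling iteration retries them.
>     def treat_devices_added(agent, devices):
>         skipped = set()
>         for device in devices:
>             port = agent.int_br.get_vif_port_by_id(device)
>             if port is None or port.ofport < 0:
>                 # ofport missing -> retry on the next polling iteration
>                 skipped.add(device)
>                 continue
>             agent._bind_port(port)
>         return skipped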

Good work as usual, Kevin. I just approved the fix for this bug.

> The Neutron server processed the port ACTIVE change in just under 80ms[6],
> but it did not dispatch the notification to Nova until 2 seconds later at
> 23:29:56 [7] due to the Nova notification batching mechanism[8].
>
> Total time between port create and boot is about 12 seconds. 6 in Nova and 6
> in Neutron.
>
> For the Neutron side, the bug fix should eliminate 2 seconds. We could
> probably make the Nova notifier batching mechanism a little more aggressive
> so it only batches up calls arriving within a very short interval rather
> than making 2-second buckets at all times. The remaining 2 seconds is just
> the agent processing loop interval, which can be tuned via a config option,
> but it should be okay if that's the only bottleneck.
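>
> As a rough sketch of what "more aggressive batching" could look like
> (hypothetical, not the current neutron/notifiers/nova.py code): flush a
> batch shortly after the last event arrives instead of on a fixed 2-second
> timer.
>
>     import threading
>
>     class DebouncedNotifier(object):
>         """Hypothetical sketch: batch events, flush after a short quiet gap."""
>
>         def __init__(self, send_fn, window=0.2):
>             self._send = send_fn    # e.g. the batched call to Nova's API
>             self._window = window   # flush 200ms after the last event
>             self._events = []
>             self._timer = None
>             self._lock = threading.Lock()
>
>         def queue_event(self, event):
>             with self._lock:
>                 self._events.append(event)
>                 if self._timer is not None:
>                     self._timer.cancel()
>                 self._timer = threading.Timer(self._window, self._flush)
>                 self._timer.start()
>
>         def _flush(self):
>             with self._lock:
>                 events, self._events = self._events, []
>             if events:
>                 self._send(events)
>
> A real version would also cap the total delay so a steady stream of events
> can't postpone the flush indefinitely.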
>
> For Nova, we need to improve the 6 seconds between creating the Neutron
> port and plugging it into the vswitch. I can see it makes some other
> calls to Neutron in this window to list security groups and floating
> IPs. Maybe these can be done asynchronously, because I don't think they
> should block the initial VM boot phase that plugs in the VIF.
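>
> For illustration, the idea would be to overlap those lookups with the VIF
> plug rather than serializing them -- something like this (hypothetical
> names; the real Nova code paths differ):
>
>     from concurrent.futures import ThreadPoolExecutor
>
>     def plug_vif_with_async_lookups(neutron, port, plug_fn):
>         # Kick off the extra Neutron queries in the background...
>         with ThreadPoolExecutor(max_workers=2) as pool:
>             sg_future = pool.submit(neutron.list_security_groups)
>             fip_future = pool.submit(neutron.list_floatingips)
>             plug_fn(port)                # ...and plug into OVS without waiting
>             sgs = sg_future.result()     # collect the results afterwards
>             fips = fip_future.result()
>         return sgs, fips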
>
> Completely unrelated to the boot process, the entire tempest run spent ~412
> seconds building and destroying Neutron resources in setup and teardown.[9]
> However, considering the number of tests executed, this seems reasonable,
> so I'm not sure we need to work on optimizing that yet.
>
>
> Cheers,
> Kevin Benton
>
>
> 1.
> http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-svc.txt.gz#_2016-03-21_23_29_45_341
> 2.
> http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-n-cpu.txt.gz#_2016-03-21_23_29_50_629
> 3.
> http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-agt.txt.gz#_2016-03-21_23_29_52_216
> 4. https://bugs.launchpad.net/neutron/+bug/1560464
> 5.
> http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-agt.txt.gz#_2016-03-21_23_29_54_738
> 6.
> http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-svc.txt.gz#_2016-03-21_23_29_54_813
> 7.
> http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/logs/screen-q-svc.txt.gz#_2016-03-21_23_29_56_782
> 8.
> http://git.openstack.org/cgit/openstack/neutron/tree/neutron/notifiers/nova.py
> 9. egrep -R 'tearDown|setUp' tempest.txt.gz | grep 9696 | awk '{print $(NF)}' | ./fsum
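> (./fsum is a local helper; presumably it just sums the numbers on stdin,
> which could equally be done with e.g.
> python -c 'import sys; print(sum(float(x) for x in sys.stdin))'.)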
>
> On Mon, Mar 21, 2016 at 5:09 PM, Clark Boylan <cboylan at sapwetik.org> wrote:
>>
>> On Mon, Mar 21, 2016, at 01:23 PM, Sean Dague wrote:
>> > On 03/21/2016 04:09 PM, Clark Boylan wrote:
>> > > On Mon, Mar 21, 2016, at 11:49 AM, Clark Boylan wrote:
>> > >> On Mon, Mar 21, 2016, at 11:08 AM, Clark Boylan wrote:
>> > >>> On Mon, Mar 21, 2016, at 09:32 AM, Armando M. wrote:
>> > >>>> Do you have any better insight into job runtimes vs jobs in other
>> > >>>> projects? Most of the time in the job runtime is actually spent
>> > >>>> setting the infrastructure up, and I am not sure we can do anything
>> > >>>> about it, unless we take this up with Infra.
>> > >>>
>> > >>> I haven't done a comparison yet, but let's break down the runtime of
>> > >>> a recent successful neutron full run against neutron master [0].
>> > >>
>> > >> And now for some comparative data from the gate-tempest-dsvm-full job
>> > >> [0]. This job also ran against a master change that merged and ran in
>> > >> the same cloud and region as the neutron job.
>> > >>
>> > > snip
>> > >> Generally each step of this job was quicker. There were big
>> > >> differences in devstack and tempest run time, though. Is devstack much
>> > >> slower to set up neutron when compared to nova net? For tempest it
>> > >> looks like we run ~1510 tests against neutron and only ~1269 against
>> > >> nova net. This may account for the large difference there. I also
>> > >> recall that we ran ipv6 tempest tests against neutron deployments that
>> > >> were inefficient and booted 2 qemu VMs per test (not sure if that is
>> > >> still the case, but it illustrates that the tests themselves may not
>> > >> be very quick in the neutron case).
>> > >
>> > > Looking at the slowest-tests output for each of these jobs (neutron
>> > > and nova net), some tests line up really well across jobs and others
>> > > do not. In order to get a better handle on the runtime of individual
>> > > tests I have pushed https://review.openstack.org/295487, which will
>> > > run tempest serially, reducing the competition for resources between
>> > > tests.
>> > >
>> > > Hopefully the subunit logs generated by this change can provide more
>> > > insight into where we are losing time during the tempest test runs.
>>
>> The results are in: we have gate-tempest-dsvm-full [0] and
>> gate-tempest-dsvm-neutron-full [1] job results where tempest ran
>> serially to reduce resource contention and provide reasonably accurate
>> per-test timing data. Both of these jobs ran on the same cloud, so they
>> should have comparable performance from the underlying VMs.
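>>
>> (For reference, "serially" here means forcing a single test-runner
>> worker, e.g. something along the lines of "testr run --concurrency=1";
>> the actual change may wire this up differently.)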
>>
>> gate-tempest-dsvm-full
>> Time spent in job before tempest: 700 seconds
>> Time spent running tempest: 2428 seconds
>> Tempest tests run: 1269 (113 skipped)
>>
>> gate-tempest-dsvm-neutron-full
>> Time spent in job before tempest: 789 seconds
>> Time spent running tempest: 4407 seconds
>> Tempest tests run: 1510 (76 skipped)
>>
>> All times above are wall time as recorded by Jenkins.
>>
>> We can also compare the 10 slowest tests in the non-neutron job against
>> their runtimes in the neutron job. (Note that this isn't a list of the
>> top 10 slowest tests in the neutron job, because that job runs extra
>> tests.)
>>
>> nova net job
>>
>> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern  85.232
>> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern  83.319
>> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_volume_backed_instance  50.338
>> tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern  43.494
>> tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario  40.225
>> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance  39.653
>> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV1Test.test_volume_backup_create_get_detailed_list_restore_delete  37.720
>> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV2Test.test_volume_backup_create_get_detailed_list_restore_delete  36.355
>> tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped  27.375
>> tempest.scenario.test_encrypted_cinder_volumes.TestEncryptedCinderVolumes.test_encrypted_cinder_volumes_luks  27.025
>>
>> neutron job
>>
>> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern  110.345
>> tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern  108.170
>> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_volume_backed_instance  63.852
>> tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance  59.931
>> tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern  57.835
>> tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario  49.552
>> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV1Test.test_volume_backup_create_get_detailed_list_restore_delete  40.378
>> tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV2Test.test_volume_backup_create_get_detailed_list_restore_delete  39.088
>> tempest.scenario.test_encrypted_cinder_volumes.TestEncryptedCinderVolumes.test_encrypted_cinder_volumes_luks  35.645
>> tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped  30.551
>>
>> > Subunit logs aren't the full story here. Activity in addCleanup doesn't
>> > get added to the subunit time accounting for the test, which causes some
>> > interesting issues when waiting for resources to delete. I would be
>> > especially cautious of that on some of these.
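>> >
>> > A tiny illustration of that accounting gap (hypothetical test):
>> >
>> >     import time
>> >     import unittest
>> >
>> >     class Demo(unittest.TestCase):
>> >         def test_delete_is_uncounted(self):
>> >             # Stands in for waiting on a resource delete in teardown.
>> >             # The sleep runs after the test's recorded stop time, so
>> >             # subunit charges ~0s to the test even though the worker
>> >             # is busy for 5 more seconds.
>> >             self.addCleanup(time.sleep, 5)
>> >             self.assertTrue(True)  # the test body itself is instant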
>>
>> Based on this, those numbers above may not tell the whole story, but
>> they do seem to tell us that in comparable circumstances neutron is
>> slower than nova net. The sample size is tiny, but again it gives us
>> somewhere to start. What is boot from volume doing in the neutron case
>> that makes it so much slower? Why is shelving so much slower with
>> neutron? And so on.
>>
>> A few seconds here and a few seconds there add up when these operations
>> are repeated a few hundred times. We can probably start to whittle the
>> job runtime down by shaving that extra time off. In any case, I think
>> this is about as far as I can pull this thread; I will let the neutron
>> team take it from here.
>>
>> [0]
>>
>> http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-full/8e64615/console.html
>> [1]
>>
>> http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/console.html
>>


