[openstack-dev] [neutron] CI jobs take pretty long, can we improve that?

Clark Boylan cboylan at sapwetik.org
Tue Mar 22 01:26:45 UTC 2016


On Mon, Mar 21, 2016, at 06:15 PM, Assaf Muller wrote:
> On Mon, Mar 21, 2016 at 8:09 PM, Clark Boylan <cboylan at sapwetik.org>
> wrote:
> > On Mon, Mar 21, 2016, at 01:23 PM, Sean Dague wrote:
> >> On 03/21/2016 04:09 PM, Clark Boylan wrote:
> >> > On Mon, Mar 21, 2016, at 11:49 AM, Clark Boylan wrote:
> >> >> On Mon, Mar 21, 2016, at 11:08 AM, Clark Boylan wrote:
> >> >>> On Mon, Mar 21, 2016, at 09:32 AM, Armando M. wrote:
> >> >>>> Do you have any better insight into job runtimes vs. jobs in other
> >> >>>> projects? Most of the job runtime is actually spent setting up the
> >> >>>> infrastructure, and I am not sure we can do anything about that
> >> >>>> unless we take it up with Infra.
> >> >>>
> >> >>> I haven't done a comparison yet, but let's break down the runtime of
> >> >>> a recent successful neutron full run against neutron master [0].
> >> >>
> >> >> And now for some comparative data from the gate-tempest-dsvm-full job
> >> >> [0]. This job also ran against a master change that merged and ran in
> >> >> the same cloud and region as the neutron job.
> >> >>
> >> > snip
> >> >> Generally each step of this job was quicker. There were big differences
> >> >> in devstack and tempest run time, though. Is devstack much slower to
> >> >> set up neutron when compared to nova net? For tempest it looks like we
> >> >> run ~1510 tests against neutron and only ~1269 against nova net, which
> >> >> may account for the large difference there. I also recall that we run
> >> >> ipv6 tempest tests against neutron deployments that were inefficient,
> >> >> booting 2 qemu VMs per test (not sure if that is still the case, but it
> >> >> illustrates that the tests themselves may not be very quick in the
> >> >> neutron case).
> >> >
> >> > Looking at the tempest slowest-tests output for each of these jobs
> >> > (neutron and nova net), some tests line up really well across jobs
> >> > and others do not. In order to get a better handle on the runtime of
> >> > individual tests I have pushed https://review.openstack.org/295487,
> >> > which will run tempest serially, reducing the competition for
> >> > resources between tests.
> >> >
> >> > Hopefully the subunit logs generated by this change can provide more
> >> > insight into where we are losing time during the tempest test runs.
> >
> > The results are in: we have gate-tempest-dsvm-full [0] and
> > gate-tempest-dsvm-neutron-full [1] job results where tempest ran
> > serially to reduce resource contention and provide accurate-ish per-test
> > timing data. Both of these jobs ran on the same cloud, so they should
> > have comparable performance from the underlying VMs.
> >
> > gate-tempest-dsvm-full
> > Time spent in job before tempest: 700 seconds
> > Time spent running tempest: 2428 seconds
> > Tempest tests run: 1269 (113 skipped)
> >
> > gate-tempest-dsvm-neutron-full
> > Time spent in job before tempest: 789 seconds
> > Time spent running tempest: 4407 seconds
> > Tempest tests run: 1510 (76 skipped)
> >
> > All times above are wall time as recorded by Jenkins. Note that the
> > neutron job runs ~19% more tests but spends ~82% more time in tempest:
> > roughly 2.9 seconds per test on average versus 1.9 for nova net.
> >
> > We can also compare the 10 slowest tests in the non-neutron job against
> > their runtimes in the neutron job. (Note this isn't a list of the 10
> > slowest tests in the neutron job, because that job runs extra tests.)
> >
> > nova net job
> >  85.232  tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern
> >  83.319  tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern
> >  50.338  tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_volume_backed_instance
> >  43.494  tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern
> >  40.225  tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario
> >  39.653  tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance
> >  37.720  tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV1Test.test_volume_backup_create_get_detailed_list_restore_delete
> >  36.355  tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV2Test.test_volume_backup_create_get_detailed_list_restore_delete
> >  27.375  tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped
> >  27.025  tempest.scenario.test_encrypted_cinder_volumes.TestEncryptedCinderVolumes.test_encrypted_cinder_volumes_luks
> >
> > neutron job
> > 110.345  tempest.scenario.test_volume_boot_pattern.TestVolumeBootPatternV2.test_volume_boot_pattern
> > 108.170  tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern
> >  63.852  tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_volume_backed_instance
> >  59.931  tempest.scenario.test_shelve_instance.TestShelveInstance.test_shelve_instance
> >  57.835  tempest.scenario.test_snapshot_pattern.TestSnapshotPattern.test_snapshot_pattern
> >  49.552  tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario
> >  40.378  tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV1Test.test_volume_backup_create_get_detailed_list_restore_delete
> >  39.088  tempest.api.volume.admin.test_volumes_backup.VolumesBackupsV2Test.test_volume_backup_create_get_detailed_list_restore_delete
> >  35.645  tempest.scenario.test_encrypted_cinder_volumes.TestEncryptedCinderVolumes.test_encrypted_cinder_volumes_luks
> >  30.551  tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_resize_server_confirm_from_stopped
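> >
> > Putting the two lists side by side, the per-test slowdown ranges from
> > about 1.1x to 1.5x. A throwaway sketch (plain Python, numbers copied
> > from the lists above) to compute the ratios:
> >
> > # (nova net seconds, neutron seconds), test names abbreviated
> > times = {
> >     'volume_boot_pattern (V2)': (85.232, 110.345),
> >     'volume_boot_pattern': (83.319, 108.170),
> >     'shelve_volume_backed_instance': (50.338, 63.852),
> >     'shelve_instance': (39.653, 59.931),
> >     'snapshot_pattern': (43.494, 57.835),
> >     'minimum_basic_scenario': (40.225, 49.552),
> >     'volume_backup_create_get_detailed_list_restore_delete (V1)': (37.720, 40.378),
> >     'volume_backup_create_get_detailed_list_restore_delete (V2)': (36.355, 39.088),
> >     'encrypted_cinder_volumes_luks': (27.025, 35.645),
> >     'resize_server_confirm_from_stopped': (27.375, 30.551),
> > }
> > for name, (novanet, neutron) in sorted(
> >         times.items(), key=lambda kv: kv[1][1] / kv[1][0], reverse=True):
> >     print('%.2fx  %s' % (neutron / novanet, name))
> >
> > Shelve stands out at ~1.5x, while the cinder backup tests barely move,
> > presumably because they hardly exercise networking.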
> >
> >> Subunit logs aren't the full story here. Activity in addCleanup doesn't
> >> get added to the subunit time accounting for the test, which causes some
> >> interesting issues when waiting for resources to delete. I would be
> >> especially cautious of that on some of these.
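> >>
> >> A toy illustration of that caveat, assuming testtools (the test and
> >> the sleeps are hypothetical): the wait registered with addCleanup runs
> >> after the test body, so per the time accounting described above it
> >> would not show up in the test's subunit runtime.
> >>
> >> import time
> >>
> >> import testtools
> >>
> >>
> >> class CleanupTimingDemo(testtools.TestCase):
> >>     def test_fast_body_slow_cleanup(self):
> >>         # The test body: this second is attributed to the test.
> >>         time.sleep(1)
> >>         # The cleanup runs after the body; per the caveat above,
> >>         # these five seconds would not be attributed to the test.
> >>         self.addCleanup(time.sleep, 5)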
> >
> > Based on this, the numbers above may not tell the whole story, but they
> > do seem to tell us that in comparable circumstances neutron is slower
> > than nova net. The sample size is tiny, but again it gives us somewhere
> > to start. What is boot from volume doing in the neutron case that makes
> > it so much slower? Why is shelving so much slower with neutron? And so
> > on.
> >
> > A few seconds here and a few seconds there add up when these operations
> > are repeated a few hundred times; an extra second per test across ~1500
> > tests is roughly 25 minutes of wall time. We can probably start to
> > whittle the job runtime down by shaving that extra time off. In any
> > case, I think this is about as far as I can pull this thread, and I
> > will let the neutron team take it from here.
> 
> If what we want is to cut down execution time, I'd suggest we stop
> running Cinder tests on Neutron patches (call it an experiment) and
> see how long it takes for a regression to slip in. Being an optimist,
> I would guess: never!

Experience has shown it takes about a week, and that it's not an if but
a when.

> If we're running these tests on Neutron patches solely as a data point
> for performance testing, Tempest is obviously not the tool for the job
> and doesn't provide any added value we can't get from Rally and
> profilers, for example. If there's otherwise value in running Cinder
> (and other tests that don't exercise the Neutron API), I'd love to
> know what it is :) I cannot remember any legit Cinder failure on
> Neutron patches.

I think that is completely the wrong approach to take here. We have
caught a problem in neutron; your goal should be to fix it, not to stop
testing it. The fact that neutron is much slower in these test cases is
an indication that these tests DO exercise the neutron API, that you do
want to cover these code paths, and that you need to address them, not
stop testing them.

We are not running these tests on neutron solely for performance
testing. In fact, to get reasonable performance testing out of it I had
to jump through a few hoops: make tempest run serially, then recheck
until the jobs ran in the same cloud more than once. Performance testing
has never been the goal of these tests. These tests exist to make sure
that OpenStack works. Boot from volume is an important piece of this,
and we are making sure that OpenStack (meaning glance, nova, neutron,
and cinder) continues to work for this use case.
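
For anyone who wants to reproduce the per-test numbers, here is a
minimal sketch of the timing extraction, assuming the python-subunit and
testtools libraries and a saved subunit v2 stream from the job (pass the
stream's file name as the only argument; the file name and layout are
whatever your job archives):

import sys

import subunit
import testtools


class DurationCollector(testtools.StreamResult):
    """Record wall time from 'inprogress' to each test's final status."""

    def __init__(self):
        super(DurationCollector, self).__init__()
        self.starts = {}
        self.durations = {}

    def status(self, test_id=None, test_status=None, timestamp=None,
               **kwargs):
        if test_id is None or timestamp is None:
            return
        if test_status == 'inprogress':
            self.starts[test_id] = timestamp
        elif test_status in ('success', 'fail', 'skip', 'xfail',
                             'uxsuccess', 'exists') and test_id in self.starts:
            delta = timestamp - self.starts[test_id]
            self.durations[test_id] = delta.total_seconds()


with open(sys.argv[1], 'rb') as stream:
    collector = DurationCollector()
    subunit.ByteStreamToStreamResult(stream).run(collector)

# Print the ten slowest tests, slowest first.
for test_id, secs in sorted(collector.durations.items(),
                            key=lambda kv: kv[1], reverse=True)[:10]:
    print('%8.3f %s' % (secs, test_id))

Note Sean's caveat still applies to numbers pulled this way: time spent
in cleanups is not attributed to the test.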

> 
> >
> > [0]
> > http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-full/8e64615/console.html
> > [1]
> > http://logs.openstack.org/87/295487/1/check/gate-tempest-dsvm-neutron-full/5022853/console.html



