[openstack-dev] [Neutron][qa] Parallel testing update
Salvatore Orlando
sorlando at nicira.com
Fri Dec 27 10:09:02 UTC 2013
Hi,
We now have several patches under review which considerably improve how neutron
handles parallel testing.
In a nutshell, these patches try to ensure the ovs agent processes new,
removed, and updated interfaces as soon as possible.
These patches are:
https://review.openstack.org/#/c/61105/
https://review.openstack.org/#/c/61964/
https://review.openstack.org/#/c/63100/
https://review.openstack.org/#/c/63558/
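As an illustration only (this is not code from the patches listed above), the idea of classifying interfaces as new, removed, or updated between two scans can be sketched with plain set arithmetic. The function name diff_ports and the port names are hypothetical, and the real agent tracks considerably more state.

```python
def diff_ports(previous, current, updated=frozenset()):
    """Classify ports by comparing two scans of the bridge.

    previous, current: iterables of port identifiers from two scans.
    updated: identifiers for which an update notification arrived
             (hypothetical input, assumed for this sketch).
    """
    previous = set(previous)
    current = set(current)
    added = current - previous
    removed = previous - current
    # A port needs an update pass only if it still exists and was not
    # just added (added ports are fully processed anyway).
    to_update = (set(updated) & current) - added
    return {"added": added, "removed": removed, "updated": to_update}


if __name__ == "__main__":
    info = diff_ports({"tap1", "tap2"}, {"tap2", "tap3"},
                      updated={"tap2", "tap3", "gone"})
    print(info)
```

Computing all three sets in one pass per loop iteration lets the agent handle each interface once, rather than reacting separately to every notification.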
There is still room for improvement. For instance, the calls from the agent
into the plugins might be considerably reduced.
However, even if the above patches greatly shrink the time required for
processing a device, we are still hitting a hard limit with the execution of
ovs commands for setting local vlan tags and clearing flows (or adding the
flow rule for dropping all the traffic).
In some instances these commands slow down a lot, requiring almost 10
seconds to complete. This adds a delay in interface processing which in
some cases leads to the hideous SSH timeout error (the same one we see with bug
1253896 in normal testing).
It is also worth noting that when this happens sysstat reveals CPU usage is
very close to 100%.
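For anyone wanting to reproduce the slow-command measurements locally, a minimal sketch of a timing wrapper follows. It is an assumption about how one might measure this, not the way the agent itself executes commands; the name timed_run and the slow_threshold parameter are hypothetical. The example runs a harmless Python subprocess; in practice the command list would be whatever ovs-vsctl or ovs-ofctl invocation is being investigated.

```python
import subprocess
import sys
import time


def timed_run(cmd, slow_threshold=1.0):
    """Run cmd and return (returncode, elapsed_seconds, is_slow).

    is_slow is True when the command took at least slow_threshold seconds.
    """
    start = time.monotonic()
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    elapsed = time.monotonic() - start
    return proc.returncode, elapsed, elapsed >= slow_threshold


if __name__ == "__main__":
    # Placeholder command; substitute the ovs command under investigation.
    rc, elapsed, slow = timed_run([sys.executable, "-c", "pass"])
    print("rc=%d elapsed=%.3fs slow=%s" % (rc, elapsed, slow))
```

Logging the elapsed time per command alongside sysstat output would make it easier to correlate the slowdowns with the CPU saturation mentioned above.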
From the neutron side there is little we can do. Introducing parallel
processing for interfaces, as we do for the l3 agent, is not actually a
solution, since ovs-vswitchd v1.4.x, the one executed on gate tests, is not
multithreaded. If you think the situation might be improved by changing the
logic for handling local vlan tags and putting ports on the dead vlan, I
would be happy to talk about that.
On my local machines I've seen a dramatic improvement in processing times
by installing ovs 2.0.0, which has a multi-threaded vswitchd. Is this
something we might consider for gate tests? Also, in order to reduce CPU
usage on the gate (and make tests a bit faster), there is a tempest patch
which stops creating and wiring neutron routers when they're not needed:
https://review.openstack.org/#/c/62962/
Even in my local setup, which succeeds about 85% of the time, I'm still seeing
some occurrences of the issue described in [1], which at the end of the day
seems to be a dnsmasq issue.
Beyond the 'big' structural problem discussed above, there are some minor
problems with a few tests:
1) test_network_quotas.test_create_ports_until_quota_hit fails about 90%
of the time. I think this is because the test itself should be made aware of
parallel execution and asynchronous events, and there is a patch for this
already: https://review.openstack.org/#/c/64217
2) test_attach_interfaces.test_create_list_show_delete_interfaces fails
about 66% of the time. The failure is always on an assertion made after
deletion of interfaces, which probably means the interface is not deleted
within 5 seconds. I think this might be a consequence of the higher load on
the neutron service; we might try to enable multiple workers on the gate
to address this, or just increase the tempest timeout. On a slightly different
note, allow me to say that the way assertions are made in this test might be
improved a bit. So far one has to go through the code to see why the test
failed.
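Both test problems above involve waiting on an asynchronous event and reporting clearly when the wait fails. A minimal sketch of a polling helper with a descriptive failure message is shown below. This is not tempest's actual implementation; the name wait_for and its parameters are hypothetical, and the "interface" here is simulated with a counter.

```python
import time


def wait_for(predicate, timeout=5.0, interval=0.1, description="condition"):
    """Poll predicate until it returns a truthy value or timeout expires.

    Raises AssertionError naming what was being waited for, so a failure
    can be understood without reading the test code.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise AssertionError(
                "Timed out after %.1fs waiting for %s" % (timeout, description))
        time.sleep(interval)


if __name__ == "__main__":
    # Stand-in for an asynchronous delete: the interface is reported
    # as gone on the third poll.
    polls = {"count": 0}

    def interface_gone():
        polls["count"] += 1
        return polls["count"] >= 3

    wait_for(interface_gone, timeout=5.0, interval=0.01,
             description="interface to be deleted")
    print("interface gone after %d polls" % polls["count"])
```

Making the timeout a parameter means it can be raised for heavily loaded gate runs without touching the test logic, and the description string addresses the point about unclear assertion failures.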
Thanks for reading this rather long message.
Regards,
Salvatore
[1] https://lists.launchpad.net/openstack/msg23817.html
On 2 December 2013 22:01, Kyle Mestery (kmestery) <kmestery at cisco.com> wrote:
> Yes, this is all great Salvatore and Armando! Thank you for all of this
> work
> and the explanation behind it all.
>
> Kyle
>
> On Dec 2, 2013, at 2:24 PM, Eugene Nikanorov <enikanorov at mirantis.com>
> wrote:
>
> > Salvatore and Armando, thanks for your great work and detailed
> explanation!
> >
> > Eugene.
> >
> >
> > On Mon, Dec 2, 2013 at 11:48 PM, Joe Gordon <joe.gordon0 at gmail.com>
> wrote:
> >
> > On Dec 2, 2013 9:04 PM, "Salvatore Orlando" <sorlando at nicira.com> wrote:
> > >
> > > Hi,
> > >
> > > As you might have noticed, there has been some progress on parallel
> tests for neutron.
> > > In a nutshell:
> > > * Armando fixed the issue with IP address exhaustion on the public
> network [1]
> > > * Salvatore now has a patch which has a 50% success rate (the last
> failures are because of me playing with it) [2]
> > > * Salvatore is looking at putting back on track full isolation [3]
> > > * All the bugs affecting parallel tests can be queried here [10]
> > > * This blueprint tracks progress made towards enabling parallel
> testing [11]
> > >
> > > ---------
> > > The long story is as follows:
> > > Parallel testing basically is not working because parallelism means
> higher contention for public IP addresses. This was made worse by the fact
> that some tests created a router with a gateway set but never deleted it.
> As a result, there were even fewer addresses in the public range.
> > > [1] was already merged and with [4] we shall make the public network
> for neutron a /24 (the full tempest suite is still showing a lot of IP
> exhaustion errors).
> > >
> > > However, this was just one part of the issue. The biggest part
> actually lay with the OVS agent and its interactions with the ML2 plugin.
> A few patches ([5], [6], [7]) were already pushed to reduce the number of
> notifications sent from the plugin to the agent. However, the agent is
> organised in a way such that a notification is immediately acted upon thus
> preempting the main agent loop, which is the one responsible for wiring
> ports into networks. Considering the high level of notifications currently
> sent from the server, this becomes particularly wasteful if one considers
> that security membership updates for ports trigger global
> iptables-save/restore commands which are often executed in rapid
> succession, thus resulting in long delays for wiring VIFs to the
> appropriate network.
> > > With the patch [2] we are refactoring the agent to make it more
> efficient. This is not production code, but once we get close to 100%
> pass for parallel testing this patch will be split into several patches,
> properly structured, and hopefully easy to review.
> > > It is worth noting there is still work to do: in some cases the loop
> still takes too long, and ovs commands have been observed taking as long as 10
> seconds to complete. To address this, it is worth considering use of async
> processes introduced in [8] as well as leveraging ovsdb monitoring [9] for
> limiting queries to ovs database.
> > > We're still unable to explain some failures where the network appears
> to be correctly wired (floating IP, router port, dhcp port, and VIF port),
> but the SSH connection fails. We're hoping to reproduce this failure pattern
> locally.
> > >
> > > Finally, the tempest patch for full tempest isolation should be made
> usable soon. Having another experimental job for it is something worth
> considering as for some reason it is not always easy reproducing the same
> failure modes exhibited on the gate.
> > >
> > > Regards,
> > > Salvatore
> > >
> >
> > Awesome work, thanks for the update.
> >
> >
> > > [1] https://review.openstack.org/#/c/58054/
> > > [2] https://review.openstack.org/#/c/57420/
> > > [3] https://review.openstack.org/#/c/53459/
> > > [4] https://review.openstack.org/#/c/58284/
> > > [5] https://review.openstack.org/#/c/58860/
> > > [6] https://review.openstack.org/#/c/58597/
> > > [7] https://review.openstack.org/#/c/58415/
> > > [8] https://review.openstack.org/#/c/45676/
> > > [9] https://bugs.launchpad.net/neutron/+bug/1177973
> > > [10] https://bugs.launchpad.net/neutron/+bugs?field.tag=neutron-parallel&field.tags_combinator=ANY
> > > [11] https://blueprints.launchpad.net/neutron/+spec/neutron-tempest-parallel
> > >
> > > _______________________________________________
> > > OpenStack-dev mailing list
> > > OpenStack-dev at lists.openstack.org
> > > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> > >