[openstack-dev] [Neutron][qa] Parallel testing update
Kyle Mestery
mestery at siliconloons.com
Thu Jan 2 18:53:05 UTC 2014
Thanks for the updates here Salvatore, and for continuing to push on
this! This is all great work!
On Jan 2, 2014, at 6:57 AM, Salvatore Orlando <sorlando at nicira.com> wrote:
>
> Hi again,
>
> I've now run the experimental job a good number of times, and I've filed bugs for all the issues which came up.
> Most of them occurred no more than once across all test executions (about 30, I think).
>
> They're all tagged with neutron-parallel [1]. For ease of tracking, I've associated all the bug reports with neutron, but some are probably tempest or nova issues instead.
>
> Salvatore
>
> [1] https://bugs.launchpad.net/neutron/+bugs?field.tag=neutron-parallel
>
>
> On 27 December 2013 11:09, Salvatore Orlando <sorlando at nicira.com> wrote:
> Hi,
>
> We now have several patches under review which significantly improve how neutron handles parallel testing.
> In a nutshell, these patches try to ensure the ovs agent processes new, removed, and updated interfaces as soon as possible.
>
> These patches are:
> https://review.openstack.org/#/c/61105/
> https://review.openstack.org/#/c/61964/
> https://review.openstack.org/#/c/63100/
> https://review.openstack.org/#/c/63558/
>
> There is still room for improvement. For instance, the calls from the agent into the plugins might be considerably reduced.
> However, even if the above patches greatly shrink the time required for processing a device, we are still hitting a hard limit with the execution of the ovs commands for setting local vlan tags and clearing flows (or adding the flow rule for dropping all the traffic).
> In some instances these commands slow down a lot, requiring almost 10 seconds to complete. This adds a delay in interface processing which in some cases leads to the hideous SSH timeout error (the same we see with bug 1253896 in normal testing).
> It is also worth noting that when this happens sysstat reveals that CPU usage is very close to 100%.
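
To see which commands are slowing down, one option is to time each external command as it runs. The following is only a minimal sketch, not the agent's code: `timed_run` and `slow_threshold` are hypothetical names, and the command in the demo is a harmless stand-in rather than a real ovs invocation.

```python
import subprocess
import sys
import time


def timed_run(cmd, slow_threshold=1.0):
    """Run cmd and return (elapsed_seconds, is_slow).

    Hypothetical helper: slow_threshold is an arbitrary cut-off in
    seconds, chosen here only for illustration.
    """
    start = time.monotonic()
    subprocess.run(
        cmd,
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    elapsed = time.monotonic() - start
    return elapsed, elapsed > slow_threshold


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; on a test node the
    # list would hold the ovs command being investigated.
    elapsed, slow = timed_run([sys.executable, "-c", "pass"])
    print("elapsed=%.3fs slow=%s" % (elapsed, slow))
```

Logging the elapsed time next to the sysstat CPU figures would make it easier to tell whether the slow runs coincide with the CPU being saturated.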
>
> From the neutron side there is little we can do. Introducing parallel processing for interfaces, as we do for the l3 agent, is not actually a solution, since ovs-vswitchd v1.4.x, the version used in gate tests, is not multithreaded. If you think the situation might be improved by changing the logic for handling local vlan tags and putting ports on the dead vlan, I would be happy to talk about that.
> On my local machines I've seen a dramatic improvement in processing times by installing ovs 2.0.0, which has a multi-threaded vswitchd. Is this something we might consider for gate tests? Also, in order to reduce CPU usage on the gate (and make tests a bit faster), there is a tempest patch which stops creating and wiring neutron routers when they're not needed: https://review.openstack.org/#/c/62962/
>
> Even in my local setup, which succeeds about 85% of the time, I'm still seeing some occurrences of the issue described in [1], which at the end of the day seems to be a dnsmasq issue.
>
> Beyond the 'big' structural problem discussed above, there are some minor problems with a few tests:
>
> 1) test_network_quotas.test_create_ports_until_quota_hit fails about 90% of the time. I think this is because the test itself should be made aware of parallel execution and asynchronous events, and there is a patch for this already: https://review.openstack.org/#/c/64217
>
> 2) test_attach_interfaces.test_create_list_show_delete_interfaces fails about 66% of the time. The failure is always on an assertion made after the interfaces are deleted, which probably means the interface is not deleted within 5 seconds. I think this might be a consequence of the higher load on the neutron service; to address it we might try enabling multiple workers on the gate, or just increase the tempest timeout. On a slightly different note, allow me to say that the way assertions are made in this test could be improved a bit: at the moment one has to go through the code to see why the test failed.
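
The two points in item 2 (waiting for an asynchronous deletion, and assertions that explain themselves) can be illustrated with a small polling helper. This is a minimal sketch under stated assumptions, not the tempest code: `wait_until`, its parameters, and the fake interface list in the demo are all hypothetical.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.5):
    """Poll predicate() until it returns True or timeout seconds pass.

    Hypothetical helper. Returns True if the condition was met and
    False if the deadline expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Fake list of attached interfaces that shrinks by one entry on
    # every poll, standing in for an asynchronous deletion.
    remaining = ["iface-a", "iface-b"]

    def all_deleted():
        if remaining:
            remaining.pop()
        return not remaining

    ok = wait_until(all_deleted, timeout=5.0, interval=0.1)
    # A message that names what is left makes a failure readable
    # without going through the test code.
    assert ok, "interfaces still attached after timeout: %s" % remaining
    print("all interfaces deleted")
```

With a helper of this kind the timeout becomes a single parameter that can be raised for loaded gate nodes, and the failure message states what was still present when the wait expired.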
>
> Thanks for reading this rather long message.
> Regards,
> Salvatore
>
> [1] https://lists.launchpad.net/openstack/msg23817.html
>
>
>
>
> On 2 December 2013 22:01, Kyle Mestery (kmestery) <kmestery at cisco.com> wrote:
> Yes, this is all great Salvatore and Armando! Thank you for all of this work
> and the explanation behind it all.
>
> Kyle
>
> On Dec 2, 2013, at 2:24 PM, Eugene Nikanorov <enikanorov at mirantis.com> wrote:
>
> > Salvatore and Armando, thanks for your great work and detailed explanation!
> >
> > Eugene.
> >
> >
> > On Mon, Dec 2, 2013 at 11:48 PM, Joe Gordon <joe.gordon0 at gmail.com> wrote:
> >
> > On Dec 2, 2013 9:04 PM, "Salvatore Orlando" <sorlando at nicira.com> wrote:
> > >
> > > Hi,
> > >
> > > As you might have noticed, there has been some progress on parallel tests for neutron.
> > > In a nutshell:
> > > * Armando fixed the issue with IP address exhaustion on the public network [1]
> > > * Salvatore now has a patch which has a 50% success rate (the last failures are because of me playing with it) [2]
> > > * Salvatore is looking at putting back on track full isolation [3]
> > > * All the bugs affecting parallel tests can be queried here [10]
> > > * This blueprint tracks progress made towards enabling parallel testing [11]
> > >