Hi, see below.

----- Original Message -----
Retrying the ssh connection on all ssh exceptions may help. It is possible that the ssh server causes this type of exception while the key or the ssh service is still being configured by cloud-init.
First, the tests don't use cloud-init based images to start new nova instances. Cirros images use a similar but different service to set the instance up. See: http://bazaar.launchpad.net/~smoser/cirros/trunk/view/head:/src/sbin/cirros-... The fix in question is for neutron-metadata-agent, and it was not hit by any requests from the new instance created by tempest, meaning the instance either failed to run or the network connection was not properly established. The nova-api log shows that the state of the new nova instance is polled for some time (~6 minutes), but its port stays in the DOWN state.
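(For illustration only: a minimal sketch of the kind of check described above, i.e. polling Neutron for the instance's ports and watching whether they ever leave DOWN. The endpoint, token and helper name are placeholders, not tempest code.)

# Minimal sketch, not tempest code: poll the Neutron API for the ports bound
# to a new instance and report whether they ever leave the DOWN state.
# NEUTRON_URL, TOKEN and wait_for_ports_active are placeholders.
import time
import requests

NEUTRON_URL = "http://controller:9696/v2.0"  # placeholder endpoint
TOKEN = "<keystone-token>"                   # placeholder auth token


def wait_for_ports_active(device_id, timeout=360, interval=5):
    """Return True if every port bound to device_id becomes ACTIVE in time."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(NEUTRON_URL + "/ports",
                            params={"device_id": device_id},
                            headers={"X-Auth-Token": TOKEN})
        resp.raise_for_status()
        ports = resp.json().get("ports", [])
        if ports and all(p["status"] == "ACTIVE" for p in ports):
            return True
        time.sleep(interval)
    return False  # ports stayed DOWN for the whole window (~6 minutes above)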
It can also hide a temporary network black hole issue.
The instance is created at ~00:59:??, and the test fails at ~01:06:??, so it's hardly a temporary issue.
These are not scientifically proven things, but see https://review.openstack.org/#/c/73186/.
NOTE: We have been using the same ssh code to make the connection in nova-network jobs for a long time.
This review catches another exception type (SSHException). Does that mean that if this were our issue, we would see SSHException tracebacks in the tempest log? There is no such thing there.
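(For context, a rough sketch of the retry pattern under discussion, assuming paramiko; connect_with_retry is an illustrative name, not tempest's actual ssh.py API. The idea would be to retry on paramiko's SSHException in addition to plain socket errors.)

# Rough sketch, assuming paramiko; not tempest's actual ssh.py.
# Retry the connection on SSHException as well as on socket errors.
import socket
import time
import paramiko


def connect_with_retry(host, username, pkey, timeout=300, interval=5):
    """Keep trying to open an ssh session until the timeout is reached."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    deadline = time.time() + timeout
    while True:
        try:
            client.connect(host, username=username, pkey=pkey,
                           timeout=interval)
            return client
        except (socket.error, paramiko.SSHException):
            # SSHException would cover e.g. auth failures seen while the
            # guest is still writing the key during its init phase.
            if time.time() > deadline:
                raise
            time.sleep(interval)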
The other mentioned changes probably do not have an impact on stability; they mainly improve the logging of the failures.
Commit 9f756a081533b55f212221ea5de8ed968acea273 and the following patches might decrease the load on the l3 agent, but they would be more difficult to backport.
I do not remember anything else in tempest that may help make the stable/havana neutron jobs more stable.
There was also a bug in file injection into a new instance in the gate that made ssh sessions fail. Something related to guestfs, but I don't know all the details. Adding Russel to Cc since he may have more info on this.
Best Regards, Attila
----- Original Message -----
From: "Alan Pevec" <apevec@gmail.com> To: "Gary Kotton" <gkotton@vmware.com>, "Attila Fazekas" <afazekas@redhat.com>, "Joe Gordon" <joe.gordon0@gmail.com>, "David Kranz" <dkranz@redhat.com>, mtreinish@kortar.org, "Sean Dague" <sean@dague.net> Cc: "openstack-stable-maint" <openstack-stable-maint@lists.openstack.org> Sent: Wednesday, February 12, 2014 11:44:58 PM Subject: Re: [Openstack-stable-maint] 2013.2.2 exception requests
Copying the authors of the tempest patches referenced below, plus a few Tempest core members who might be interested.
https://review.openstack.org/#/c/72754/
That's a good candidate for an exception, and I see Neutron stable-maint members have already approved it, but it's failing the *-isolated gate jobs. I'll try throwing the dice a few more times, but could someone familiar have a look? What are those jobs doing?
Ihar commented in the review: "I suspect tempest lacks some of those ssh.py fixes from master: c3128c085c2635d82c4909d1be5d016df4978632, ad7ef7d1bdd98045639ee4045144c8fe52853e76, 31a91a605a25f578b51a7bed2df8fde5c5f49ffc. I'm not sure this would be enough to stabilize gate though."
Gary, Attila, Joe - would you like to backport your patches to stable/havana Tempest? Do you agree they should improve gate stability, and is there anything else to be backported to stabilize the *-isolated gate jobs?
Thanks, Alan