[openstack-dev] Our New Weekly(ish) Test Status Report

Matthew Treinish mtreinish at kortar.org
Thu Mar 2 15:20:09 UTC 2017


On Tue, Feb 28, 2017 at 11:49:53AM -0500, Matthew Treinish wrote:
> Hello,
> 
> We have a few particularly annoying bugs that have been impacting the
> reliability of gate testing recently. It would be great if we could get
> volunteers to look at these bugs to improve the reliability of our testing as we
> start working on Pike.
> 
> These two issues have been identified by elastic-recheck as being our biggest
> problems:
> 
> 1. SSH Banner bug http://status.openstack.org/elastic-recheck/#1349617
> 
> This bug is a longstanding issue that comes and goes and also has lots of very
> similar (but subtly different) failure modes. Tempest attempts to ssh into the
> cirros guest and is unable to after 18 attempts over the 300 sec timeout window
> and fails to login. Paramiko reports that there was an issue reading the banner
> returned on port 22 from the guest. This indicates that something is likely
> responding on port 22. We're working on trying to get more details on what is
> the cause here with:
> 
> https://review.openstack.org/437128

We've been doing some more debugging on this issue and have made some progress
getting to the bottom of the bug. Jens Rosenboom figured out that the banner
errors are actually caused by tempest leaking ssh connections (via
paramiko) on auth failures. Dropbear is set to allow only 5 unauthorized
connections per IP address, which tempest would trip after 5 failed login
attempts. [1] After that, Dropbear would just close the socket on login attempt
6, which would cause the banner error. We addressed this in tempest with:

https://review.openstack.org/439638 

Since that merged, we haven't seen the banner failure signature anymore, but
it still hasn't solved our ssh connectivity issues. Tempest still isn't able to
log in to the guest and fails with an auth error. Kevin Benton has been looking
into this with:

https://bugs.launchpad.net/nova/+bug/1668958

and we're now tracking the actual failure signature (which only appears after
the tempest fix merged):

http://status.openstack.org/elastic-recheck/gate.html#1668958


The work here is ongoing, but we made enough progress to change the
elastic-recheck signature, so I figured an update was warranted.
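
For anyone who wants to see the connection-limit behaviour described above in
isolation, here is a minimal toy model of it in Python. It is only a sketch:
FakeDropbear, FakeConnection and try_logins are made-up names for illustration,
not Dropbear, paramiko or tempest code, and authentication is hard-wired to
fail. It just shows why leaking a connection on every auth failure turns login
attempt 6 into a banner error, while closing the connection on failure keeps
every attempt an auth error.

```python
class FakeConnection:
    """Hypothetical stand-in for one client connection to the guest."""

    def __init__(self, server, ip):
        self.server = server
        self.ip = ip
        self.is_open = True

    def authenticate(self):
        # Assumption for this toy: auth always fails, mirroring the
        # still-unresolved auth error described in this thread.
        raise PermissionError("authentication failed")

    def close(self):
        # Closing frees one unauthenticated slot for this client IP.
        if self.is_open:
            self.is_open = False
            self.server.unauth[self.ip] -= 1


class FakeDropbear:
    """Toy model of a per-IP cap on unauthenticated connections."""

    def __init__(self, max_unauth_per_ip=5):
        self.max_unauth_per_ip = max_unauth_per_ip
        self.unauth = {}  # ip -> open, unauthenticated connections

    def connect(self, ip):
        if self.unauth.get(ip, 0) >= self.max_unauth_per_ip:
            # The socket is dropped before any banner is sent.
            raise ConnectionError("socket closed before banner")
        self.unauth[ip] = self.unauth.get(ip, 0) + 1
        return FakeConnection(self, ip)


def try_logins(server, ip, attempts, close_on_failure):
    """Attempt to log in `attempts` times and record what each attempt saw."""
    outcomes = []
    for _ in range(attempts):
        try:
            conn = server.connect(ip)
        except ConnectionError:
            outcomes.append("banner error")
            continue
        try:
            conn.authenticate()
        except PermissionError:
            outcomes.append("auth error")
            if close_on_failure:
                conn.close()  # the fix: do not leak the connection
    return outcomes


if __name__ == "__main__":
    # Leaking connections: the cap is reached after 5 failed logins.
    print(try_logins(FakeDropbear(), "10.0.0.1", 7, close_on_failure=False))
    # Closing on failure: the cap is never reached.
    print(try_logins(FakeDropbear(), "10.0.0.1", 7, close_on_failure=True))
```

With the leak, the first five attempts report an auth error and the remaining
ones report a banner error; with close_on_failure=True all seven report an auth
error, which matches the change in failure signature we saw after the fix.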

Thanks,

Matt Treinish

[1] https://bugs.launchpad.net/nova/+bug/1668958/comments/4




> 
> 2. Libvirt crashes: http://status.openstack.org/elastic-recheck/#1643911 and
> http://status.openstack.org/elastic-recheck/#1646779
> 
> Libvirt is randomly crashing during the job which causes things to fail (for
> obvious reasons). To address this will likely require someone with experience
> debugging libvirt since it's most likely a bug isolated to libvirt. Tonyb has
> offered to start working on this so talk to him to coordinate efforts around
> fixing this.
> 
> The other thing to note is the oom-killer bug:
> http://status.openstack.org/elastic-recheck/gate.html#1656386
> While there aren't a lot of hits in logstash for this particular bug, it does
> raise an important issue about the increased memory pressure on the test
> nodes. It's likely that a lot of the instability is related to the increased
> load on the nodes. As a starting point, all projects should look at their
> memory footprint and see where they can trim things to improve the situation.
> 
> As a friendly reminder we do track bug rate incidence within our testing using
> the elastic-recheck tool. You can find that data at
> http://status.openstack.org/elastic-recheck. It can be quite useful to start
> there when determining which bugs to fix based on impact. Elastic recheck also
> maintains a list of failures that occurred without a known signature:
> http://status.openstack.org/elastic-recheck/data/integrated_gate.html
> 
> We also need some people to help maintain the list of existing queries. We have
> a lot of queries for closed bugs that have no hits, and others that are overly
> broad and match failures unrelated to the bug. This would also be a
> good task for a new person to start getting involved with. Feel free to submit
> patches to:
> https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries to
> track new issues.
> 
> Thank you,
> 
> mtreinish and clarkb