[openstack-dev] Top Gate Bugs

Matt Riedemann mriedem at linux.vnet.ibm.com
Fri Dec 6 16:28:46 UTC 2013



On Wednesday, December 04, 2013 7:22:23 AM, Joe Gordon wrote:
> TL;DR: Gate is failing 23% of the time due to bugs in nova, neutron
> and tempest. We need help fixing these bugs.
>
>
> Hi All,
>
> Before going any further, we have a bug that is affecting both the
> gate and stable, so it's getting top priority here. elastic-recheck
> currently doesn't track unit tests because we don't expect them to
> fail very often. It turns out that assessment was wrong: we now have
> a nova py27 unit test bug in both the trunk and stable gates.
>
> https://bugs.launchpad.net/nova/+bug/1216851
> Title: nova unit tests occasionally fail migration tests for mysql and
> postgres
> Hits
>   FAILURE: 74
> The failures appear multiple times for a single job, and some of
> those are due to bad patches in the check queue. But this is being
> seen in the stable and trunk gates, so something is definitely wrong.
>
> =======
>
>
> It's time for another edition of 'Top Gate Bugs.' I am sending this
> out now because, in addition to our usual gate bugs, a few new ones
> have cropped up recently, and as we saw a few weeks ago, it doesn't
> take very many new bugs to wedge the gate.
>
> Currently the gate has a failure rate of at least 23%! [0]
>
> Note: this email was generated with
> http://status.openstack.org/elastic-recheck/ and
> 'elastic-recheck-success' [1]
>
> 1) https://bugs.launchpad.net/bugs/1253896
> Title: test_minimum_basic_scenario fails with SSHException: Error
> reading SSH protocol banner
> Projects:  neutron, nova, tempest
> Hits
>   FAILURE: 324
> This one has been around for several weeks now, and although we have
> made some attempts at fixing it, we aren't any closer to resolving it
> than we were a few weeks ago.
>
> 2) https://bugs.launchpad.net/bugs/1251448
> Title: BadRequest: Multiple possible networks found, use a Network ID
> to be more specific.
> Project: neutron
> Hits
>   FAILURE: 141
>
> 3) https://bugs.launchpad.net/bugs/1249065
> Title: Tempest failure: tempest/scenario/test_snapshot_pattern.py
> Project: nova
> Hits
>   FAILURE: 112
> This is a bug in nova's neutron code.
>
> 4) https://bugs.launchpad.net/bugs/1250168
> Title: gate-tempest-devstack-vm-neutron-large-ops is failing
> Projects: neutron, nova
> Hits
>   FAILURE: 94
> This is an old bug that was fixed but came back on December 3rd, so
> this is a recent regression. This may be an infra issue.
>
> 5) https://bugs.launchpad.net/bugs/1210483
> Title: ServerAddressesTestXML.test_list_server_addresses FAIL
> Projects: neutron, nova
> Hits
>   FAILURE: 73
> There have been some attempts at fixing this, but it's still around.
>
>
> In addition to the existing bugs, we have some new bugs on the rise:
>
> 1) https://bugs.launchpad.net/bugs/1257626
> Title: Timeout while waiting on RPC response - topic: "network", RPC
> method: "allocate_for_instance" info: "<unknown>"
> Project: nova
> Hits
>   FAILURE: 52
> This is a large-ops-only bug. It has been around for at least two
> weeks, but we have seen it in higher numbers starting around December
> 3rd. This may be an infrastructure issue, as the neutron-large-ops
> job started failing more around the same time.
>
> 2) https://bugs.launchpad.net/bugs/1257641
> Title: Quota exceeded for instances: Requested 1, but already used 10
> of 10 instances
> Projects: nova, tempest
> Hits
>   FAILURE: 41
> Like the previous bug, this has been around for at least two weeks but
> appears to be on the rise.
>
>
>
> Raw Data: http://paste.openstack.org/show/54419/
>
>
> best,
> Joe
>
>
> [0] failure rate = 1-(success rate gate-tempest-dsvm-neutron)*(success
> rate ...) * ...
>
> gate-tempest-dsvm-neutron = 0.00
> gate-tempest-dsvm-neutron-large-ops = 11.11
> gate-tempest-dsvm-full = 11.11
> gate-tempest-dsvm-large-ops = 4.55
> gate-tempest-dsvm-postgres-full = 10.00
> gate-grenade-dsvm = 0.00
>
> (I hope I got the math right here)
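
(A quick aside on the math in [0]: assuming the numbers above are
per-job failure percentages - my assumption about the units, it isn't
stated explicitly - this is how I read the formula in code:)

    # Combined gate failure rate, assuming the listed numbers are
    # per-job failure percentages.
    job_failure_pct = {
        'gate-tempest-dsvm-neutron': 0.00,
        'gate-tempest-dsvm-neutron-large-ops': 11.11,
        'gate-tempest-dsvm-full': 11.11,
        'gate-tempest-dsvm-large-ops': 4.55,
        'gate-tempest-dsvm-postgres-full': 10.00,
        'gate-grenade-dsvm': 0.00,
    }

    success = 1.0
    for pct in job_failure_pct.values():
        success *= (1.0 - pct / 100.0)

    print('combined failure rate: %.1f%%' % ((1.0 - success) * 100.0))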
>
> [1]
> http://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/elastic_recheck/cmd/check_success.py
>
>

Let's add bug 1257644 [1] to the list. I'm pretty sure this is due to
some recent code [2][3] in the nova libvirt driver that automatically
disables the host when the libvirt connection drops.
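
In case it's useful context, a rough sketch of the behavior I think
those patches introduce is below. The names are made up for
illustration; the real logic is in the reviews above.

    # Simplified illustration only - not the actual driver code.
    import logging

    logging.basicConfig(level=logging.INFO)
    LOG = logging.getLogger('sketch')

    def set_host_enabled(enabled, reason=None):
        # In nova this would update the compute service record so the
        # scheduler stops (or resumes) considering this host.
        LOG.info('compute host enabled=%s reason=%s', enabled, reason)

    def on_connection_closed(reason):
        # Invoked when the connection to libvirtd drops; the driver
        # takes the host out of scheduling until it reconnects.
        set_host_enabled(False,
                         reason='libvirt connection closed: %s' % reason)

    on_connection_closed('keepalive timeout')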

Joe said there was a known issue with libvirt connection failures, so
this could be duped against that, but I'm not sure where/what that one
is - maybe bug 1254872 [4]?

Unless I just don't understand the code, there is some funny logic
going on in the libvirt driver when it automatically disables a host,
which I've documented in bug 1257644. It would help to have some
libvirt-minded people, or the authors/approvers of those patches, look
at that.

Also, does anyone know if libvirt will pass a 'reason' string to the
_close_callback function? I was digging through the libvirt code this
morning but couldn't figure out where the callback is actually called
or with what parameters. The code in nova seems to be based on the
patch that danpb had in libvirt [5].
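
For reference, registering the close callback via the python-libvirt
bindings looks roughly like the sketch below. From my reading of the
bindings, the callback seems to get an integer reason code (one of the
VIR_CONNECT_CLOSE_REASON_* constants) rather than a string, but I'd
appreciate someone who knows libvirt confirming that.

    # Minimal sketch based on my reading of the python-libvirt bindings.
    import libvirt

    def close_callback(conn, reason, opaque):
        # 'reason' appears to be an int (VIR_CONNECT_CLOSE_REASON_ERROR,
        # _EOF, _KEEPALIVE or _CLIENT), not a free-form string.
        print('libvirt connection closed, reason code: %d' % reason)

    # An event loop implementation has to be registered (and run) for
    # the callback to ever fire.
    libvirt.virEventRegisterDefaultImpl()

    conn = libvirt.open('qemu:///system')
    conn.registerCloseCallback(close_callback, None)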

This bug raises a bigger long-term question about the need for a new
column in the Service table indicating whether or not the service was
automatically disabled, as Phil Day points out in bug 1250049 [6].
That way the ComputeFilter in the scheduler could handle that case a
bit differently, at least from a logging/serviceability standpoint,
e.g. an info/warning-level message vs. debug.
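
Purely as a hypothetical sketch of what I mean (the 'auto_disabled'
field is the kind of new Service column being discussed, not something
that exists today):

    # Hypothetical ComputeFilter-style check; field names are made up.
    import logging

    LOG = logging.getLogger('sketch')

    def host_passes(service, host):
        if not service.get('disabled'):
            return True
        if service.get('auto_disabled'):
            # Disabled automatically (e.g. by the libvirt close
            # callback) - probably worth more than a DEBUG message.
            LOG.warning('%s is auto-disabled, skipping', host)
        else:
            # Disabled intentionally by an operator - DEBUG is fine.
            LOG.debug('%s was disabled by an operator, skipping', host)
        return False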

[1] https://bugs.launchpad.net/nova/+bug/1257644
[2] https://review.openstack.org/#/c/52189/
[3] https://review.openstack.org/#/c/56224/
[4] https://bugs.launchpad.net/nova/+bug/1254872
[5] http://www.redhat.com/archives/libvir-list/2012-July/msg01675.html
[6] https://bugs.launchpad.net/nova/+bug/1250049

--

Thanks,

Matt Riedemann



