[openstack-dev] Top Gate Bugs

Davanum Srinivas davanum at gmail.com
Fri Dec 6 23:57:37 UTC 2013


I had the labels wrong - here's a slightly better link - http://bit.ly/1gdxYeg

On Fri, Dec 6, 2013 at 4:31 PM, Davanum Srinivas <davanum at gmail.com> wrote:
> Joe,
>
> Looks like we may be a bit more stable now?
>
> Short URL: http://bit.ly/18qq4q2
>
> Long URL : http://graphite.openstack.org/graphlot/?from=-120hour&until=-0hour&target=color(alias(movingAverage(asPercent(stats.zuul.pipeline.gate.job.gate-tempest-dsvm-full.SUCCESS,sum(stats.zuul.pipeline.gate.job.gate-tempest-dsvm-full.{SUCCESS,FAILURE})),'6hours'),%20'gate-tempest-dsvm-postgres-full'),'ED9121')&target=color(alias(movingAverage(asPercent(stats.zuul.pipeline.gate.job.gate-tempest-dsvm-postgres-full.SUCCESS,sum(stats.zuul.pipeline.gate.job.gate-tempest-dsvm-postgres-full.{SUCCESS,FAILURE})),'6hours'),%20'gate-tempest-dsvm-neutron-large-ops'),'00F0F0')&target=color(alias(movingAverage(asPercent(stats.zuul.pipeline.gate.job.gate-tempest-dsvm-neutron.SUCCESS,sum(stats.zuul.pipeline.gate.job.gate-tempest-dsvm-neutron.{SUCCESS,FAILURE})),'6hours'),%20'gate-tempest-dsvm-neutron'),'00FF00')&target=color(alias(movingAverage(asPercent(stats.zuul.pipeline.gate.job.gate-tempest-dsvm-neutron-large-ops.SUCCESS,sum(stats.zuul.pipeline.gate.job.gate-tempest-dsvm-neutron-large-ops.{SUCCESS,FAILURE})),'6hours'),%20'gate-tempest-dsvm-neutron-large-ops'),'00c868')&target=color(alias(movingAverage(asPercent(stats.zuul.pipeline.check.job.check-grenade-dsvm.SUCCESS,sum(stats.zuul.pipeline.check.job.check-grenade-dsvm.{SUCCESS,FAILURE})),'6hours'),%20'check-grenade-dsvm'),'800080')&target=color(alias(movingAverage(asPercent(stats.zuul.pipeline.gate.job.gate-tempest-dsvm-large-ops.SUCCESS,sum(stats.zuul.pipeline.gate.job.gate-tempest-dsvm-large-ops.{SUCCESS,FAILURE})),'6hours'),%20'gate-tempest-dsvm-neutron-large-ops'),'E080FF')
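>
> For anyone squinting at the long URL: each target is just a 6-hour moving
> average of a job's success percentage. A rough sketch of the pattern in
> Python (hypothetical helper, just string formatting, not an actual tool):
>
>     # Build one graphite target like the ones in the long URL above:
>     # a 6-hour moving average of SUCCESS as a percent of SUCCESS+FAILURE.
>     def graphite_target(pipeline, job, label, color):
>         stat = 'stats.zuul.pipeline.%s.job.%s' % (pipeline, job)
>         return (
>             "color(alias(movingAverage(asPercent(%(s)s.SUCCESS,"
>             "sum(%(s)s.{SUCCESS,FAILURE})),'6hours'),'%(label)s'),'%(color)s')"
>             % {'s': stat, 'label': label, 'color': color})
>
>     print(graphite_target('gate', 'gate-tempest-dsvm-full',
>                           'gate-tempest-dsvm-full', 'ED9121'))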
>
> -- dims
>
>
> On Fri, Dec 6, 2013 at 11:28 AM, Matt Riedemann
> <mriedem at linux.vnet.ibm.com> wrote:
>>
>>
>> On Wednesday, December 04, 2013 7:22:23 AM, Joe Gordon wrote:
>>>
>>> TL;DR: Gate is failing 23% of the time due to bugs in nova, neutron
>>> and tempest. We need help fixing these bugs.
>>>
>>>
>>> Hi All,
>>>
>>> Before going any further, we have a bug that is affecting both gate and
>>> stable, so it's getting top priority here. elastic-recheck currently
>>> doesn't track unit tests because we don't expect them to fail very
>>> often. It turns out that assessment was wrong: we now have a nova py27
>>> unit test bug in both the trunk and stable gates.
>>>
>>> https://bugs.launchpad.net/nova/+bug/1216851
>>> Title: nova unit tests occasionally fail migration tests for mysql and
>>> postgres
>>> Hits
>>>   FAILURE: 74
>>> The failures appear multiple times for a single job, and some of those
>>> are due to bad patches in the check queue.  But this is being seen in
>>> the stable and trunk gates, so something is definitely wrong.
>>>
>>> =======
>>>
>>>
>>> It's time for another edition of 'Top Gate Bugs.'  I am sending this
>>> out now because, in addition to our usual gate bugs, a few new ones have
>>> cropped up recently, and as we saw a few weeks ago it doesn't take
>>> very many new bugs to wedge the gate.
>>>
>>> Currently the gate has a failure rate of at least 23%! [0]
>>>
>>> Note: this email was generated with
>>> http://status.openstack.org/elastic-recheck/ and
>>> 'elastic-recheck-success' [1]
>>>
>>> 1) https://bugs.launchpad.net/bugs/1253896
>>> Title: test_minimum_basic_scenario fails with SSHException: Error
>>> reading SSH protocol banner
>>> Projects:  neutron, nova, tempest
>>> Hits
>>>   FAILURE: 324
>>> This one has been around for several weeks now, and although we have
>>> made some attempts at fixing it, we aren't any closer to resolving
>>> it than we were a few weeks ago.
>>>
>>> 2) https://bugs.launchpad.net/bugs/1251448
>>> Title: BadRequest: Multiple possible networks found, use a Network ID
>>> to be more specific.
>>> Project: neutron
>>> Hits
>>>   FAILURE: 141
>>>
>>> 3) https://bugs.launchpad.net/bugs/1249065
>>> Title: Tempest failure: tempest/scenario/test_snapshot_pattern.py
>>> Project: nova
>>> Hits
>>>   FAILURE: 112
>>> This is a bug in nova's neutron code.
>>>
>>> 4) https://bugs.launchpad.net/bugs/1250168
>>> Title: gate-tempest-devstack-vm-neutron-large-ops is failing
>>> Projects: neutron, nova
>>> Hits
>>>   FAILURE: 94
>>> This is an old bug that was fixed, but came back on December 3rd. So
>>> this is a recent regression. This may be an infra issue.
>>>
>>> 5) https://bugs.launchpad.net/bugs/1210483
>>> Title: ServerAddressesTestXML.test_list_server_addresses FAIL
>>> Projects: neutron, nova
>>> Hits
>>>   FAILURE: 73
>>> This has had some attempts made at fixing it, but it's still around.
>>>
>>>
>>> In addition to the existing bugs, we have some new bugs on the rise:
>>>
>>> 1) https://bugs.launchpad.net/bugs/1257626
>>> Title: Timeout while waiting on RPC response - topic: "network", RPC
>>> method: "allocate_for_instance" info: "<unknown>"
>>> Project: nova
>>> Hits
>>>   FAILURE: 52
>>> A large-ops-only bug. This has been around for at least two weeks, but
>>> we have seen it in higher numbers starting around December 3rd. This
>>> may be an infrastructure issue, as the neutron-large-ops job started
>>> failing more around the same time.
>>>
>>> 2) https://bugs.launchpad.net/bugs/1257641
>>> Title: Quota exceeded for instances: Requested 1, but already used 10
>>> of 10 instances
>>> Projects: nova, tempest
>>> Hits
>>>   FAILURE: 41
>>> Like the previous bug, this has been around for at least two weeks but
>>> appears to be on the rise.
>>>
>>>
>>>
>>> Raw Data: http://paste.openstack.org/show/54419/
>>>
>>>
>>> best,
>>> Joe
>>>
>>>
>>> [0] failure rate = 1-(success rate gate-tempest-dsvm-neutron)*(success
>>> rate ...) * ...
>>>
>>> gate-tempest-dsvm-neutron = 0.00
>>> gate-tempest-dsvm-neutron-large-ops = 11.11
>>> gate-tempest-dsvm-full = 11.11
>>> gate-tempest-dsvm-large-ops = 4.55
>>> gate-tempest-dsvm-postgres-full = 10.00
>>> gate-grenade-dsvm = 0.00
>>>
>>> (I hope I got the math right here)
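>>>
>>> As a sanity check, here is a minimal sketch of that arithmetic in Python,
>>> assuming the per-job numbers above are failure percentages and that the
>>> jobs fail independently (both assumptions, not verified):
>>>
>>>     # Rough overall gate failure rate from per-job failure rates.
>>>     failure_pct = {
>>>         'gate-tempest-dsvm-neutron': 0.00,
>>>         'gate-tempest-dsvm-neutron-large-ops': 11.11,
>>>         'gate-tempest-dsvm-full': 11.11,
>>>         'gate-tempest-dsvm-large-ops': 4.55,
>>>         'gate-tempest-dsvm-postgres-full': 10.00,
>>>         'gate-grenade-dsvm': 0.00,
>>>     }
>>>     overall_success = 1.0
>>>     for pct in failure_pct.values():
>>>         overall_success *= 1.0 - pct / 100.0
>>>     print('overall gate failure rate: %.1f%%'
>>>           % (100.0 * (1.0 - overall_success)))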
>>>
>>> [1]
>>>
>>> http://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/elastic_recheck/cmd/check_success.py
>>>
>>>
>>
>>
>> Let's add bug 1257644 [1] to the list.  I'm pretty sure this is due to some
>> recent code [2][3] in the nova libvirt driver that is automatically
>> disabling the host when the libvirt connection drops.
>>
>> Joe said there was a known issue with libvirt connection failures so this
>> could be duped against that, but I'm not sure where/what that one is - maybe
>> bug 1254872 [4]?
>>
>> Unless I just don't understand the code, there is some funny logic going on
>> in the libvirt driver when it's automatically disabling a host, which I've
>> documented in bug 1257644.  It would help to have some libvirt-minded people,
>> or the authors/approvers of those patches, look at that.
>>
>> Also, does anyone know if libvirt will pass a 'reason' string to the
>> _close_callback function?  I was digging through the libvirt code this
>> morning but couldn't figure out where the callback is actually called and
>> with what parameters.  The code in nova seemed to just be based on the patch
>> that danpb had in libvirt [5].
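>>
>> From a quick look at the libvirt-python bindings, it looks like the close
>> callback gets an integer VIR_CONNECT_CLOSE_REASON_* code rather than a
>> string; something like this sketch (not verified against what nova
>> actually registers):
>>
>>     import libvirt
>>
>>     # Sketch: register a connection close callback; the 'reason'
>>     # argument is an integer code, not a free-form string.
>>     def _close_callback(conn, reason, opaque):
>>         reasons = {
>>             libvirt.VIR_CONNECT_CLOSE_REASON_ERROR: 'I/O error',
>>             libvirt.VIR_CONNECT_CLOSE_REASON_EOF: 'end of file',
>>             libvirt.VIR_CONNECT_CLOSE_REASON_KEEPALIVE: 'keepalive timeout',
>>             libvirt.VIR_CONNECT_CLOSE_REASON_CLIENT: 'client requested',
>>         }
>>         print('libvirt connection closed: %s'
>>               % reasons.get(reason, 'unknown (%s)' % reason))
>>
>>     conn = libvirt.open('qemu:///system')
>>     conn.registerCloseCallback(_close_callback, None)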
>>
>> This bug is going to raise a bigger long-term question about the need for
>> having a new column in the Service table for indicating whether or not the
>> service was automatically disabled, as Phil Day points out in bug 1250049
>> [6].  That way the ComputeFilter in the scheduler could handle that case a
>> bit differently, at least from a logging/serviceability standpoint, e.g.
>> info/warning level message vs debug.
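>>
>> Something along these lines is what I have in mind for the filter side
>> (purely a hypothetical sketch; the 'auto_disabled' column and these names
>> don't exist in the tree today):
>>
>>     import logging
>>
>>     LOG = logging.getLogger(__name__)
>>
>>     # Hypothetical: ComputeFilter-style check that logs automatically
>>     # disabled hosts louder than hosts an operator disabled on purpose.
>>     def host_passes(host_state, service):
>>         if not service['disabled']:
>>             return True
>>         if service.get('auto_disabled'):
>>             LOG.warning('%s was automatically disabled, skipping', host_state)
>>         else:
>>             LOG.debug('%s is disabled, skipping', host_state)
>>         return False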
>>
>> [1] https://bugs.launchpad.net/nova/+bug/1257644
>> [2] https://review.openstack.org/#/c/52189/
>> [3] https://review.openstack.org/#/c/56224/
>> [4] https://bugs.launchpad.net/nova/+bug/1254872
>> [5] http://www.redhat.com/archives/libvir-list/2012-July/msg01675.html
>> [6] https://bugs.launchpad.net/nova/+bug/1250049
>>
>> --
>>
>> Thanks,
>>
>> Matt Riedemann
>>
>>
>
>
>
> --
> Davanum Srinivas :: http://davanum.wordpress.com



-- 
Davanum Srinivas :: http://davanum.wordpress.com


