[openstack-dev] [neutron] Functional job failure rate at 100%

Ihar Hrachyshka ihrachys at redhat.com
Mon Aug 7 16:57:56 UTC 2017


On Mon, Aug 7, 2017 at 2:52 AM, Jakub Libosvar <jlibosva at redhat.com> wrote:
> Hi all,
>
> as per grafana [1] the functional job is broken. Looking at logstash [2]
> it started happening consistently since 2017-08-03 16:27. I didn't find
> any particular patch in Neutron that could cause it.
>
> The culprit is that ovsdb starts misbehaving [3] and then we retry calls
> indefinitely. We still use 2.5.2 openvswitch as we had before. I opened
> a bug [4] and started investigation, I'll update my findings there.
>
> I think at this point there is no reason to run "recheck" on your patches.
>
> Thanks,
> Jakub
>
> [1]
> http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=7&fullscreen
> [2] http://bit.ly/2vdKMwy
> [3]
> http://logs.openstack.org/14/488914/8/check/gate-neutron-dsvm-functional-ubuntu-xenial/75d7482/logs/openvswitch/ovsdb-server.txt.gz
> [4] https://bugs.launchpad.net/neutron/+bug/1709032

Considering all the instability we have seen in the job lately (this bug
is the latest hit, but we also have
https://bugs.launchpad.net/neutron/+bug/1707933), the approaching
release, and the lack of significant resources to dig into the issue, I
propose to temporarily disable the job:
https://review.openstack.org/#/c/491548/. I also suggest that our mighty
leadership raise awareness of the issue and rally the troops to get it
solved.

(To reply to Kevin's request in IRC.) To recap what happened with the
timeout bug, https://bugs.launchpad.net/neutron/+bug/1707933: it popped
up about a month ago in master, but it hits the Ocata branch too (so it's
either a recent backport or some external dependency). What happens is
that one of the test workers (almost always one running a
FirewallTestCase test case) dies in the middle of the run (you can see a
'Killed' message in the console log, and most of the time you can also
see the job taking ~2h and the last test worker dying in the
'inprogress' state). The first hypothesis was that some (other?) test
case calls execute(['kill', ...]) with the worker PID. To check that,
Jakub proposed https://review.openstack.org/#/c/487065/ and rechecked
for a while until the bug was triggered in the gate. The collected log
suggested that kill was NOT called with the PID. The next step could be
catching all os.kill() calls in all functional tests and logging their
arguments somewhere (with call stacks). We were thinking of mocking
os.kill, replacing it with a function that would log the call and pass
it on to the original implementation, but we haven't had time for that
so far.
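
For illustration, the os.kill wrapper idea could look roughly like the
sketch below. This is not code from any proposed patch; the names
make_logging_kill and log_os_kill are hypothetical, and how it would be
hooked into the functional test base class is left open. It only sees
calls that go through os.kill in the same Python process, so a kill
issued by an external command or by another process would not show up.

```python
import contextlib
import logging
import os
import traceback
from unittest import mock

LOG = logging.getLogger(__name__)


def make_logging_kill(original_kill, record):
    """Return a kill replacement that logs each call, then forwards it.

    Hypothetical helper: `record` is a plain list that collects
    (pid, sig, stack) tuples so a test can inspect them afterwards.
    """
    def logging_kill(pid, sig):
        # Capture the caller's stack so the offending test can be found.
        stack = "".join(traceback.format_stack())
        record.append((pid, sig, stack))
        LOG.warning("os.kill(pid=%s, sig=%s) called from:\n%s",
                    pid, sig, stack)
        # Pass the call on to the original implementation unchanged.
        return original_kill(pid, sig)
    return logging_kill


@contextlib.contextmanager
def log_os_kill(record):
    """Temporarily replace os.kill with the logging wrapper."""
    # os.kill is looked up before the patch is applied, so the wrapper
    # forwards to the real implementation.
    with mock.patch.object(os, "kill", make_logging_kill(os.kill, record)):
        yield record
```

A test fixture could enter log_os_kill() in setUp and dump the collected
records on failure. Code that did "from os import kill" before the patch
was applied would keep a reference to the original function and bypass
the wrapper.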

Regards,
Ihar


