[glance] functional-py35 job fails due to subunit.parser error

Matthew Treinish mtreinish at kortar.org
Thu Dec 13 20:05:17 UTC 2018


On Thu, Dec 13, 2018 at 10:01:06AM -0800, Clark Boylan wrote:
> On Thu, Dec 13, 2018, at 9:38 AM, Abhishek Kekane wrote:
> > Hi All,
> > 
> > Since 7th December 2018, the functional-py35 check job has suddenly started
> > failing on current master. Earlier this job was failing intermittently (2-3
> > times out of 100). I have tried to dig into the issue and found that it is
> > failing only on Ubuntu and working fine with Fedora, CentOS, etc. Below are
> > some of the findings:
> > 
> > 1. Even if I add the failing tests [1] to the black-list file, other tests
> > are reported as failures every time.
> > 2. When I compared the failed test results [1] with the passed test results
> > [2], I found that the passing run shows 'Pass 741 Skip 7' whereas the
> > failing run shows 'Pass 627 Failure 9 Skip 5' (it does not execute all the
> > tests in case of failure).
> > 3. I have compared the versions of subunit, oslo.service, oslo.log, and tox
> > between the failed [3] and passed [4] logs and they are the same.
> > 4. I came across a bug in cinder [5] which suggests that excessive logging
> > might cause this error, and I have tried to ignore the deprecation warnings
> > in tox.ini by adding 'PYTHONWARNINGS=ignore::DeprecationWarning', but the
> > logs still show DeprecationWarning.
> > 5. I have tried executing the tests with '--concurrency' set to 1, but the
> > error still persists.
> > 6. Whenever the subunit.parser error occurs, the tests remain in the
> > 'inprogress' state and the job sometimes times out.
> > 
> > I have no clues so far on how to tackle this issue; kindly provide your
> > input.
> 
> Usually I like to start with the subunit data itself [6]. You'll want to download this file locally, then unzip it. At this point it is in subunitv2 protocol format which is binary and not very human readable. If you install the python subunit package you get a tool called `subunit-2to1` that you can run on this file to get v1 protocol format which is human readable.
> 
> Doing this we get:
> 
> tags: worker-3
> test: subunit.parser
> time: 2018-12-13 03:36:22.466478Z
> failure: subunit.parser [ multipart
> Content-Type: application/octet-stream
> Packet data
> 6
> �+p�v�0
> Content-Type: text/plain;charset=utf8
> Parser Error
> 31
> Short read - got 65530 bytes, wanted 423557 bytes0
> ]
> tags: -worker-3
> time: 2018-12-13 03:36:17.372762Z
> 
> The control process got fewer bytes from worker-3 than it expected. One reason this could happen is the worker-3 test process is crashing before writing all the bytes, but there is at least one test on worker-3 that reports after this parse error so that may not be the cause.
> 
> Another potential cause: subunit + testr relies on stdout and subunit + stestr relies on a tcp socket (I think), and it could be that the stream is being closed by side effects in a test?

stestr doesn't use a tcp socket for the subunit streams from the workers; it
also uses stdout from the subprocess. That code is actually **mostly** unchanged
from testr since the fork. I have a neglected work-in-progress branch adding
support for using os.pipe and multiprocessing instead of wrapping the
subunit.run calls in subprocess here:

https://github.com/mtreinish/stestr/commit/ed390dfcea6d7cfe40908ea4dd60d9f27e5ec28c

in an effort to make the worker scheduling more dynamic. But that hasn't
merged, and even if it ever does it'll be opt-in for a while.
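
For reference, the basic shape of what happens today is roughly this. This is
just an illustrative sketch, not stestr's actual code, and the test id is a
placeholder; but it shows the pattern of wrapping subunit.run in a subprocess
and reading the v2 stream from its stdout:

import subprocess

import subunit
import testtools

# Launch the runner as a subprocess; subunit.run writes a subunit v2
# stream to its stdout. 'glance.tests.functional' is just a placeholder id.
proc = subprocess.Popen(
    ['python3', '-m', 'subunit.run', 'glance.tests.functional'],
    stdout=subprocess.PIPE)

# Parse the v2 byte stream from the worker's stdout and feed it to a
# StreamSummary to tally the results.
case = subunit.ByteStreamToStreamResult(proc.stdout, non_subunit_name='stdout')
summary = testtools.StreamSummary()
summary.startTestRun()
case.run(summary)
summary.stopTestRun()
proc.wait()
print('successful:', summary.wasSuccessful())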

> 
> To narrow this down I would use process of elimination to identify suspect test cases. Basically any test that doesn't show up in that subunit log could be causing this. Then run them in isolation to look for side effect causing errors. You may also want to cross check against several failures which may help narrow it down further.

I agree with everything clarkb said here and the first step is to look at the
actual subunit output to see where things are going wrong. The other thing you
should try is running the test suite fully with subunit.run by itself, in an
environment where this is failing, to get the subunit stream without stestr
consuming it. If that fails then that gives us a good idea of where to look. If
not, we can take that stream as a good example to compare against.
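
Once you have a stream (either one you captured yourself or the
testrepository.subunit.gz from the job logs), something like this untested
sketch will replay it through testtools and tell you which test ids actually
made it into the stream; the filename is just whatever you have locally:

import gzip

import subunit
import testtools


class TestIdCollector(testtools.StreamResult):
    """Record every test id that appears in the stream."""

    def __init__(self):
        super(TestIdCollector, self).__init__()
        self.test_ids = set()

    def status(self, test_id=None, **kwargs):
        if test_id is not None:
            self.test_ids.add(test_id)


collector = TestIdCollector()
summary = testtools.StreamSummary()
result = testtools.CopyStreamResult([collector, summary])

# testrepository.subunit.gz is the gzipped v2 stream from the job logs;
# swap in whatever file you captured locally.
with gzip.open('testrepository.subunit.gz', 'rb') as source:
    case = subunit.ByteStreamToStreamResult(source, non_subunit_name='stdout')
    result.startTestRun()
    case.run(result)
    result.stopTestRun()

print('tests seen in stream:', len(collector.test_ids))
print('successful:', summary.wasSuccessful())

Any test in the full test list that doesn't show up in that set is a suspect
for the process-of-elimination approach clarkb described above.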

This does sound very similar to the bug with the cinder unit tests that we
were hitting earlier this year (which we never found a root cause or fix for).
That was triggered by having very large attachments in the subunit stream (in
the common case these are stdout, stderr, and python logging). When the
attachments reached a certain size in the subunit stream, the parser would
error and cause the job to fail, and all remaining tests on that worker would
not run and/or had weird follow-on failures (it's been a while so I don't
remember the exact behavior) because subunit.run failed. The way we worked
around the issue in cinder was to stop writing such large attachments during
the test run, which in that case meant to stop writing tons of text to stdout
during the tests. None of the things described above actually does this, so it
might be worth trying to decrease the attachment size.
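
If you want to check whether that's what is happening here, a quick untested
sketch like this will sum the attachment bytes per test in the captured stream
(again, the filename is just whatever copy of the stream you have locally):

from collections import defaultdict
import gzip

import subunit
import testtools


class AttachmentSizer(testtools.StreamResult):
    """Sum the bytes of each attachment (stdout, stderr, logging, etc.)."""

    def __init__(self):
        super(AttachmentSizer, self).__init__()
        self.sizes = defaultdict(int)

    def status(self, test_id=None, file_name=None, file_bytes=None, **kwargs):
        if file_bytes:
            self.sizes[(test_id, file_name)] += len(file_bytes)


sizer = AttachmentSizer()
with gzip.open('testrepository.subunit.gz', 'rb') as source:
    case = subunit.ByteStreamToStreamResult(source, non_subunit_name='stdout')
    sizer.startTestRun()
    case.run(sizer)
    sizer.stopTestRun()

# Print the twenty largest attachments; anything enormous here is a candidate
# for trimming (e.g. by writing less to stdout or the logs during the tests).
for (test_id, file_name), size in sorted(
        sizer.sizes.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print(size, test_id, file_name)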

-Matt Treinish

> 
> > 
> > 
> > [1]
> > http://logs.openstack.org/89/617889/4/check/openstack-tox-functional-py35/4c79eb0/testr_results.html.gz
> > [2]
> > http://logs.openstack.org/89/617889/4/check/openstack-tox-functional-py35/39f2c7e/testr_results.html.gz
> > [3]
> > http://logs.openstack.org/89/617889/4/check/openstack-tox-functional-py35/4c79eb0/job-output.txt.gz
> > [4]
> > http://logs.openstack.org/89/617889/4/check/openstack-tox-functional-py35/39f2c7e/job-output.txt.gz
> > [5] https://bugs.launchpad.net/cinder/+bug/1728640
> 
> [6] http://logs.openstack.org/89/617889/4/check/openstack-tox-functional-py35/4c79eb0/testrepository.subunit.gz
> 