[glance] functional-py35 job fails due to subunit.parser error

Slawomir Kaplonski skaplons at redhat.com
Thu Dec 13 20:24:52 UTC 2018


Hi,

We are facing the same issue in Neutron now. For a few weeks we have been trying to switch our functional job to Python 3 [1] and we hit the same problem: usually only around 500 tests were run (in a "good" run with py27 it's more than 1000) before the job failed with a "subunit.parser error".
I have now disabled (almost) all Python warnings and changed some strings to be logged instead of printed on stdout, and I finally managed to run all tests without this error.
I think we will have to live with this workaround for now, as we don't know the real root cause of the issue.

[1] https://review.openstack.org/#/c/577383/
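
A rough sketch of the kind of change that workaround involves (not the actual patch in [1]; the logger and helper names are just placeholders): silence Python warnings inside the test processes and route diagnostic output through logging instead of print(), so it stays out of the worker's stdout:

    import logging
    import warnings

    LOG = logging.getLogger(__name__)

    # Drop (almost) all Python warnings emitted during the tests.
    warnings.simplefilter('ignore')

    def report_progress(msg):
        # Previously: print(msg); logging keeps the text out of the
        # worker's stdout and in the captured log instead.
        LOG.debug(msg)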

-- 
Slawek Kaplonski
Senior software engineer
Red Hat

> On 13.12.2018, at 21:05, Matthew Treinish <mtreinish at kortar.org> wrote:
> 
> On Thu, Dec 13, 2018 at 10:01:06AM -0800, Clark Boylan wrote:
>> On Thu, Dec 13, 2018, at 9:38 AM, Abhishek Kekane wrote:
>>> Hi All,
>>> 
>>> Since 7 December 2018, the functional-py35 check job has suddenly been failing
>>> on current master. Earlier this job was failing intermittently (2-3 times
>>> out of 100). I have tried to dig into the issue and found that it is
>>> failing only on Ubuntu while working fine on Fedora, CentOS, etc. Below are
>>> some of the findings:
>>> 
>>> 1. Even if I add the failing tests [1] to the black-list file, different
>>> tests are reported as failures every time.
>>> 2. When I compared the failed test results [1] with the passed test results [2],
>>> I found that the passing run shows 'Pass 741 Skip 7' whereas the failing run
>>> shows 'Pass 627 Failure 9 Skip 5' (it does not execute all the tests in
>>> the failure case).
>>> 3. I have compared the versions of subunit, oslo.service, oslo.log and tox
>>> between the failed [3] and passed [4] logs and they are the same.
>>> 4. I came across a bug in cinder [5] which suggests that excessive logging
>>> might cause this error, and I have tried to ignore the deprecation warnings in
>>> tox.ini by adding 'PYTHONWARNINGS=ignore::DeprecationWarning', but the logs
>>> still show DeprecationWarning.
>>> 5. I have tried executing the tests with '--concurrency' set to 1, but the
>>> error persists.
>>> 6. Whenever the subunit.parser error occurs, the affected tests remain in the
>>> 'inprogress' state and the job sometimes times out.
>>> 
>>> I have no clues so far on how to tackle this issue; kindly provide your
>>> inputs on the same.
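>>>
>>> (Regarding point 4: a minimal, hypothetical check that could be dropped
>>> temporarily into a test module to confirm the warnings filter actually
>>> reaches the worker processes; PYTHONWARNINGS is only read at interpreter
>>> startup, so it has to be present in the environment of the spawned test
>>> workers.)
>>>
>>>     import os
>>>     import warnings
>>>
>>>     # Show what the worker process actually received.
>>>     print('PYTHONWARNINGS =', os.environ.get('PYTHONWARNINGS'))
>>>     # List the installed filters that target DeprecationWarning;
>>>     # filter tuples are (action, message, category, module, lineno).
>>>     print([f for f in warnings.filters if f[2] is DeprecationWarning])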
>> 
>> Usually I like to start with the subunit data itself [6]. You'll want to download this file locally, then unzip it. At this point it is in the subunit v2 protocol format, which is binary and not very human readable. If you install the python-subunit package, you get a tool called `subunit-2to1` that you can run on this file to get the v1 protocol format, which is human readable.
>> 
>> Doing this we get:
>> 
>> tags: worker-3
>> test: subunit.parser
>> time: 2018-12-13 03:36:22.466478Z
>> failure: subunit.parser [ multipart
>> Content-Type: application/octet-stream
>> Packet data
>> 6
>> �+p�v�0
>> Content-Type: text/plain;charset=utf8
>> Parser Error
>> 31
>> Short read - got 65530 bytes, wanted 423557 bytes0
>> ]
>> tags: -worker-3
>> time: 2018-12-13 03:36:17.372762Z
>> 
>> The control process got fewer bytes from worker-3 than it expected. One reason this could happen is that the worker-3 test process crashed before writing all the bytes, but there is at least one test on worker-3 that reports after this parse error, so that may not be the cause.
>> 
>> Another potential cause: subunit + testr relies on stdout and subunit + stestr relies on a tcp socket (I think), and it could be that that stream is being closed by side effects in a test?
> 
> stestr doesn't use a tcp socket for the subunit streams from the workers; it
> also uses stdout from subprocess. That code is actually **mostly** unchanged
> from testr since the fork. I have a neglected work-in-progress branch
> adding support for using os.pipe and multiprocessing instead of wrapping
> the subunit.run calls in subprocess here:
> 
> https://github.com/mtreinish/stestr/commit/ed390dfcea6d7cfe40908ea4dd60d9f27e5ec28c
> 
> in an effort to make the worker scheduling more dynamic. But that
> hasn't merged, and even if it ever does it'll be opt-in for a while.
> 
>> 
>> To narrow this down I would use a process of elimination to identify suspect test cases. Basically, any test that doesn't show up in that subunit log could be causing this. Then run those tests in isolation to look for errors caused by side effects. You may also want to cross-check against several failures, which may help narrow it down further.
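>>
>> For example, a minimal sketch (not glance-specific; it assumes python-subunit
>> and testtools are installed and that the file name matches the downloaded log)
>> that pulls the executed test ids out of the subunit file, so they can be diffed
>> against the full list from `stestr list`:
>>
>>     import gzip
>>
>>     import subunit
>>     import testtools
>>
>>     seen = []
>>
>>     def record(test):
>>         # 'test' is a dict with keys such as 'id', 'status' and 'details'.
>>         if test['id']:
>>             seen.append((test['status'], test['id']))
>>
>>     with gzip.open('testrepository.subunit.gz', 'rb') as raw:
>>         stream = subunit.ByteStreamToStreamResult(raw, non_subunit_name='stdout')
>>         result = testtools.StreamToDict(record)
>>         result.startTestRun()
>>         stream.run(result)
>>         result.stopTestRun()
>>
>>     for status, test_id in sorted(seen):
>>         print(status, test_id)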
> 
> I agree with everything clarkb said here; the first step is to look at the
> actual subunit output to see where things are going wrong. The other thing you
> should try is running the test suite fully with subunit.run by itself, in an
> environment where this is failing, to get the subunit stream without stestr
> consuming it. If that fails, then that gives us a good idea of where to look. If
> not, we can take that stream as a good example to compare against.
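>
> Roughly something like the following, redirecting the raw stream to a file
> (the discover path here is just a placeholder; point it at wherever the
> functional tests live):
>
>     python3 -m subunit.run discover -t . ./glance/tests/functional > functional.subunit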
> 
> This does sound very similar to the bug with the cinder unit tests that we
> were hitting earlier this year (which we never found a root cause or fix for).
> That was caused by having very large attachments in the subunit stream (in the
> common case these are stdout, stderr, and python logging). When the attachments
> reached a certain size in the subunit stream, the parser would error, causing
> the job to fail and all remaining tests on that worker to not run and/or hit
> weird follow-on failures (it's been a while, so I don't remember the exact
> behavior) because subunit.run failed. The way we worked around the issue in
> cinder was to stop writing such large attachments during the test run, which in
> that case meant to stop writing tons of text to stdout during the tests. None of
> the things described here actually does this, so it might be worth trying to
> decrease the attachment size.
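>
> If it helps narrow that down, here is a rough sketch (same parsing approach as
> the snippet above; it assumes python-subunit and testtools are installed) that
> ranks the per-test attachment sizes in the stream, so the noisiest tests stand
> out:
>
>     import gzip
>
>     import subunit
>     import testtools
>
>     sizes = []
>
>     def record(test):
>         if not test['id']:
>             return
>         # 'details' maps attachment names (stdout, stderr, python logging,
>         # ...) to testtools content objects.
>         for name, content in (test.get('details') or {}).items():
>             nbytes = sum(len(chunk) for chunk in content.iter_bytes())
>             sizes.append((nbytes, test['id'], name))
>
>     with gzip.open('testrepository.subunit.gz', 'rb') as raw:
>         stream = subunit.ByteStreamToStreamResult(raw, non_subunit_name='stdout')
>         result = testtools.StreamToDict(record)
>         result.startTestRun()
>         stream.run(result)
>         result.stopTestRun()
>
>     # Print the 20 largest attachments.
>     for nbytes, test_id, name in sorted(sizes, reverse=True)[:20]:
>         print(nbytes, test_id, name)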
> 
> -Matt Treinish
> 
>> 
>>> 
>>> 
>>> [1]
>>> http://logs.openstack.org/89/617889/4/check/openstack-tox-functional-py35/4c79eb0/testr_results.html.gz
>>> [2]
>>> http://logs.openstack.org/89/617889/4/check/openstack-tox-functional-py35/39f2c7e/testr_results.html.gz
>>> [3]
>>> http://logs.openstack.org/89/617889/4/check/openstack-tox-functional-py35/4c79eb0/job-output.txt.gz
>>> [4]
>>> http://logs.openstack.org/89/617889/4/check/openstack-tox-functional-py35/39f2c7e/job-output.txt.gz
>>> [5] https://bugs.launchpad.net/cinder/+bug/1728640
>> 
>> [6] http://logs.openstack.org/89/617889/4/check/openstack-tox-functional-py35/4c79eb0/testrepository.subunit.gz



