[openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly

Roman Podoliaka rpodolyaka at mirantis.com
Thu Sep 15 13:08:27 UTC 2016


Mike,

I think the exact error (InterfaceError vs TimeoutException) varies
depending on what code is being executed at the very moment when a
process receives SIGALRM.

I tried to run the tests against PostgreSQL passing very small timeout
values (OS_TEST_TIMEOUT=5 python -m testtools.run
nova.tests.unit.db.test_migrations.TestNovaMigrationsPostgreSQL.test_walk_versions)
and saw both InterfaceError and TimeoutException:

http://paste.openstack.org/show/577410/
http://paste.openstack.org/show/577411/
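
For context, OS_TEST_TIMEOUT is normally wired into a SIGALRM-based
fixtures.Timeout by the oslotest-style base test class; roughly like this
(a sketch, not the exact Nova code):

    import os

    import fixtures
    import testtools


    class _TimedTestCase(testtools.TestCase):
        # Sketch only: the real base class does more, but the timeout
        # handling looks roughly like this.
        def setUp(self):
            super(_TimedTestCase, self).setUp()
            try:
                timeout = int(os.environ.get('OS_TEST_TIMEOUT', 0))
            except ValueError:
                timeout = 0
            if timeout > 0:
                # gentle=True arms signal.alarm() and raises
                # fixtures.TimeoutException in whatever code happens to
                # be running when SIGALRM fires.
                self.useFixture(fixtures.Timeout(timeout, gentle=True))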

strace'ing shows that a connection to PostgreSQL is closed right after
SIGALRM is handled:

http://paste.openstack.org/show/577425/

I tried to reproduce that manually with gdb by setting a
breakpoint on close():

http://paste.openstack.org/show/577422/

^ looks like psycopg2 closes the connection automatically if a query
was interrupted by SIGALRM.

The corresponding Python level backtrace is:

http://paste.openstack.org/show/577423/

^ i.e. the connection is closed in the middle of the cursor.execute() call.

In the end I see a similar InterfaceError:

http://paste.openstack.org/show/577424/
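
If anyone wants to poke at this outside of the test suite, the sequence can
be sketched at the DB-API level like this (placeholder DSN; the exact
behaviour may vary with the psycopg2 version):

    import signal

    import psycopg2


    def _on_alarm(signum, frame):
        raise RuntimeError('timeout fired')  # stands in for TimeoutException

    signal.signal(signal.SIGALRM, _on_alarm)

    conn = psycopg2.connect(dbname='test', user='test')  # placeholder DSN
    cur = conn.cursor()

    signal.alarm(1)                          # deliver SIGALRM in 1 second
    try:
        cur.execute('SELECT pg_sleep(10)')   # interrupted mid-execute
    except Exception as exc:
        print('execute interrupted: %r' % exc)
    finally:
        signal.alarm(0)

    # With the psycopg2 version I have here, the connection is already
    # closed at this point, so the rollback the fixture attempts next fails
    # the same way as in the paste: InterfaceError: connection already closed.
    try:
        conn.rollback()
    except psycopg2.InterfaceError as exc:
        print('rollback failed: %r' % exc)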

That being said, this does not explain the "DB is in use" part.
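
If someone can catch one of those DROP DATABASE failures live, querying
pg_stat_activity should show which session is still attached (a sketch;
the database name is just the randomly generated one from the log below):

    import psycopg2

    conn = psycopg2.connect(dbname='postgres', user='postgres')  # placeholder
    cur = conn.cursor()
    cur.execute(
        'SELECT pid, state, backend_start, query '
        'FROM pg_stat_activity WHERE datname = %s',
        ('dbzrtmgbxv',),  # the random database name from the log below
    )
    for row in cur.fetchall():
        print(row)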

Thanks,
Roman

On Thu, Sep 15, 2016 at 6:05 AM, Mike Bayer <mbayer at redhat.com> wrote:
> There's a different set of logs attached to the Launchpad issue; that's not
> what I was looking at before.
>
> These logs are at
> http://logs.openstack.org/90/369490/1/check/gate-nova-tox-db-functional-ubuntu-xenial/085ac3e/console.html#_2016-09-13_14_54_18_098031
> In these logs, I see something *very* different: not just the MySQL
> tests but also the PostgreSQL tests are definitely hitting conflicts against
> the randomly generated database.
>
> This set of traces, e.g.:
>
> sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) database
> "dbzrtmgbxv" is being accessed by other users
> 2016-09-13 14:54:18.093723 |     DETAIL:  There is 1 other session using the
> database.
> 2016-09-13 14:54:18.093736 |      [SQL: 'DROP DATABASE dbzrtmgbxv']
>
> and
>
> File
> "/home/jenkins/workspace/gate-nova-tox-db-functional-ubuntu-xenial/.tox/functional/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py",
> line 668, in _rollback_impl
> 2016-09-13 14:54:18.095470 |
> self.engine.dialect.do_rollback(self.connection)
> 2016-09-13 14:54:18.095513 |       File
> "/home/jenkins/workspace/gate-nova-tox-db-functional-ubuntu-xenial/.tox/functional/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py",
> line 420, in do_rollback
> 2016-09-13 14:54:18.095526 |         dbapi_connection.rollback()
> 2016-09-13 14:54:18.095548 |     sqlalchemy.exc.InterfaceError:
> (psycopg2.InterfaceError) connection already closed
>
> are a very different animal.  For one thing, they're on PostgreSQL, where the
> driver and DB act extremely rationally.  For another, there's no timeout
> exception here, and not all the conflicts are within the teardown.
>
> Are *these* errors also new as of version 4.13.3 of oslo.db?  Because here
> I have more suspicion of one particular oslo.db change.
>
>
> The logs I was looking at before fit much more with your initial description.
>
>
> On 09/14/2016 10:48 PM, Mike Bayer wrote:
>>
>>
>>
>> On 09/14/2016 07:04 PM, Alan Pevec wrote:
>>>>
>>>> Oslo.db 4.13.3 did hit the scene about the time this showed up, so I
>>>> think we need to strongly consider blocking it and revisiting these
>>>> issues post-Newton.
>>>
>>>
>>> So that means reverting all stable/newton changes; previous 4.13.x releases
>>> have already been blocked: https://review.openstack.org/365565
>>> How would we proceed? Do we need to revert all backports on stable/newton?
>>
>>
>> In case my previous email wasn't clear, I don't *yet* see evidence that
>> the recent 4.13.3 release of oslo.db is the cause of this problem.
>> However, that is only based upon what I see in this stack trace, which
>> is that the test framework is acting predictably (though erroneously)
>> based on the timeout condition which is occurring.   I don't (yet) see a
>> reason that the same effect would not occur prior to 4.13.3 in the face
>> of a signal pre-empting the work of the pymysql driver mid-stream.
>> However, this assumes that the timeout condition itself is not a product
>> of the current oslo.db version and that is not known yet.
>>
>> Here's a list of questions that should all be answerable and could help
>> point towards the cause.
>>
>> There are two parts to the error in the logs.  First there's the "timeout"
>> condition; then there is the bad reaction of the PyMySQL driver and the
>> test framework as a result of the operation being interrupted within the
>> test.
>>
>> * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition
>> occur?   If so, was it also accompanied by the same "resource closed"
>> condition or did this second part of the condition only appear at 4.13.3?
>>
>> * Did we see a similar "timeout" / "resource closed" combination prior
>> to 4.13.3, just with less frequency?
>>
>> * Was the version of PyMySQL also recently upgraded (I'm assuming this
>> environment has been on PyMySQL for a long time at this point)?   What
>> was the version change if so?  Especially if we previously saw "timeout"
>> but no "resource closed", perhaps an older version of PyMySQL didn't
>> react in this way?
>>
>> * Was the version of MySQL running in the CI environment changed?   What
>> was the version change if so?    Were there any configuration changes
>> such as transaction isolation, memory or process settings?
>>
>> * Have there been changes to the "timeout" logic itself in the test
>> suite, e.g. whatever it is that sets up fixtures.Timeout()?  Or some
>> change that alters how teardown of tests occurs when a test is
>> interrupted via this timeout?
>>
>> * What is the magnitude of the "timeout" this fixture is using, is it on
>> the order of seconds, minutes, hours?
>>
>> * If many minutes or hours, can the test suite be observed to be stuck
>> on this test?   Has someone tried to run a "SHOW PROCESSLIST" while this
>> condition is occurring to see what SQL is pausing?
>>
>> * Has there been some change such that the migration tests are running
>> against non-empty tables or tables with much more data than was present
>> before?
>>
>> * Is this failure only present within the Nova test suite or has it been
>> observed in the test suites of other projects?
>>
>> * Is this failure present only on the "database migration" test suite or
>> is it present in other opportunistic tests, for Nova and others?
>>
>> * Have there been new database migrations added to Nova which are being
>> exercised here and may be involved?
>>
>> I'm not sure how much of an inconvenience it is to downgrade oslo.db. If
>> downgrading it is feasible, that would at least be a way to eliminate it
>> as a possibility if these same failures continue to occur, or a way to
>> confirm its involvement if they disappear.   But if downgrading is
>> disruptive then there are other things to look at in order to have a
>> better chance at predicting its involvement.
>>
>>
>>
>>>
>>> Cheers,
>>> Alan
>>>


