[openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly

Roman Podoliaka rpodolyaka at mirantis.com
Wed Sep 14 16:06:28 UTC 2016


Hmm, looks like we now run more testr workers in parallel (8 instead of 4):

http://logs.openstack.org/76/335676/7/check/gate-nova-python34-db/6841fce/console.html.gz
http://logs.openstack.org/62/369862/3/check/gate-nova-python27-db-ubuntu-xenial/2784de9/console.html

On my machine, running the Nova migration tests against MySQL is much
slower with 8 workers than with 4 due to disk IO (it's an HDD). When they
time out (after 320s) I see the very same TimeoutException and
IndexError (probably something mangles the TimeoutException further up
the stack).
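
For context, that TimeoutException comes from the per-test timeout fixture
the test base classes install -- roughly this pattern (a from-memory sketch,
not the exact nova/oslotest code):

    import os

    import fixtures
    import testtools


    class ExampleDBTestCase(testtools.TestCase):

        def setUp(self):
            super(ExampleDBTestCase, self).setUp()
            timeout = int(os.environ.get('OS_TEST_TIMEOUT', 0))
            if timeout > 0:
                # gentle=True raises fixtures.TimeoutException from a
                # SIGALRM handler in whatever code is running at that
                # moment -- e.g. PyMySQL blocked in recv_into().
                self.useFixture(fixtures.Timeout(timeout, gentle=True))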

On Wed, Sep 14, 2016 at 6:44 PM, Mike Bayer <mbayer at redhat.com> wrote:
>
>
> On 09/14/2016 11:08 AM, Mike Bayer wrote:
>>
>>
>>
>> On 09/14/2016 09:15 AM, Sean Dague wrote:
>>>
>>> I noticed the following issues happening quite often now in the
>>> opportunistic db tests for nova -
>>>
>>> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>>>
>>>
>>>
>>> It looks like some race has been introduced where the various db
>>> connections are not fully isolated from each other like they used to be.
>>> The testing magic for this is buried pretty deep in oslo.db.
>>
>>
>> that error message occurs when a cursor that has been invoked against a
>> SELECT statement fails to provide a cursor.description attribute.  It is
>> typically a driver-level bug in the MySQL world and corresponds to
>> mis-handled failure modes on the MySQL connection.
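>>
>> For anyone who hasn't seen that exception before, the benign form of it is
>> easy to produce -- purely illustrative, nothing nova-specific here:
>>
>>     from sqlalchemy import create_engine
>>
>>     eng = create_engine("sqlite://")
>>     conn = eng.connect()
>>     result = conn.execute("CREATE TABLE t (x INTEGER)")
>>     # DDL produces no cursor.description, so the result "does not
>>     # return rows" and fetching from it raises ResourceClosedError.
>>     result.fetchall()
>>
>> Seeing the same exception on a plain SELECT means the driver never
>> produced a description for the result at all, which is the broken case.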
>>
>> By "various DB connections are not fully isolated from each other" are
>> you suggesting that a single in-Python connection object itself is being
>> shared among multiple greenlets?   I'm not aware of a change in oslo.db
>> that would be a relationship to such an effect.
>
>
> So, I think by "fully isolated from each other" what you really mean is
> "operations upon a connection are not fully isolated from the subsequent
> use of that connection", since that's what I see in the logs.  A connection
> is being used during teardown to drop tables; however, it's in an
> essentially broken state from a PyMySQL perspective, which would indicate
> that something has gone wrong with this (pooled) connection in the
> preceding test, something that could not be detected or reverted once the
> connection was returned to the pool.
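>
> As an aside, the classic pessimistic recipe from the SQLAlchemy docs for
> catching broken pooled connections is a ping on checkout -- an illustrative
> sketch only, with a placeholder URL, and whether a ping would even catch
> this particular half-broken state is another question:
>
>     from sqlalchemy import create_engine, event, exc
>
>     engine = create_engine("mysql+pymysql://user:pass@localhost/test")
>
>     @event.listens_for(engine, "checkout")
>     def ping_connection(dbapi_conn, connection_record, connection_proxy):
>         cursor = dbapi_conn.cursor()
>         try:
>             cursor.execute("SELECT 1")
>         except Exception:
>             # tells the pool to discard this connection and retry the
>             # checkout with a freshly opened one
>             raise exc.DisconnectionError()
>         cursor.close()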
>
> From Roman's observation, it looks like a likely source of this corruption
> is a timeout that is interrupting the state of the PyMySQL connection.   In
> the preceding stack trace, PyMySQL is encountering a raise as it attempts
> to call "self._sock.recv_into(b)", and it seems like some combination of
> eventlet's response to signals and the fixtures.Timeout() fixture is the
> cause of this interruption.   As an additional wart, something else is
> getting involved and turning it into an IndexError; I'm not sure what that
> part is yet, though I can imagine it might be SQLAlchemy misinterpreting
> what it expects to be a PyMySQL exception class, since we normally look
> inside exception.args[0] to get the MySQL error code.   With a bare
> exception like fixtures.TimeoutException, .args is the empty tuple.
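>
> That last part is trivial to confirm in isolation (purely illustrative):
>
>     import fixtures
>
>     e = fixtures.TimeoutException()
>     # error handling that assumes a DBAPI-style (errno, message) args
>     # tuple, as with PyMySQL's OperationalError, falls over here:
>     e.args[0]   # IndexError: tuple index out of range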
>
> The PyMySQL connection is now in an invalid state and unable to perform a
> SELECT statement correctly, but the connection is not invalidated and is
> instead returned to the connection pool in a broken state.  So the
> subsequent teardown, if it uses this same connection (which is likely),
> fails because the connection has been interrupted in the middle of its work
> and not given the chance to clean up.
>
> Seems like the use of the fixtures.Timeout() fixture here is not organized
> to work with a database operation in progress, especially with an
> eventlet-monkeypatched PyMySQL.   Ideally, if something like a timeout due
> to a signal handler occurs, the entire connection pool should be disposed
> (quickest way: engine.dispose()), or at the very least (and much more
> targeted), the connection that's involved should be invalidated from the
> pool, e.g. connection.invalidate().
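>
> Roughly along these lines -- an untested sketch of the idea with a
> placeholder URL and table name, not a patch against the real fixtures:
>
>     import fixtures
>     from sqlalchemy import create_engine
>
>     engine = create_engine("mysql+pymysql://user:pass@localhost/test")
>     conn = engine.connect()
>     try:
>         conn.execute("SELECT * FROM some_table")
>     except fixtures.TimeoutException:
>         # don't hand a half-broken PyMySQL connection back to the pool
>         conn.invalidate()
>         # or, more bluntly, throw away every pooled connection:
>         # engine.dispose()
>         raise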
>
> The change to the environment here would be that this timeout is happening
> at all - the reason for that is not yet known.   If oslo.db's version were
> involved in this error, I would guess that it is related to causing this
> timeout condition, and not to anything in the connection provisioning.
>
>>>
>>> oslo.db 4.13.3 did hit the scene about the time this showed up. So I
>>> think we need to strongly consider blocking it and revisiting these
>>> issues post-Newton.
>>>
>>>     -Sean
>>>
>


