[openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly

Ihar Hrachyshka ihrachys at redhat.com
Wed Sep 21 14:46:09 UTC 2016


I just hit that TimeoutException error in neutron functional tests:

http://logs.openstack.org/68/373868/4/check/gate-neutron-dsvm-functional-ubuntu-trusty/4de275e/testr_results.html.gz

It’s a bit weird that we hit the 180 sec timeout, because in good runs the  
test takes ~5 secs.

Do we have a remedy against that kind of failure? I saw nova bumped the  
timeout length for its tests. Is that the approach we should apply across  
the board for other projects?
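
For context, the 180 sec limit presumably comes from the fixtures.Timeout
fixture that the oslotest-style base classes wire to the OS_TEST_TIMEOUT
environment variable, so "bumping the timeout" mostly means raising that
value. A minimal sketch of that pattern (details differ per project, so
treat the exact knobs here as an assumption):

    import os

    import fixtures
    import testtools


    class BaseTestCase(testtools.TestCase):
        def setUp(self):
            super(BaseTestCase, self).setUp()
            # OS_TEST_TIMEOUT is the knob that gets bumped; 0 disables it.
            try:
                timeout = int(os.environ.get('OS_TEST_TIMEOUT', 0))
            except ValueError:
                timeout = 0
            if timeout > 0:
                # gentle=True raises fixtures.TimeoutException in the test
                # via SIGALRM instead of killing the whole process.
                self.useFixture(fixtures.Timeout(timeout, gentle=True))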

Ihar

Zane Bitter <zbitter at redhat.com> wrote:

> On 14/09/16 11:44, Mike Bayer wrote:
>> On 09/14/2016 11:08 AM, Mike Bayer wrote:
>>> On 09/14/2016 09:15 AM, Sean Dague wrote:
>>>> I noticed the following issues happening quite often now in the
>>>> opportunistic db tests for nova -
>>>> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>>>>
>>>> It looks like some race has been introduced where the various db
>>>> connections are not fully isolated from each other like they used to be.
>>>> The testing magic for this is buried pretty deep in oslo.db.
>>>
>>> that error message occurs when a cursor that is expected to return
>>> rows, e.g. for a SELECT statement, fails to provide a
>>> cursor.description attribute.  It is typically a driver-level bug in
>>> the MySQL world and corresponds to mis-handled failure modes on the
>>> MySQL connection.
>>>
>>> By "various DB connections are not fully isolated from each other" are
>>> you suggesting that a single in-Python connection object itself is being
>>> shared among multiple greenlets?   I'm not aware of any change in oslo.db
>>> that would have a relationship to such an effect.
>>
>> So, I think by "fully isolated from each other" what you really mean is
>> "operations upon a connection are not fully isolated from the subsequent
>> use of that connection", since that's what I see in the logs.  A
>> connection is attempting to be used during teardown to drop tables,
>> however it's in this essentially broken state from a PyMySQL
>> perspective, which would indicate something has gone wrong with this
>> (pooled) connection in the preceding test that could not be detected or
>> reverted once the connection was returned to the pool.
>>
>> From Roman's observation, it looks like a likely source of this
>> corruption is a timeout that is interrupting the state of the PyMySQL
>> connection.   In the preceding stack trace, PyMySQL is encountering a
>> raise as it attempts to call "self._sock.recv_into(b)", and it seems
>> like some combination of eventlet's response to signals and the
>> fixtures.Timeout() fixture is the cause of this interruption.   As an
>> additional wart, something else is getting involved and turning it into
>> an IndexError; I'm not sure what that part is yet, though I can imagine
>> it might be SQLAlchemy mis-interpreting what it expects to be a
>> PyMySQL exception class, since we normally look inside of
>> exception.args[0] to get the MySQL error code.   With a blank exception
>> like fixtures.TimeoutException, .args is the empty tuple.
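
To make that last point concrete: a bare exception such as
fixtures.TimeoutException carries an empty .args tuple, so anything that
assumes a DB-API style (errno, message) pair and reaches for args[0] gets
an IndexError instead. A minimal sketch, not the actual SQLAlchemy code
path:

    import fixtures

    exc = fixtures.TimeoutException()
    print(exc.args)   # () - nothing resembling a MySQL error code
    try:
        exc.args[0]   # where a MySQL error number would normally live
    except IndexError:
        print('IndexError instead of a MySQL error number')
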
>>
>> The PyMySQL connection is now in an invalid state and unable to perform
>> a SELECT statement correctly, but the connection is not invalidated and
>> is instead returned to the connection pool in a broken state.  So the
>> subsequent teardown, if it uses this same connection (which is likely),
>> fails because the connection has been interrupted in the middle of its
>> work and not given the chance to clean up.
>>
>> Seems like the use of the fixtures.Timeout() fixture here is not organized
>> to work with a database operation in progress, especially an
>> eventlet-monkeypatched PyMySQL.   Ideally, if something like a timeout
>> due to a signal handler occurs, the entire connection pool should be
>> disposed (quickest way, engine.dispose()), or at the very least (and
>> much more targeted), the connection that's involved should be
>> invalidated from the pool, e.g. connection.invalidate().
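
A rough sketch of the targeted variant, assuming the timeout can actually
be caught around the operation in question; connection.invalidate() and
engine.dispose() are the real SQLAlchemy calls, while the surrounding
try/except and the connection URL are purely illustrative:

    import fixtures
    import sqlalchemy

    # Placeholder URL; the point is only what happens once an operation
    # has been interrupted mid-flight.
    engine = sqlalchemy.create_engine(
        'mysql+pymysql://user:secret@localhost/test')

    conn = engine.connect()
    try:
        conn.execute(sqlalchemy.text('SELECT 1'))
    except fixtures.TimeoutException:
        # The PyMySQL protocol state is now suspect; don't let this
        # connection go back into the pool as-is.
        conn.invalidate()    # targeted: drop just this connection
        # engine.dispose()   # heavier hammer: throw away the whole pool
        raise
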
>>
>> The change to the environment here would be that this timeout is
>> happening at all - the reason for that is not yet known.   If oslo.db's
>> version were involved in this error, I would guess that it would be
>> related to causing this timeout condition, and not anything to do
>> with the connection provisioning.
>>
>>>> Olso.db 4.13.3 did hit the scene about the time this showed up. So I
>>>> think we need to strongly consider blocking it and revisiting these
>>>> issues post newton.
>
> We've been seeing similar errors in Heat since at least Liberty  
> (https://bugs.launchpad.net/heat/+bug/1499669). Mike and I did some  
> poking around yesterday and basically confirmed his theory above. If you  
> look at the PyMySQL code, it believes that only an IOError can occur  
> while writing to a socket, so it has no handling for any other type of  
> exception; as a result, it can't deal with signal handlers raising  
> exceptions, or with other exceptions being thrown into the greenthread  
> by eventlet. It sounds like sqlalchemy also fails to catch at least some  
> of these exceptions and invalidate the connection.
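
A quick way to see why those exceptions sail straight past IOError-only
handling of the sort Zane describes, assuming the current fixtures and
eventlet class hierarchies:

    import eventlet
    import fixtures

    # Neither of the exception types that interrupt the greenthread is an
    # IOError, so socket-level error handling that only catches IOError
    # never sees them.
    print(issubclass(fixtures.TimeoutException, IOError))  # False
    print(issubclass(eventlet.Timeout, IOError))           # False
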
>
> tl;dr this appears to have been around forever (at least since we  
> switched to using a pure-Python MySQL client) and is almost certainly  
> completely unrelated to any particular release of oslo.db.
>
> cheers,
> Zane.