[openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly

Sean Dague sean at dague.net
Thu Sep 15 12:07:09 UTC 2016


On 09/15/2016 05:52 AM, Roman Podoliaka wrote:
> Mike,
> 
> On Thu, Sep 15, 2016 at 5:48 AM, Mike Bayer <mbayer at redhat.com> wrote:
> 
>> * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur?
>> If so, was it also accompanied by the same "resource closed" condition or
>> did this second part of the condition only appear at 4.13.3?
>> * Did we see a similar "timeout" / "resource closed" combination prior to
>> 4.13.3, just with less frequency?
> 
> I believe we did -
> https://bugs.launchpad.net/openstack-ci/+bug/1216851 , although we
> used mysql-python back then, so the error was slightly different.
> 
>> * What is the magnitude of the "timeout" this fixture is using? Is it
>> on the order of seconds, minutes, or hours?
> 
> It's set in seconds per project in .testr.conf, e.g.:
> 
> https://github.com/openstack/nova/blob/master/.testr.conf
> https://github.com/openstack/ironic/blob/master/.testr.conf
> 
> In Nova we also have a 'timeout scaling factor' specifically set for
> migration tests:
> 
> https://github.com/openstack/nova/blob/master/nova/tests/unit/db/test_migrations.py#L67
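
For context, a minimal sketch of how a per-test timeout can be derived
from OS_TEST_TIMEOUT plus such a scaling factor (the names here are
illustrative, not the exact Nova/oslotest code):

    import os
    import fixtures

    def scaled_timeout_fixture(scaling_factor=1):
        # Base timeout comes from OS_TEST_TIMEOUT (set per project in
        # .testr.conf); 0 or unset means no timeout is enforced.
        try:
            timeout = int(os.environ.get('OS_TEST_TIMEOUT', 0))
        except ValueError:
            timeout = 0
        if timeout <= 0:
            return None
        # Migration tests multiply the base timeout because they are
        # heavy on disk IO and routinely the slowest in the suite.
        return fixtures.Timeout(timeout * scaling_factor, gentle=True)
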
> 
>> * If many minutes or hours, can the test suite be observed to be stuck on
>> this test?   Has someone tried to run a "SHOW PROCESSLIST" while this
>> condition is occurring to see what SQL is pausing?
> 
> We could try to do that in the gate, but I don't expect to see
> anything interesting: IMO, we'd see regular queries that should have
> been executed fast, but actually took much longer (presumably due
> to heavy disk IO caused by multiple workers running similar tests in
> parallel).
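
If someone does want to poll the server while a test looks stuck, here
is a minimal sketch to run from a second session (the connection URL
and the 5-second threshold are just examples):

    import sqlalchemy as sa

    # The usual openstack_citest credentials; adjust for your setup.
    engine = sa.create_engine(
        'mysql+pymysql://openstack_citest:openstack_citest@localhost/mysql')
    with engine.connect() as conn:
        for row in conn.execute(sa.text('SHOW FULL PROCESSLIST')):
            # Columns include Id, User, db, Command, Time, State, Info.
            if row['Command'] == 'Query' and row['Time'] > 5:
                print(row['Time'], row['State'], row['Info'])
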
> 
>> * Is this failure only present within the Nova test suite or has it been
>> observed in the test suites of other projects?
> 
> According to
> 
> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
> 
> it's mostly Nova, but this has also been observed in Ironic, Manila
> and Ceilometer. Ironic and Manila have OS_TEST_TIMEOUT set to 60
> seconds.
> 
>> * Is this failure present only on the "database migration" test suite or is
>> it present in other opportunistic tests, for Nova and others?
> 
> Based on the console logs I've checked, only migration tests failed,
> but that's probably because they are usually the slowest ones (again,
> presumably due to heavy disk IO).
> 
>> * Have there been new database migrations added to Nova which are being
>> exercised here and may be involved?
> 
> Looks like there were no changes recently:
> 
> https://review.openstack.org/#/q/project:openstack/nova+status:merged+branch:master+(file:%22%255Enova/db/sqlalchemy/migrate_repo/.*%2524%22+OR+file:%22%255Enova/tests/unit/db/test_migrations.py%2524%22)
> 
>> I'm not sure how much of an inconvenience it is to downgrade oslo.db. If
>> downgrading it is feasible, that would at least be a way to eliminate it as
>> a possibility if these same failures continue to occur, or a way to confirm
>> its involvement if they disappear.   But if downgrading is disruptive then
>> there are other things to look at in order to have a better chance at
>> predicting its involvement.
> 
> I don't think we need to block oslo.db 4.13.3, unless we clearly see
> it's this version that causes these failures.
> 
> I gave version 4.11 (before the changes to provisioning) a try on my
> local machine and saw the very same errors when the concurrency level
> is high ( http://paste.openstack.org/show/577350/ ), so I don't think
> the latest oslo.db release has anything to do with the increase in the
> number of failures in CI.
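
For reference, something along these lines should reproduce it against
a local mysql; the concurrency value and test filter are only examples:

    OS_TEST_TIMEOUT=60 testr run --parallel --concurrency 8 \
        nova.tests.unit.db.test_migrations
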
> 
> My current understanding is that the load on gate nodes has somehow
> increased (either we run more testr workers in parallel now, or
> apply/test more migrations, or simply run more VMs per host, or the
> gate is just busy at this point of the release cycle), so we have
> started to see these timeouts more often.

The migration count is definitely going to grow over time; that's the
nature of the beast. Nova hasn't had a migration collapse in quite a
while. The higher patch volume in Nova and the larger number of db
migrations could definitely account for Nova being hit more often here.

Is there a better timeout value that you think will make these timeouts
happen less often?

	-Sean

-- 
Sean Dague
http://dague.net


