[openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly

Roman Podoliaka rpodolyaka at mirantis.com
Thu Sep 15 09:52:42 UTC 2016


Mike,

On Thu, Sep 15, 2016 at 5:48 AM, Mike Bayer <mbayer at redhat.com> wrote:

> * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur?
> If so, was it also accompanied by the same "resource closed" condition or
> did this second part of the condition only appear at 4.13.3?
> * Did we see a similar "timeout" / "resource closed" combination prior to
> 4.13.3, just with less frequency?

I believe we did -
https://bugs.launchpad.net/openstack-ci/+bug/1216851 , although we
used mysql-python back then, so the error was slightly different.

> * What is the magnitude of the "timeout" this fixture is using, is it on the
> order of seconds, minutes, hours?

It's set in seconds per project in .testr.conf, e.g.:

https://github.com/openstack/nova/blob/master/.testr.conf
https://github.com/openstack/ironic/blob/master/.testr.conf
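
For reference, the timeout is just an environment variable exported by
the test command in .testr.conf. The relevant bit looks roughly like
this (the 160 is only an illustrative default and differs per project;
it can be overridden by exporting OS_TEST_TIMEOUT):

    test_command=OS_STDOUT_CAPTURE=${OS_STDOUT_CAPTURE:-1} \
                 OS_STDERR_CAPTURE=${OS_STDERR_CAPTURE:-1} \
                 OS_TEST_TIMEOUT=${OS_TEST_TIMEOUT:-160} \
                 ${PYTHON:-python} -m subunit.run discover -t ./ ${OS_TEST_PATH:-./nova/tests/unit} $LISTOPT $IDOPTION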

In Nova we also have a 'timeout scaling factor' specifically set for
migration tests:

https://github.com/openstack/nova/blob/master/nova/tests/unit/db/test_migrations.py#L67
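
Roughly, those two knobs combine like this (a simplified sketch rather
than the exact Nova code, so names and details may differ):

    import os

    import fixtures
    import testtools


    class MigrationsTestCase(testtools.TestCase):

        # Migration tests get extra time on top of the project-wide
        # OS_TEST_TIMEOUT value.
        TIMEOUT_SCALING_FACTOR = 2

        def setUp(self):
            super(MigrationsTestCase, self).setUp()
            try:
                timeout = int(os.environ.get('OS_TEST_TIMEOUT', 0))
            except ValueError:
                timeout = 0
            timeout *= self.TIMEOUT_SCALING_FACTOR
            if timeout > 0:
                # gentle=True makes the fixture raise an exception in the
                # test instead of letting SIGALRM kill the process.
                self.useFixture(fixtures.Timeout(timeout, gentle=True))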

> * If many minutes or hours, can the test suite be observed to be stuck on
> this test?   Has someone tried to run a "SHOW PROCESSLIST" while this
> condition is occurring to see what SQL is pausing?

We could try to do that in the gate, but I don't expect to see
anything interesting: IMO, we'd just see ordinary queries that should
execute quickly but actually take much longer (presumably due to
heavy disk I/O caused by multiple workers running similar tests in
parallel).
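
If someone does get a chance to try it, running something like

    SHOW FULL PROCESSLIST;

from a separate session against the opportunistic database while a
worker looks stuck should show which statements are in flight and how
long they have been running.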

> * Is this failure only present within the Nova test suite or has it been
> observed in the test suites of other projects?

According to

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22

it's mostly Nova, but this has also been observed in Ironic, Manila
and Ceilometer. Ironic and Manila have OS_TEST_TIMEOUT set to 60
seconds.

> * Is this failure present only on the "database migration" test suite or is
> it present in other opportunistic tests, for Nova and others?

Based on the console logs I've checked, only migration tests have
failed, but that's probably because they are usually the slowest ones
(again, presumably due to heavy disk I/O).

> * Have there been new database migrations added to Nova which are being
> exercised here and may be involved?

Looks like there have been no recent changes:

https://review.openstack.org/#/q/project:openstack/nova+status:merged+branch:master+(file:%22%255Enova/db/sqlalchemy/migrate_repo/.*%2524%22+OR+file:%22%255Enova/tests/unit/db/test_migrations.py%2524%22)

> I'm not sure how much of an inconvenience it is to downgrade oslo.db. If
> downgrading it is feasible, that would at least be a way to eliminate it as
> a possibility if these same failures continue to occur, or a way to confirm
> its involvement if they disappear.   But if downgrading is disruptive then
> there are other things to look at in order to have a better chance at
> predicting its involvement.

I don't think we need to block oslo.db 4.13.3, unless we clearly see
it's this version that causes these failures.

I gave version 4.11 (before the changes to provisioning) a try on my
local machine and saw the very same errors when the concurrency level
is high ( http://paste.openstack.org/show/577350/ ), so I don't think
the latest oslo.db release has anything to do with the increase in
the number of failures on CI.
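
For the record, the local run was something along these lines (the
exact commands and the concurrency value here are just illustrative):

    pip install oslo.db==4.11.0
    testr run --parallel --concurrency=8 nova.tests.unit.db.test_migrations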

My current understanding is that the load on gate nodes has somehow
increased (either we run more testr workers in parallel now, or
apply/test more migrations, or simply run more VMs per host, or the
gate is just busy at this point of the release cycle), so we have
started to see these timeouts more often.

Thanks,
Roman


