[openstack-dev] [oslo.db] [release] opportunistic tests breaking randomly

Roman Podoliaka rpodolyaka at mirantis.com
Thu Sep 15 13:20:27 UTC 2016


Sean,

Currently we have a default timeout of 160s in Nova, and specifically
for migration tests we set a scaling factor of 2. Let's give 2.5 or 3
a try ( https://review.openstack.org/#/c/370805/ ) and do a couple of
"rechecks" to see whether it helps.
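With the 160s base, that means migration tests currently get an
effective 320s; a factor of 2.5 or 3 would raise that to 400s or 480s
respectively.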

In Ocata we could revisit collapsing the migrations to reduce the
number of scripts.

On the testing side, we could probably "cheat" a bit and trade data
safety for performance. E.g. we could set "fsync = off" for PostgreSQL
(https://www.postgresql.org/docs/9.2/static/runtime-config-wal.html).
Similar settings should be available for MySQL as well.
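
For illustration, this is the kind of thing I have in mind (just a
sketch, not something we run anywhere today; the exact values would
need validation for our test setup):

    # postgresql.conf -- relax durability for throwaway test databases
    fsync = off
    synchronous_commit = off
    full_page_writes = off

    # my.cnf, [mysqld] section -- rough MySQL/InnoDB equivalents
    innodb_flush_log_at_trx_commit = 0
    sync_binlog = 0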

Thanks,
Roman

On Thu, Sep 15, 2016 at 3:07 PM, Sean Dague <sean at dague.net> wrote:
> On 09/15/2016 05:52 AM, Roman Podoliaka wrote:
>> Mike,
>>
>> On Thu, Sep 15, 2016 at 5:48 AM, Mike Bayer <mbayer at redhat.com> wrote:
>>
>>> * Prior to oslo.db 4.13.3, did we ever see this "timeout" condition occur?
>>> If so, was it also accompanied by the same "resource closed" condition or
>>> did this second part of the condition only appear at 4.13.3?
>>> * Did we see a similar "timeout" / "resource closed" combination prior to
>>> 4.13.3, just with less frequency?
>>
>> I believe we did -
>> https://bugs.launchpad.net/openstack-ci/+bug/1216851 , although we
>> used mysql-python back then, so the error was slightly different.
>>
>>> * What is the magnitude of the "timeout" this fixture is using, is it on the
>>> order of seconds, minutes, hours?
>>
>> It's set in seconds per project in .testr.conf, e.g.:
>>
>> https://github.com/openstack/nova/blob/master/.testr.conf
>> https://github.com/openstack/ironic/blob/master/.testr.conf
>>
>> In Nova we also have a 'timeout scaling factor' specifically set for
>> migration tests:
>>
>> https://github.com/openstack/nova/blob/master/nova/tests/unit/db/test_migrations.py#L67
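
For reference, the way the two knobs combine is roughly as follows (a
simplified sketch with approximate names, not the exact Nova code):

    import os

    import fixtures
    import testtools


    class MigrationTestCase(testtools.TestCase):
        # Extra headroom on top of OS_TEST_TIMEOUT (the value exported
        # from .testr.conf); this is the factor proposed to go to 2.5
        # or 3.
        TIMEOUT_SCALING_FACTOR = 2

        def setUp(self):
            super(MigrationTestCase, self).setUp()
            try:
                timeout = int(os.environ.get('OS_TEST_TIMEOUT', 0))
            except ValueError:
                timeout = 0
            if timeout > 0:
                # Abort the test with a timeout error once the scaled
                # budget is exhausted instead of letting it hang.
                self.useFixture(fixtures.Timeout(
                    int(timeout * self.TIMEOUT_SCALING_FACTOR),
                    gentle=True))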
>>
>>> * If many minutes or hours, can the test suite be observed to be stuck on
>>> this test?   Has someone tried to run a "SHOW PROCESSLIST" while this
>>> condition is occurring to see what SQL is pausing?
>>
>> We could try to do that in the gate, but I don't expect to see
>> anything interesting: IMO, we'd see regular queries that should have
>> executed quickly but actually took much longer (presumably due to
>> heavy disk IO caused by multiple workers running similar tests in
>> parallel).
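
If someone does want to try, a poller along these lines could be run
next to the test workers (a rough sketch; the connection URL and the
interval are placeholders):

    import time

    import sqlalchemy

    # Placeholder URL -- point it at the opportunistic test database.
    engine = sqlalchemy.create_engine(
        'mysql+pymysql://openstack_citest:secret@localhost/mysql')

    while True:
        with engine.connect() as conn:
            # For PostgreSQL the equivalent query would be
            # "SELECT * FROM pg_stat_activity".
            result = conn.execute(sqlalchemy.text('SHOW FULL PROCESSLIST'))
            for row in result:
                print(row)
        time.sleep(5)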
>>
>>> * Is this failure only present within the Nova test suite or has it been
>>> observed in the test suites of other projects?
>>
>> According to
>>
>> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.ResourceClosedError%5C%22
>>
>> it's mostly Nova, but this has also been observed in Ironic, Manila
>> and Ceilometer. Ironic and Manila have OS_TEST_TIMEOUT set to 60
>> seconds.
>>
>>> * Is this failure present only on the "database migration" test suite or is
>>> it present in other opportunistic tests, for Nova and others?
>>
>> Based on the console logs I've checked, only migration tests failed,
>> but that's probably because they are usually the slowest ones (again,
>> presumably due to heavy disk IO).
>>
>>> * Have there been new database migrations added to Nova which are being
>>> exercised here and may be involved?
>>
>> Looks like there were no changes recently:
>>
>> https://review.openstack.org/#/q/project:openstack/nova+status:merged+branch:master+(file:%22%255Enova/db/sqlalchemy/migrate_repo/.*%2524%22+OR+file:%22%255Enova/tests/unit/db/test_migrations.py%2524%22)
>>
>>> I'm not sure how much of an inconvenience it is to downgrade oslo.db. If
>>> downgrading it is feasible, that would at least be a way to eliminate it as
>>> a possibility if these same failures continue to occur, or a way to confirm
>>> its involvement if they disappear.   But if downgrading is disruptive then
>>> there are other things to look at in order to have a better chance at
>>> predicting its involvement.
>>
>> I don't think we need to block oslo.db 4.13.3, unless we clearly see
>> it's this version that causes these failures.
>>
>> I gave version 4.11 (before the changes to provisioning) a try on my
>> local machine and saw the very same errors when the concurrency level
>> is high ( http://paste.openstack.org/show/577350/ ), so I don't think
>> the latest oslo.db release has anything to do with the increase in
>> the number of failures on CI.
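
E.g. something like this cranks up the number of parallel workers
locally (the concurrency value and test filter are just examples):

    testr run --parallel --concurrency=8 \
        nova.tests.unit.db.test_migrations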
>>
>> My current understanding is that the load on gate nodes has somehow
>> increased (either we run more testr workers in parallel now, or we
>> apply/test more migrations, or more VMs run per host, or the gate is
>> simply busy at this point of the release cycle), so we have started
>> to see these timeouts more often.
>
> The migration count is definitely going to grow over time; that's the
> nature of the beast. Nova hasn't had a migration collapse in quite a
> while. The higher patch volume in Nova and the larger number of db
> migrations could definitely account for Nova being higher here.
>
> Is there a better timeout value that you think will make these timeouts
> happen less often?
>
>         -Sean
>
> --
> Sean Dague
> http://dague.net
>


