[neutron][all] OOM killer on coverage job
Hi,

Neutron has been having issues with our coverage gate job triggering the OOM killer since last week [0], which I just confirmed by holding a node and looking in the logs. It started happening after the sqlalchemy 2.0 bump [1], but that might just be exposing the underlying issue.

Running locally I can see via /proc/meminfo that memory is getting consumed:

MemTotal: 8123628 kB
MemFree:  1108404 kB

And via ps it's the coverage processes doing it:

PID %MEM RSS PPID TIME NLWP WCHAN COMMAND
4315 30.9 2516348 4314 01:29:07 1 - /opt/stack/neutron/.tox/cover/bin/python /opt/stack/neutron/.tox/cover/bin/coverage run --source neutron --parallel-mode -m stestr.subunit_runner.run discover -t ./ ./neutron/tests/unit --load-list /tmp/tmp0rhqfwhz
4313 30.0 2437500 4312 01:28:50 1 - /opt/stack/neutron/.tox/cover/bin/python /opt/stack/neutron/.tox/cover/bin/coverage run --source neutron --parallel-mode -m stestr.subunit_runner.run discover -t ./ ./neutron/tests/unit --load-list /tmp/tmpfzmqyuub

(and the test run hasn't even finished yet)

The only workaround seems to be reducing concurrency [2].

Have any other projects seen anything similar?

(and sorry for the html email)

-Brian

[0] https://bugs.launchpad.net/neutron/+bug/2065821
[1] https://review.opendev.org/c/openstack/requirements/+/879743
[2] https://review.opendev.org/c/openstack/neutron/+/920766
Can you maybe try reducing / removing the use of the "subqueryload" loader strategy and replacing it with "selectin"? One of the most egregious patterns neutron has is excessive use of "subqueryload", which generates huge queries that are expensive to cache, expensive on the server, expensive to run, etc.

The "selectinload" strategy, when I first added it (and it's now very mature), was mostly motivated by observing how badly neutron relies on the very overwrought "subqueryload" queries.

In theory, all subqueryload can be replaced with selectinload directly. However, obviously I'd do this more carefully.
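As a rough illustration of the query-level swap being suggested (a minimal, self-contained sketch; the Network and Port models below are hypothetical stand-ins, not the actual neutron models):

from sqlalchemy import ForeignKey, create_engine, select
from sqlalchemy.orm import (DeclarativeBase, Mapped, Session, mapped_column,
                            relationship, selectinload, subqueryload)

class Base(DeclarativeBase):
    pass

class Network(Base):
    __tablename__ = "networks"
    id: Mapped[int] = mapped_column(primary_key=True)
    ports: Mapped[list["Port"]] = relationship(back_populates="network")

class Port(Base):
    __tablename__ = "ports"
    id: Mapped[int] = mapped_column(primary_key=True)
    network_id: Mapped[int] = mapped_column(ForeignKey("networks.id"))
    network: Mapped["Network"] = relationship(back_populates="ports")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # subqueryload embeds the original (possibly large) query as a subquery
    # for each eagerly loaded relationship.
    nets = session.scalars(
        select(Network).options(subqueryload(Network.ports))
    ).all()

    # selectinload instead emits a second, simple SELECT ... WHERE ... IN (...)
    # per relationship, which is cheaper to build, cache and execute.
    nets = session.scalars(
        select(Network).options(selectinload(Network.ports))
    ).all()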
On Wed, May 29, 2024 at 5:26 PM Mike Bayer <mike_mp@zzzcomputing.com> wrote:
can you maybe try reducing / removing the use of the "subqueryload" loader strategy and replacing with "selectin" ? One of the most egregious patterns neutron has is excessive use of "subqueryload" which generates huge queries that are expensive to cache, expensive on the server, expensive to run, etc.
The "subqueryload" substring is only found in a single file (neutron/plugins/ml2/drivers/l2pop/db.py) in the neutron tree, two occurrences. I don't see it mentioned in neutron-lib anywhere either. Am I missing something?
the "selectinload" strategy, when I first added it (and it's now very mature) was mostly after observing how badly neutron relies on the very overwrought "subqueryload" queries.
in theory, all subqueryload can be replaced with selectinload directly. however obviously I'd do this more carefully.
On Wed, May 29, 2024, at 4:50 PM, Brian Haley wrote:
Hi,
Neutron has been having issues with our coverage gate job triggering the OOM killer since last week [0], which I just confirmed by holding a node and looking in the logs. It started happening after the sqlalchemy 2.0 bump [1], but that just might be exposing the underlying issue.
Running locally I can see via /proc/meminfo that memory is getting consumed:
MemTotal: 8123628 kB MemFree: 1108404 kB
And via ps it's the coverage processes doing it:
PID %MEM RSS PPID TIME NLWP WCHAN COMMAND
4315 30.9 2516348 4314 01:29:07 1 - /opt/stack/neutron/.tox/cover/bin/python /opt/stack/neutron/.tox/cover/bin/coverage run --source neutron --parallel-mode -m stestr.subunit_runner.run discover -t ./ ./neutron/tests/unit --load-list /tmp/tmp0rhqfwhz 4313 30.0 2437500 4312 01:28:50 1 - /opt/stack/neutron/.tox/cover/bin/python /opt/stack/neutron/.tox/cover/bin/coverage run --source neutron --parallel-mode -m stestr.subunit_runner.run discover -t ./ ./neutron/tests/unit --load-list /tmp/tmpfzmqyuub
(and the test hasn't even finished yet)
Only workaround seems to be reducing concurrency [2].
Have any other projects seen anything similar?
(and sorry for the html email)
-Brian
[0] https://bugs.launchpad.net/neutron/+bug/2065821 [1] https://review.opendev.org/c/openstack/requirements/+/879743
On Wed, May 29, 2024, at 8:30 PM, Ihar Hrachyshka wrote:
The "subqueryload" substring is only found in a single file (neutron/plugins/ml2/drivers/l2pop/db.py) in the neutron tree, two occurrences. I don't see it mentioned in neutron-lib anywhere either. Am I missing something?
Yes, the lazy setting as well:

$ find neutron -name "*.py" -exec grep -H 'lazy="subquery"' {} \;
neutron/db/models/allowed_address_pair.py: lazy="subquery", cascade="delete"))
neutron/db/models/metering.py: cascade="delete", lazy="subquery")
neutron/db/models_v2.py: lazy="subquery",
neutron/db/models_v2.py: lazy="subquery")

Try changing to "selectin" for those; that's the default loading scheme for those attributes. Then yes, the two subqueryload calls in l2pop/db.py can be changed also.

All of that said, there shouldn't be a big difference between SQLA 1.4 and 2.0 as far as memory use of query structures. The major difference going to 2.0 is that the whole "autocommit" notion goes away and you are always in a transaction block that needs to be explicitly ended.
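A minimal sketch of both points, using hypothetical stand-in models rather than the real neutron ones: the relationship()'s lazy argument moves from "subquery" to "selectin", and 2.0-style code does its work inside an explicit transaction block:

from sqlalchemy import Column, ForeignKey, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Port(Base):
    __tablename__ = "ports"
    id = Column(Integer, primary_key=True)
    allowed_address_pairs = relationship(
        "AllowedAddressPair",
        lazy="selectin",   # was: lazy="subquery"
        cascade="delete",
    )

class AllowedAddressPair(Base):
    __tablename__ = "allowedaddresspairs"
    id = Column(Integer, primary_key=True)
    port_id = Column(Integer, ForeignKey("ports.id"))
    ip_address = Column(String(64))

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

# No implicit "autocommit" in 2.0: changes happen inside an explicit
# transaction block that commits (or rolls back) when it ends.
with Session(engine) as session:
    with session.begin():
        session.add(Port(id=1))
    # With lazy="selectin", the collection is eagerly loaded right after the
    # Port rows come back, via a single SELECT ... WHERE port_id IN (...).
    ports = session.scalars(select(Port)).all()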
the "selectinload" strategy, when I first added it (and it's now very mature) was mostly after observing how badly neutron relies on the very overwrought "subqueryload" queries.
in theory, all subqueryload can be replaced with selectinload directly. however obviously I'd do this more carefully.
On Wed, May 29, 2024, at 4:50 PM, Brian Haley wrote:
Hi,
Neutron has been having issues with our coverage gate job triggering the OOM killer since last week [0], which I just confirmed by holding a node and looking in the logs. It started happening after the sqlalchemy 2.0 bump [1], but that just might be exposing the underlying issue.
Running locally I can see via /proc/meminfo that memory is getting consumed:
MemTotal: 8123628 kB MemFree: 1108404 kB
And via ps it's the coverage processes doing it:
PID %MEM RSS PPID TIME NLWP WCHAN COMMAND
4315 30.9 2516348 4314 01:29:07 1 - /opt/stack/neutron/.tox/cover/bin/python /opt/stack/neutron/.tox/cover/bin/coverage run --source neutron --parallel-mode -m stestr.subunit_runner.run discover -t ./ ./neutron/tests/unit --load-list /tmp/tmp0rhqfwhz 4313 30.0 2437500 4312 01:28:50 1 - /opt/stack/neutron/.tox/cover/bin/python /opt/stack/neutron/.tox/cover/bin/coverage run --source neutron --parallel-mode -m stestr.subunit_runner.run discover -t ./ ./neutron/tests/unit --load-list /tmp/tmpfzmqyuub
(and the test hasn't even finished yet)
Only workaround seems to be reducing concurrency [2].
Have any other projects seen anything similar?
(and sorry for the html email)
-Brian
[0] https://bugs.launchpad.net/neutron/+bug/2065821 [1] https://review.opendev.org/c/openstack/requirements/+/879743
Hi Mike,

On 5/29/24 10:31 PM, Mike Bayer wrote:
try changing to "selectin" for those. that's the default loading scheme for those attributes. then yes the two subqueryload calls in l2pop/db.py can be changed also.
Thanks for the info! I can create a patch changing these, regardless of whether it helps with the memory issue. There were actually a lot more occurrences than that grep found, as we don't consistently use double quotes, e.g. lazy='subquery'.

As I mentioned, the only reason I picked on sqlalchemy was that reverting the bump made the problem go away. I still can't explain that, unfortunately.

-Brian
On Wed, May 29, 2024, at 1:50 PM, Brian Haley wrote:
Only workaround seems to be reducing concurrency [2].
Other things that came to mind are that maybe you are gathering coverage info for more files than necessary. That isn't the case, though: --source neutron is passed, and looking at coverage reports we can see no other sources are included.

I also notice that upper-constraints for coverage is set to 7.5.1, but there is a (very recent) 7.5.3 release which claims to have some memory improvements [3] that may be worth trying. The code that was modified to improve memory use was introduced in 7.5.0 as well (if I've read the git history properly anyway). Looking at requirements, we jumped from 7.4.4 to 7.5.1 less than a week ago [4]. Depending on the timing of this new issue, this may be more than coincidence.
[3] https://coverage.readthedocs.io/en/7.5.3/changes.html
[4] https://review.opendev.org/c/openstack/requirements/+/920283/3/upper-constra...
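As a rough sketch (not the actual neutron tox wiring), the gate's "coverage run --source neutron --parallel-mode" invocation corresponds approximately to the following use of coverage.py's API, which shows that source=["neutron"] already limits measurement to the one package:

import coverage

# Roughly what "coverage run --source neutron --parallel-mode ..." sets up:
# only the neutron package is measured, and each process writes its own
# suffixed .coverage.* data file to be combined later.
cov = coverage.Coverage(source=["neutron"], data_suffix=True)
cov.start()
# ... the test runner would execute here ...
cov.stop()
cov.save()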
Hi Clark,

On 5/29/24 10:06 PM, Clark Boylan wrote:
I also notice that upper-constraints for coverage is set to 7.5.1 but there is a (very recent) 7.5.3 release which claims to have some memory improvements [3] that may be worth trying. The code that was modified to improve memory use was introduced in 7.5.0 as well (if I've read git history properly anyway). Looking at requirements we jumped from 7.4.4 to 7.5.1 less than a week ago [4]. Depending on the timing of this new issue this may be more than coincidence.
Thanks for noticing the coverage bump, which happened around the same time and might make more sense as the cause. I've started a run using 7.4.4 and, if that looks better, will try 7.5.3 as well. In my case coverage was 7.5.1 in both runs, which is why I looked at other changes.

-Brian
participants (4):
- Brian Haley
- Clark Boylan
- Ihar Hrachyshka
- Mike Bayer