[neutron][all] OOM killer on coverage job

29 May 2024

Hi,

Neutron's coverage gate job has been triggering the OOM killer since
last week [0], which I just confirmed by holding a node and checking
the logs. It started happening after the SQLAlchemy 2.0 bump [1], but
that might just be exposing an underlying issue.

Running locally, I can see via /proc/meminfo that memory is being consumed:

MemTotal:        8123628 kB
MemFree:         1108404 kB
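
For anyone who wants to watch this on a held node, here is a minimal Python sketch that reads the same two fields. The helper name parse_meminfo is hypothetical, and the sample text is just the two lines quoted above; on a real node you would read /proc/meminfo instead. Keep in mind that MemFree excludes reclaimable cache, so it overstates the pressure somewhat.

```python
def parse_meminfo(text):
    """Return a dict of field name -> value in kB from /proc/meminfo-style text."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Sample taken from the numbers quoted above; on a live node use
# open("/proc/meminfo").read() instead.
SAMPLE = """\
MemTotal:        8123628 kB
MemFree:         1108404 kB
"""

info = parse_meminfo(SAMPLE)
free_pct = 100.0 * info["MemFree"] / info["MemTotal"]
print(f"free: {info['MemFree']} kB of {info['MemTotal']} kB ({free_pct:.1f}%)")
```

With the numbers above, that works out to roughly 13.6% free.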

And via ps it's the coverage processes doing it:

 PID  %MEM      RSS  PPID      TIME  NLWP  WCHAN  COMMAND
4315  30.9  2516348  4314  01:29:07     1  -      /opt/stack/neutron/.tox/cover/bin/python
    /opt/stack/neutron/.tox/cover/bin/coverage run --source neutron
    --parallel-mode -m stestr.subunit_runner.run discover -t ./
    ./neutron/tests/unit --load-list /tmp/tmp0rhqfwhz
4313  30.0  2437500  4312  01:28:50     1  -      /opt/stack/neutron/.tox/cover/bin/python
    /opt/stack/neutron/.tox/cover/bin/coverage run --source neutron
    --parallel-mode -m stestr.subunit_runner.run discover -t ./
    ./neutron/tests/unit --load-list /tmp/tmpfzmqyuub

(and the test hasn't even finished yet)
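
To put a number on that, here is a small sketch that totals the RSS column for the two rows above. The column order (PID, %MEM, RSS, ...) is taken from the output shown; the function name total_rss_kb and the shortened command strings are only for illustration.

```python
def total_rss_kb(rows):
    """Sum the RSS column (third field, in kB) of ps-style rows."""
    total = 0
    for row in rows:
        fields = row.split()
        total += int(fields[2])  # fields: PID, %MEM, RSS, PPID, TIME, ...
    return total


# The two coverage workers from the listing above, commands shortened.
ROWS = [
    "4315 30.9 2516348 4314 01:29:07 1 - coverage run",
    "4313 30.0 2437500 4312 01:28:50 1 - coverage run",
]
MEM_TOTAL_KB = 8123628  # MemTotal from /proc/meminfo above

rss = total_rss_kb(ROWS)
print(f"{rss} kB RSS, {100.0 * rss / MEM_TOTAL_KB:.1f}% of MemTotal")
```

So these two workers alone hold about 61% of the machine's memory, in line with the %MEM column.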

The only workaround so far seems to be reducing concurrency [2].
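
As a rough illustration of why lower concurrency helps, the sketch below picks a worker count from a memory budget. It is only a back-of-the-envelope heuristic and is not meant to mirror the change in [2]; the function name, the 2.5 GB per-worker figure (rounded from the RSS values above), the 1 GB of headroom and the CPU count are all assumptions.

```python
def max_workers(mem_total_kb, per_worker_kb, headroom_kb, cpu_count):
    """Cap the worker count so the estimated peak RSS fits in memory."""
    budget = mem_total_kb - headroom_kb
    by_memory = max(1, budget // per_worker_kb)  # always allow one worker
    return min(cpu_count, by_memory)


workers = max_workers(
    mem_total_kb=8123628,   # MemTotal from /proc/meminfo above
    per_worker_kb=2500000,  # assumed: roughly the RSS seen per worker
    headroom_kb=1000000,    # assumed: room for the OS and other processes
    cpu_count=8,            # assumed: CPU count of the node
)
print(workers)
```

Since the run had not finished when those RSS values were captured, the real per-worker peak may be higher, which would push the estimate down further. A value like this could be handed to stestr through its --concurrency option.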

Have any other projects seen anything similar?

(and sorry for the HTML email)

-Brian

[0] https://bugs.launchpad.net/neutron/+bug/2065821
[1] https://review.opendev.org/c/openstack/requirements/+/879743
[2] https://review.opendev.org/c/openstack/neutron/+/920766

Brian Haley