[openstack-dev] [infra][Neutron] Running out of memory on gate for linux bridge job

Ian Wienand iwienand at redhat.com
Thu Jan 19 00:04:10 UTC 2017


On 01/14/2017 02:48 AM, Jakub Libosvar wrote:
> recently I noticed we got oom-killer in action in one of our jobs [1]. 

> Any other ideas?

I spent quite a while chasing down similar things with CentOS some
time ago.  I do have some ideas :)

The symptom is probably that mysql gets chosen by the OOM killer,
but it's unlikely to be mysql's fault; it's just big and therefore a
good target.

If the system is going offline entirely, I added the ability to turn
on the netconsole in devstack-gate with [1].  As the comment mentions,
you can put little tests that stream data into /dev/kmsg and they will
generally get off the host, even if ssh has been killed.  I found this
very useful for getting the initial oops data (I've used this several
times for other gate oopses, including other kernel issues we've
seen).
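
To give a concrete flavour (this is just a sketch of my own, not
something that exists in devstack-gate, and the names are made up),
the kind of little test I mean is a loop that pushes a one-line
memory summary into /dev/kmsg so netconsole carries it off the host:

---
# Hypothetical watcher (not part of devstack-gate): write a short
# memory summary into /dev/kmsg every 30s; netconsole streams kernel
# log records off the host even after userspace starts dying.
import time

def kmsg_memwatch(interval=30):
    while True:
        with open('/proc/meminfo') as f:
            info = dict(line.split(':', 1) for line in f)
        summary = 'memwatch: MemFree=%s MemAvailable=%s' % (
            info.get('MemFree', '?').strip(),
            info.get('MemAvailable', '?').strip())
        # each write to /dev/kmsg becomes one kernel log record;
        # keep it short (and note this needs root)
        with open('/dev/kmsg', 'w') as kmsg:
            kmsg.write(summary + '\n')
        time.sleep(interval)

if __name__ == '__main__':
    kmsg_memwatch()
---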

For starting to pin down what is really consuming the memory, the
first thing I did was write a peak-memory usage tracker that gave me
stats on memory growth during the devstack run [2].  You have to
enable this with "enable_service peakmem_tracker".  This gives you
the big picture of where the memory is going.
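
The real implementation is the shell script in [2]; very roughly, the
idea is just a polling loop along these lines (a Python sketch, with
the function and log file names being my own):

---
# Sketch of what the peakmem tracker does: poll MemAvailable and
# snapshot the process list whenever a new low-water mark is hit,
# so you can see what was resident at the worst point of the run.
import subprocess
import time

def mem_available_kb():
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('MemAvailable:'):
                return int(line.split()[1])
    return 0

def track(interval=10, logfile='/tmp/peakmem.log'):
    low_water = mem_available_kb()
    while True:
        avail = mem_available_kb()
        if avail < low_water:
            low_water = avail
            ps = subprocess.check_output(
                ['ps', '--sort=-rss', '-eo', 'rss,vsz,pid,args'])
            with open(logfile, 'a') as f:
                f.write('new low MemAvailable: %d kB\n' % avail)
                f.write(ps.decode('utf-8', 'replace'))
        time.sleep(interval)

if __name__ == '__main__':
    track()
---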

At this point, you should have a rough idea of the real cause, and
you're going to want to start dumping /proc/<pid>/smaps of target
processes to get an idea of where the memory they're allocating is
going, or at the very least what libraries might be involved.  The
next step is going to depend on what you need to target...
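
For example, a little summarizer along these lines (again just a
sketch of my own, not an existing tool) adds up Rss/Pss per mapping
so the heavy libraries and anonymous regions stand out:

---
# Sketch: sum Rss/Pss (in kB) per mapped object from /proc/<pid>/smaps
# to see which libraries or anonymous mappings actually hold the memory.
import collections
import sys

def summarize_smaps(pid):
    totals = collections.defaultdict(lambda: {'Rss': 0, 'Pss': 0})
    current = '[anon]'
    with open('/proc/%s/smaps' % pid) as f:
        for line in f:
            fields = line.split()
            if '-' in fields[0]:
                # mapping header: "addr-addr perms offset dev inode [path]"
                current = fields[5] if len(fields) > 5 else '[anon]'
            elif fields[0] in ('Rss:', 'Pss:'):
                totals[current][fields[0].rstrip(':')] += int(fields[1])
    return totals

if __name__ == '__main__':
    for name, t in sorted(summarize_smaps(sys.argv[1]).items(),
                          key=lambda kv: -kv[1]['Pss']):
        print('%8d kB rss %8d kB pss  %s' % (t['Rss'], t['Pss'], name))
---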

If it's Python, it can get a bit tricky to see where the memory is
going, but there are a number of approaches.  At the time, despite it
being mostly unmaintained, I had some success with guppy [3].  In my
case, for example, I managed to hook into swift's wsgi startup and
run that under guppy, which let me get some heap stats.  From my
notes [4] that looked something like

---
import signal
import sys

from guppy import hpy

# The surrounding boilerplate is reconstructed from swift's own
# object-server entry point; parse_options/run_wsgi/server come from
# there.
from swift.common.utils import parse_options
from swift.common.wsgi import run_wsgi
from swift.obj import server


def handler(signum, frame):
    # On SIGUSR1, dump guppy's heap statistics to a file so they can
    # be pulled off the host and compared between runs.
    f = open('/tmp/heap.txt', 'w+')
    f.write("testing\n")
    hp = hpy()
    f.write(str(hp.heap()))
    f.close()


if __name__ == '__main__':
    conf_file, options = parse_options()
    # install the heap-dumping handler before entering the normal
    # wsgi server loop
    signal.signal(signal.SIGUSR1, handler)

    sys.exit(run_wsgi(conf_file, 'object-server',
                      global_conf_callback=server.global_conf_callback,
                      **options))
---
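
With that in place, a "kill -USR1" against the object-server pid
dumps the heap statistics into /tmp/heap.txt, which you can then
compare between runs.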

There are of course other tools from gdb to malloc tracers, etc.

But that was enough that I could try different things and compare the
heap usage.  Once you've got the smoking gun ... well then the hard
work of fixing it starts :) In my case it was pycparser and we came up
with a good solution [5].

Hopefully those are some useful tips ... #openstack-infra can of
course help with holding VMs etc. as required.

-i

[1] http://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/devstack-vm-gate-wrap.sh#n438
[2] https://git.openstack.org/cgit/openstack-dev/devstack/tree/tools/peakmem_tracker.sh
[3] https://pypi.python.org/pypi/guppy/
[4] https://etherpad.openstack.org/p/oom-in-rax-centos7-CI-job
[5] https://github.com/eliben/pycparser/issues/72


