[openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

Joshua Harlow harlowja at fastmail.com
Sat Feb 4 18:13:43 UTC 2017


Another option is to turn on the following (for python 3.4+ jobs)

https://docs.python.org/3/library/tracemalloc.html

I think Victor Stinner (who we all know as haypo) has some experience 
with that, and even did some of the backport patches of it for 2.7, so he 
may have some ideas on how we can plug that in.
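
For reference, the stdlib API for this is pretty small; here's a minimal 
sketch of turning it on and looking at the top allocation sites (where and 
when a service would actually call this is the part that still needs 
figuring out):

import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation traceback

# ... run the service / test workload for a while ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)

# snapshots can also be dumped to disk and diffed later, e.g.:
#   snapshot.dump('service.snapshot')
#   newer = tracemalloc.take_snapshot()
#   for diff in newer.compare_to(snapshot, 'lineno')[:10]:
#       print(diff)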

Then, assuming the following works, we can even have a nice UI to analyze 
its reports & do comparison diffs:

http://pytracemalloc.readthedocs.io/tracemallocqt.html

One idea from mtreinish was to hook the following (or some variant of 
it) into oslo.service to get some data:

http://pytracemalloc.readthedocs.io/examples.html#thread-to-write-snapshots-into-files-every-minutes
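
Roughly, the shape of that example is a daemon thread dumping a snapshot 
to a new file every minute; something like this (the file prefix and 
interval are just placeholders):

import threading
import time
import tracemalloc

def _snapshot_loop(interval=60, prefix='/tmp/tracemalloc'):
    # dump a full snapshot to a new file each interval; these files can be
    # loaded back later with tracemalloc.Snapshot.load() (or a UI) for diffing
    counter = 1
    while True:
        time.sleep(interval)
        tracemalloc.take_snapshot().dump('%s-%04d.snapshot' % (prefix, counter))
        counter += 1

tracemalloc.start(25)
t = threading.Thread(target=_snapshot_loop)
t.daemon = True
t.start()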

Of course the other big question (one that I don't actually know the 
answer to) is how tracemalloc works in wsgi containers (such as apache or 
eventlet or uwsgi or ...). Seeing that a part of our http services run in 
such containers, it seems like a useful thing to wonder about :)
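
One guess (untested) on that front: since tracemalloc can be switched on 
via PYTHONTRACEMALLOC=<nframes> in the environment or `python -X 
tracemalloc`, it shouldn't matter too much how the process got launched; 
failing that, something could start it lazily from inside the app itself, 
along these lines (the wrapper name here is made up):

import tracemalloc

def tracing_app_factory(inner_app):
    # hypothetical wrapper: make sure tracing is on before the container
    # starts pushing requests through inner_app
    if not tracemalloc.is_tracing():
        tracemalloc.start(25)
    return inner_app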

-Josh

Joshua Harlow wrote:
> An example of what this (dozer) gathers (attached).
>
> -Josh
>
> Joshua Harlow wrote:
>> Has anyone tried:
>>
>> https://github.com/mgedmin/dozer/blob/master/dozer/leak.py#L72
>>
>> This piece of middleware creates some nice graphs (using PIL) that may
>> help identify which areas are using what memory (and/or leaking).
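>>
>> (Wiring it in is basically one line of WSGI wrapping; rough sketch below,
>> with the function and app names made up, and IIRC the graphs then get
>> served under /_dozer:)
>>
>> from dozer import Dozer
>>
>> def wrap_with_dozer(wsgi_app):
>>     # wsgi_app is whatever WSGI callable the service already exposes;
>>     # dozer samples object counts over time and renders graphs for them
>>     return Dozer(wsgi_app)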
>>
>> https://pypi.python.org/pypi/linesman might also be somewhat useful to
>> have running.
>>
>> How any process takes more than 100MB here blows my mind (horizon is
>> doing nicely, ha); what are people caching in process to have RSS that
>> large (1.95 GB, woah).
>>
>> Armando M. wrote:
>>> Hi,
>>>
[TL;DR]: OpenStack services have steadily increased their memory
footprints. We need a concerted way to address the oom-kills experienced
in the OpenStack gate, as we may have reached a ceiling.
>>>
>>> Now the longer version:
>>> --------------------------------
>>>
>>> We have been experiencing some instability in the gate lately due to a
>>> number of reasons. When everything adds up, it becomes rather difficult
>>> to merge anything, and knowing we're in feature freeze only adds to the
>>> stress. One culprit was identified to be [1].
>>>
>>> We initially tried to increase the swappiness, but that didn't seem to
>>> help. Then we looked at the resident memory in use. Going back over the
>>> past three releases, we noticed that the aggregated memory footprint of
>>> some OpenStack projects has grown steadily. We have the following:
>>>
>>> * Mitaka
>>>     o neutron: 1.40GB
>>>     o nova: 1.70GB
>>>     o swift: 640MB
>>>     o cinder: 730MB
>>>     o keystone: 760MB
>>>     o horizon: 17MB
>>>     o glance: 538MB
>>> * Newton
>>>     o neutron: 1.59GB (+13%)
>>>     o nova: 1.67GB (-1%)
>>>     o swift: 779MB (+21%)
>>>     o cinder: 878MB (+20%)
>>>     o keystone: 919MB (+20%)
>>>     o horizon: 21MB (+23%)
>>>     o glance: 721MB (+34%)
>>> * Ocata
>>>     o neutron: 1.75GB (+10%)
>>>     o nova: 1.95GB (+16%)
>>>     o swift: 703MB (-9%)
>>>     o cinder: 920MB (+4%)
>>>     o keystone: 903MB (-1%)
>>>     o horizon: 25MB (+20%)
>>>     o glance: 740MB (+2%)
>>>
>>> Numbers are approximate and I only took a couple of samples, but in a
>>> nutshell, the majority of the services have seen double-digit growth
>>> over the past two cycles in terms of the amount of RSS memory they use.
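>>>
>>> (If anyone wants to pull comparable numbers themselves, one way, just a
>>> sketch that assumes psutil is installed and matches processes by name,
>>> is to sum RSS across each service's processes:)
>>>
>>> import collections
>>>
>>> import psutil
>>>
>>> # service names matched against process command lines; horizon runs
>>> # under apache as a wsgi app, so it would need different matching
>>> SERVICES = ('neutron', 'nova', 'swift', 'cinder', 'keystone', 'glance')
>>>
>>> totals = collections.defaultdict(int)
>>> for proc in psutil.process_iter():
>>>     try:
>>>         cmdline = ' '.join(proc.cmdline())
>>>         rss = proc.memory_info().rss
>>>     except (psutil.NoSuchProcess, psutil.AccessDenied):
>>>         continue
>>>     for name in SERVICES:
>>>         if name in cmdline:
>>>             totals[name] += rss
>>>             break
>>>
>>> for name, rss in sorted(totals.items()):
>>>     print('%s: %.2f MB' % (name, rss / (1024.0 * 1024.0)))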
>>>
>>> Since [1] has been observed only since Ocata [2], I imagine it's pretty
>>> reasonable to assume that the memory increase may well be a determining
>>> factor in the oom-kills we see in the gate.
>>>
>>> Profiling and surgically reducing the memory used by each component in
>>> each service is a lengthy process, but I'd rather see some gate relief
>>> right away. Reducing the number of API workers helps bring the RSS
>>> memory back down to Mitaka levels:
>>>
>>> * neutron: 1.54GB
>>> * nova: 1.24GB
>>> * swift: 694MB
>>> * cinder: 778MB
>>> * keystone: 891MB
>>> * horizon: 24MB
>>> * glance: 490MB
>>>
>>> However, it may have other side effects, like longer execution times or
>>> an increase in timeouts.
>>>
>>> Where do we go from here? I am not particularly fond of stop-gap [4],
>>> but it is the one fix that most widely addresses the memory increase we
>>> have experienced across the board.
>>>
>>> Thanks,
>>> Armando
>>>
>>> [1] https://bugs.launchpad.net/neutron/+bug/1656386
>>> <https://bugs.launchpad.net/neutron/+bug/1656386>
>>> [2]
>>> http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22oom-killer%5C%22%20AND%20tags:syslog
>>>
>>>
>>> [3]
>>> http://logs.openstack.org/21/427921/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/82084c2/
>>>
>>>
>>> [4] https://review.openstack.org/#/c/427921
>>>