[openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

Matthew Treinish mtreinish at kortar.org
Thu Feb 2 16:10:22 UTC 2017


On Wed, Feb 01, 2017 at 04:24:54PM -0800, Armando M. wrote:
> Hi,
> 
> [TL;DR]: OpenStack services have steadily increased their memory
> footprints. We need a concerted way to address the oom-kills experienced in
> the openstack gate, as we may have reached a ceiling.
> 
> Now the longer version:
> --------------------------------
> 
> We have been experiencing some instability in the gate lately due to a
> number of reasons. When everything adds up, it becomes rather difficult to
> merge anything, and since we're in feature freeze, that adds
> to stress. One culprit was identified to be [1].
> 
> We initially tried to increase the swappiness, but that didn't seem to
> help. Then we looked at the resident memory in use. Going back over the
> past three releases, we noticed that the aggregated memory footprint of
> some openstack projects has grown steadily. We have the
> following:
> 
>    - Mitaka
>       - neutron: 1.40GB
>       - nova: 1.70GB
>       - swift: 640MB
>       - cinder: 730MB
>       - keystone: 760MB
>       - horizon: 17MB
>       - glance: 538MB
>    - Newton
>       - neutron: 1.59GB (+13%)
>       - nova: 1.67GB (-1%)
>       - swift: 779MB (+21%)
>       - cinder: 878MB (+20%)
>       - keystone: 919MB (+20%)
>       - horizon: 21MB (+23%)
>       - glance: 721MB (+34%)
>    - Ocata
>       - neutron: 1.75GB (+10%)
>       - nova: 1.95GB (+16%)
>       - swift: 703MB (-9%)
>       - cinder: 920MB (+4%)
>       - keystone: 903MB (-1%)
>       - horizon: 25MB (+20%)
>       - glance: 740MB (+2%)
> 
> Numbers are approximate and I only took a couple of samples, but in a
> nutshell, the majority of the services have seen double digit growth over
> the past two cycles in terms of the amount of RSS memory they use.
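> 
> For reference, per-service totals like the ones above can be sampled by
> summing VmRSS over each service's processes. A rough sketch (Linux /proc
> only; purely illustrative, not necessarily the exact method behind the
> numbers above):
> 
>     #!/usr/bin/env python
>     # Sum resident memory (VmRSS) for every process whose cmdline mentions
>     # one of the services, and print an approximate per-service total.
>     # Note: this double counts pages shared between forked workers, so
>     # treat the output as a rough upper bound.
>     import os
> 
>     SERVICES = ['neutron', 'nova', 'swift', 'cinder', 'keystone', 'glance']
> 
>     def rss_kb(pid):
>         try:
>             with open('/proc/%s/status' % pid) as f:
>                 for line in f:
>                     if line.startswith('VmRSS:'):
>                         return int(line.split()[1])  # reported in kB
>         except IOError:
>             pass
>         return 0
> 
>     totals = dict.fromkeys(SERVICES, 0)
>     for pid in filter(str.isdigit, os.listdir('/proc')):
>         try:
>             with open('/proc/%s/cmdline' % pid) as f:
>                 cmdline = f.read().replace('\0', ' ')
>         except IOError:
>             continue
>         for svc in SERVICES:
>             if svc in cmdline:
>                 totals[svc] += rss_kb(pid)
>                 break
> 
>     for svc, kb in sorted(totals.items()):
>         print('%-10s %7.0f MB' % (svc, kb / 1024.0))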
> 
> Since [1] has been observed only since ocata [2], I think it's reasonable
> to assume that the memory increase may well be a determining factor in the
> oom-kills we see in the gate.
> 
> Profiling and surgically reducing the memory used by each component in each
> service is a lengthy process, but I'd rather see some gate relief right
> away. Reducing the number of API workers helps bring the RSS memory back
> down to mitaka levels:
> 
>    - neutron: 1.54GB
>    - nova: 1.24GB
>    - swift: 694MB
>    - cinder: 778MB
>    - keystone: 891MB
>    - horizon: 24MB
>    - glance: 490MB
> 
> However, it may have other side effects, like longer execution times, or
> an increase in timeouts.
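> 
> For completeness, the knob involved here is the API worker count that
> devstack passes to the services. Assuming the standard API_WORKERS setting
> (the actual stop-gap patch is [4], which may differ in detail), the change
> amounts to a local.conf override along these lines:
> 
>     [[local|localrc]]
>     # Run a single API worker per service instead of the current gate
>     # default of two, trading peak memory for request-handling parallelism.
>     API_WORKERS=1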
> 
> Where do we go from here? I am not particularly fond of the stop-gap [4], but
> it is the one fix that most broadly addresses the memory increase we have
> experienced across the board.

So I have a couple of concerns with doing this. We're only running with 2
workers per api service now, and dropping it down to 1 means we have no more
memory headroom in the future. This feels like we're just delaying the
inevitable, maybe for a cycle or 2. When we first started hitting OOM issues a
couple of years ago we dropped from nprocs to nprocs/2. [5] Back then we were
also running more services per job, since it was back in the days of the
integrated release and all of those projects (ceilometer, heat, etc.) were
running. So in a little over 2 years the memory consumption of these 7 services
has grown to the point where it eats up what we freed by dropping a bunch of
extra services from the job and by cutting the worker count in half. If we do
this now we won't have any more room left for when things keep growing. I think
now is the time to start taking our memory footprint growth seriously and see
if we can get it under control.
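
To put rough numbers on the headroom argument (assuming an 8-vCPU gate node
purely for illustration; the exact flavor isn't from this thread):

    # Per-service API worker count at each step, on a hypothetical 8-vCPU node.
    nproc = 8
    steps = [
        ('original default (one worker per CPU)', nproc),
        ('after the earlier OOM round [5]', nproc // 2),
        ('current gate setting', 2),
        ('proposed stop-gap [4]', 1),
    ]
    for label, workers in steps:
        print('%-40s %d worker(s) per API service' % (label, workers))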

My second concern is the same as yours: the long term effects of this change
aren't exactly clear. With the limited sample size of the test patch[4] we can't
really say if it'll negatively affect run time or job success rates. I don't think
it should be too bad, tempest is only making 4 api requests at a time, and most of
the services should be able to handle that kinda load with a single worker. (I'd
hope)

This also brings up the question of whether the gate config is representative
of how we recommend running OpenStack, which is the same reason we try to use
default config values as much as possible in devstack. We definitely aren't
saying that running a single API worker per service is how anyone should deploy.

But, I'm not sure any of that is a blocker for moving forward with dropping down
to a single worker.

As an aside, I also just pushed up: https://review.openstack.org/#/c/428220/ to
see if that provides any useful info. I'm doubtful that it will be helpful,
because it's the combination of services running that's causing the issue. But it
doesn't really hurt to collect that.

-Matt Treinish

> [1] https://bugs.launchpad.net/neutron/+bug/1656386
> [2]
> http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22oom-killer%5C%22%20AND%20tags:syslog
> [3]
> http://logs.openstack.org/21/427921/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/82084c2/
> [4] https://review.openstack.org/#/c/427921

[5] http://blog.kortar.org/?p=127