[openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

Armando M. armamig at gmail.com
Thu Feb 2 00:24:54 UTC 2017


[TL;DR]: OpenStack services have steadily increased their memory
footprints. We need a concerted way to address the oom-kills experienced in
the openstack gate, as we may have reached a ceiling.

Now the longer version:

We have been experiencing some instability in the gate lately, for a
number of reasons. When everything adds up, it becomes rather difficult to
merge anything, and knowing that we are in feature freeze only adds to the
stress. One culprit was identified as [1].

We initially tried to increase the swappiness, but that didn't seem to
help. We then looked at the resident memory in use. Going back over the
past three releases, we noticed that the aggregated memory footprint of
some openstack projects has grown steadily. We have the following:

   - Mitaka
      - neutron: 1.40GB
      - nova: 1.70GB
      - swift: 640MB
      - cinder: 730MB
      - keystone: 760MB
      - horizon: 17MB
      - glance: 538MB
   - Newton
      - neutron: 1.59GB (+13%)
      - nova: 1.67GB (-1%)
      - swift: 779MB (+21%)
      - cinder: 878MB (+20%)
      - keystone: 919MB (+20%)
      - horizon: 21MB (+23%)
      - glance: 721MB (+34%)
   - Ocata
      - neutron: 1.75GB (+10%)
      - nova: 1.95GB (+16%)
      - swift: 703MB (-9%)
      - cinder: 920MB (+4%)
      - keystone: 903MB (-1%)
      - horizon: 25MB (+20%)
      - glance: 740MB (+2%)
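
For anyone wanting to reproduce this kind of per-service figure, here is a
minimal sketch of one way to aggregate RSS from `ps -eo rss,args` output.
It is not necessarily how the numbers above were collected: the service
list and the substring matching on the command line are assumptions, and
services running under a shared web server (keystone or horizon under
Apache, for instance) would need a different matching rule.

```python
from collections import defaultdict

# Hypothetical service-name substrings; adjust to the processes on your node.
SERVICES = ("neutron", "nova", "swift", "cinder", "keystone", "horizon", "glance")


def aggregate_rss(ps_output, services=SERVICES):
    """Sum RSS (KiB) per service from `ps -eo rss,args`-style text."""
    totals = defaultdict(int)
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip the header row and malformed lines
        rss_kib, args = int(parts[0]), parts[1]
        for name in services:
            if name in args:
                totals[name] += rss_kib
                break  # attribute each process to one service only
    return dict(totals)


# Made-up sample input, only to show the expected shape of the text.
sample = """\
  RSS COMMAND
102400 /usr/bin/python /usr/local/bin/neutron-server --config-file /etc/neutron/neutron.conf
 51200 /usr/bin/python /usr/local/bin/neutron-dhcp-agent
204800 /usr/bin/python /usr/local/bin/nova-api
"""

for service, rss_kib in sorted(aggregate_rss(sample).items()):
    print(f"{service}: {rss_kib / 1024:.0f}MB")
```

Summing RSS this way double-counts pages shared between forked workers, so
the totals are an upper bound rather than an exact footprint.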

Numbers are approximate and I only took a couple of samples, but in a
nutshell, the majority of the services have seen double-digit growth over
the past two cycles in the amount of RSS memory they use.
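
The growth over the two cycles can be recomputed from the figures in the
list above. The sketch below is only illustrative arithmetic: it treats
1GB as 1024MB, which is an assumption, since the convention used for the
samples is not stated.

```python
GB = 1024  # assumption: 1GB == 1024MB

# Figures copied from the list above, expressed in MB.
FOOTPRINT_MB = {
    "mitaka": {"neutron": 1.40 * GB, "nova": 1.70 * GB, "swift": 640,
               "cinder": 730, "keystone": 760, "horizon": 17, "glance": 538},
    "newton": {"neutron": 1.59 * GB, "nova": 1.67 * GB, "swift": 779,
               "cinder": 878, "keystone": 919, "horizon": 21, "glance": 721},
    "ocata": {"neutron": 1.75 * GB, "nova": 1.95 * GB, "swift": 703,
              "cinder": 920, "keystone": 903, "horizon": 25, "glance": 740},
}


def growth_pct(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def growth_table(base, target):
    """Per-service growth between two releases, rounded to one decimal."""
    return {
        service: round(growth_pct(FOOTPRINT_MB[base][service],
                                  FOOTPRINT_MB[target][service]), 1)
        for service in FOOTPRINT_MB[base]
    }


for service, pct in growth_table("mitaka", "ocata").items():
    print(f"{service:9s} {pct:+6.1f}%")
```

With these figures, six of the seven services grow by more than 10% from
Mitaka to Ocata; swift is the exception at just under 10%.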

Since [1] has been observed only since ocata [2], I think it's reasonable
to assume that the memory increase may well be a determining factor in the
oom-kills we see in the gate.

Profiling and surgically reducing the memory used by each component in each
service is a lengthy process, but I'd rather see some gate relief right
away. Reducing the number of API workers helps bring the RSS memory back
down to mitaka levels:

   - neutron: 1.54GB
   - nova: 1.24GB
   - swift: 694MB
   - cinder: 778MB
   - keystone: 891MB
   - horizon: 24MB
   - glance: 490MB

However, it may have other side effects, like longer execution times or an
increase in timeouts.
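
As a rough sanity check of the "back to mitaka levels" observation, the
aggregate footprints can be compared directly. This is a sketch based only
on the figures quoted in this message, again assuming 1GB is 1024MB.

```python
GB = 1024  # assumption: 1GB == 1024MB

# Figures copied from this message, expressed in MB.
MITAKA = {"neutron": 1.40 * GB, "nova": 1.70 * GB, "swift": 640,
          "cinder": 730, "keystone": 760, "horizon": 17, "glance": 538}
OCATA = {"neutron": 1.75 * GB, "nova": 1.95 * GB, "swift": 703,
         "cinder": 920, "keystone": 903, "horizon": 25, "glance": 740}
FEWER_WORKERS = {"neutron": 1.54 * GB, "nova": 1.24 * GB, "swift": 694,
                 "cinder": 778, "keystone": 891, "horizon": 24, "glance": 490}


def total_mb(footprint):
    """Aggregate footprint across all services, in MB."""
    return sum(footprint.values())


for label, footprint in (("mitaka", MITAKA),
                         ("ocata", OCATA),
                         ("fewer workers", FEWER_WORKERS)):
    print(f"{label:14s} {total_mb(footprint):7.0f}MB")

saved = total_mb(OCATA) - total_mb(FEWER_WORKERS)
print(f"saved vs ocata {saved:7.0f}MB ({saved / total_mb(OCATA):.0%})")
```

In aggregate this gives about 5724MB with fewer workers, against about
5859MB for Mitaka and 7080MB for Ocata. Most of the saving comes from nova
and glance; the other services stay above their Mitaka values.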

Where do we go from here? I am not particularly fond of stop-gap [4], but
it is the one fix that most widely addresses the memory increase we have
experienced across the board.


[1] https://bugs.launchpad.net/neutron/+bug/1656386
[4] https://review.openstack.org/#/c/427921
