[openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

Matthew Treinish mtreinish at kortar.org
Thu Feb 2 15:44:38 UTC 2017


On Thu, Feb 02, 2017 at 04:27:51AM +0000, Dolph Mathews wrote:
> What made most services jump +20% between mitaka and newton? Maybe there is
> a common cause that we can tackle.

Yeah, I'm curious about this too; there seems to be a big jump in Newton for
most of the projects. It might not be a single common cause between them, but
I'd be curious to know what's going on there.

> 
> I'd also be in favor of reducing the number of workers in the gate,
> assuming that doesn't also substantially increase the runtime of gate jobs.
> Does that environment variable (API_WORKERS) affect keystone and horizon?

It affects keystone in certain deploy modes (only standalone uwsgi I think,
which means not for most jobs); if it's running under apache we rely on apache
to handle things, which is why this doesn't work for horizon.

API_WORKERS was the interface we added to devstack after we started having OOM
issues the first time around (roughly 2 years ago). Back then we were running
the service defaults, which in most cases meant nproc for the number of
workers. API_WORKERS was added as a global flag to set that to something else
for all the services. Right now it defaults to nproc/4, as long as that's >= 2:

https://github.com/openstack-dev/devstack/blob/master/stackrc#L714

which basically means in the gate right now we're only running with 2 API
workers per service. It's just that the per-worker memory footprint of a lot
of the services has grown.
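
Just to spell out that nproc/4 default as a rough sketch (Python illustration
of the logic, not the actual stackrc shell code):

    import os

    def default_api_workers():
        # nproc/4, floored at 2, per the devstack default described above.
        nproc = os.cpu_count() or 1
        return max(2, nproc // 4)

    # e.g. on an 8-vCPU gate node this comes out to 2 workers per service
    print(default_api_workers())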

-Matt Treinish

> 
> On Wed, Feb 1, 2017 at 6:39 PM Kevin Benton <kevin at benton.pub> wrote:
> 
> > And who said openstack wasn't growing? ;)
> >
> > I think reducing API workers is a nice quick way to bring back some
> > stability.
> >
> > I have spent a bunch of time digging into the OOM killer events and
> > haven't yet figured out why they are being triggered. There is significant
> > swap space remaining in all of the cases I have seen, so it's likely some
> > memory locking issue or kernel allocations blocking swap. Until we can
> > figure out the cause, we effectively have no usable swap space on the test
> > instances so we are limited to 8GB.
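> >
> > A quick way to check how much swap is actually left when one of these
> > kills fires is something like this (rough sketch, just reads
> > /proc/meminfo; values are reported in kB):
> >
> >     def swap_status(path='/proc/meminfo'):
> >         # Parse /proc/meminfo and report swap headroom (values in kB).
> >         info = {}
> >         with open(path) as f:
> >             for line in f:
> >                 key, value = line.split(':', 1)
> >                 info[key] = int(value.strip().split()[0])
> >         return info['SwapTotal'], info['SwapFree']
> >
> >     total, free = swap_status()
> >     print('swap: %d kB free of %d kB total' % (free, total))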
> >
> > On Feb 1, 2017 17:27, "Armando M." <armamig at gmail.com> wrote:
> >
> > Hi,
> >
> > [TL;DR]: OpenStack services have steadily increased their memory
> > footprints. We need a concerted way to address the oom-kills experienced in
> > the openstack gate, as we may have reached a ceiling.
> >
> > Now the longer version:
> > --------------------------------
> >
> > We have been experiencing some instability in the gate lately for a
> > number of reasons. When everything adds up, it becomes rather difficult
> > to merge anything, and knowing we're in feature freeze, that adds to the
> > stress. One culprit was identified to be [1].
> >
> > We initially tried to increase the swappiness, but that didn't seem to
> > help. Then we have looked at the resident memory in use. When going back
> > over the past three releases we have noticed that the aggregated memory
> > footprint of some openstack projects has grown steadily. We have the
> > following:
> >
> >    - Mitaka
> >       - neutron: 1.40GB
> >       - nova: 1.70GB
> >       - swift: 640MB
> >       - cinder: 730MB
> >       - keystone: 760MB
> >       - horizon: 17MB
> >       - glance: 538MB
> >    - Newton
> >       - neutron: 1.59GB (+13%)
> >       - nova: 1.67GB (-1%)
> >       - swift: 779MB (+21%)
> >       - cinder: 878MB (+20%)
> >       - keystone: 919MB (+20%)
> >       - horizon: 21MB (+23%)
> >       - glance: 721MB (+34%)
> >    - Ocata
> >       - neutron: 1.75GB (+10%)
> >       - nova: 1.95GB (+16%)
> >       - swift: 703MB (-9%)
> >       - cinder: 920MB (+4%)
> >       - keystone: 903MB (-1%)
> >       - horizon: 25MB (+20%)
> >       - glance: 740MB (+2%)
> >
> > Numbers are approximate and I only took a couple of samples, but in a
> > nutshell, the majority of the services have seen double-digit growth over
> > the past two cycles in terms of the amount of RSS memory they use.
> >
> > Since [1] has been observed only since ocata [2], I imagine it's pretty
> > reasonable to assume that the memory increase may well be a determining
> > factor in the oom-kills we see in the gate.
> >
> > Profiling and surgically reducing the memory used by each component in
> > each service is a lengthy process, but I'd rather see some gate relief
> > right away. Reducing the number of API workers helps bring the RSS memory
> > back down to mitaka levels:
> >
> >    - neutron: 1.54GB
> >    - nova: 1.24GB
> >    - swift: 694MB
> >    - cinder: 778MB
> >    - keystone: 891MB
> >    - horizon: 24MB
> >    - glance: 490MB
> >
> > However, it may have other side effects, like longer execution times or
> > an increase in timeouts.
> >
> > Where do we go from here? I am not particularly fond of the stop-gap [4],
> > but it is the one fix that most widely addresses the memory increase we
> > have experienced across the board.
> >
> > Thanks,
> > Armando
> >
> > [1] https://bugs.launchpad.net/neutron/+bug/1656386
> > [2]
> > http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22oom-killer%5C%22%20AND%20tags:syslog
> > [3]
> > http://logs.openstack.org/21/427921/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/82084c2/
> > [4] https://review.openstack.org/#/c/427921
> >
> > __________________________________________________________________________
> > OpenStack Development Mailing List (not for usage questions)
> > Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
> >
> -- 
> -Dolph
