[openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate
Miguel Angel Ajo Pelayo
majopela at redhat.com
Fri Feb 3 10:12:04 UTC 2017
On Fri, Feb 3, 2017 at 7:55 AM, IWAMOTO Toshihiro <iwamoto at valinux.co.jp>
wrote:
> At Wed, 1 Feb 2017 16:24:54 -0800,
> Armando M. wrote:
> > Hi,
> > [TL;DR]: OpenStack services have steadily increased their memory
> > footprints. We need a concerted way to address the oom-kills experienced
> > in the openstack gate, as we may have reached a ceiling.
> > Now the longer version:
> > --------------------------------
> > We have been experiencing some instability in the gate lately due to a
> > number of reasons. When everything adds up, this means it's rather
> > difficult to merge anything and, knowing we're in feature freeze, that
> > adds to stress. One culprit was identified to be .
> > We initially tried to increase the swappiness, but that didn't seem to
> > help. Then we looked at the resident memory in use. Going back
> > over the past three releases, we noticed that the aggregated memory
> > footprint of some openstack projects has grown steadily. We have the
> > following:
> Not sure if it is due to memory shortage, but VMs running CI jobs are
> experiencing sluggishness, which may be the cause of ovs-related
> timeouts. Tempest jobs run dstat to collect system info every
> second. When timeouts happen, dstat outputs are also often missing
> for several seconds, which means a VM is having trouble scheduling
> both the ovs-related processes and the dstat process.
> Those ovs timeouts affect every project and happen much more often than the
> Some details are on the lp bug page.
> The correlation between such sluggishness and VM paging activity is not
> clear. I wonder if VM hosts are under high load or if increasing VM
> memory would help. Those VMs have no free ram for file cache and file
> pages are read again and again, leading to extra IO loads on VM hosts
> and adversely affecting other VMs on the same host.
Iwamoto, that makes a lot of sense to me.
That makes me think that increasing the available RAM per instance could be
beneficial, even if we'd be able to run fewer workloads simultaneously.
Compute hosts would see their pressure reduced (since they would accommodate
fewer workloads), instances would run more smoothly because they'd have more
room for caching and buffers, and we may also see the OOM issues alleviated.
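
The dstat symptom Iwamoto describes (samples missing for several seconds
when one is expected every second) can be checked mechanically. Below is a
minimal sketch, assuming the sample timestamps have already been extracted
from the dstat output as epoch seconds; the function name, the tolerance
value and the sample data are hypothetical, and parsing dstat's own output
format is deliberately left out.

```python
from typing import List, Tuple


def find_sampling_gaps(
    timestamps: List[float],
    interval: float = 1.0,
    tolerance: float = 0.5,
) -> List[Tuple[float, float, int]]:
    """Return (last_seen, next_seen, missed_samples) for every gap.

    A gap is any pair of consecutive samples that are further apart than
    interval + tolerance seconds. `tolerance` is an assumed slack value,
    not something taken from dstat.
    """
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if delta > interval + tolerance:
            # Number of samples that should have appeared in between.
            missed = int(round(delta / interval)) - 1
            gaps.append((prev, cur, missed))
    return gaps


if __name__ == "__main__":
    # Hypothetical timestamps: samples are missing after 102 and after 107.
    samples = [100, 101, 102, 106, 107, 110]
    for last_seen, next_seen, missed in find_sampling_gaps(samples):
        print(f"no samples between {last_seen} and {next_seen} "
              f"({missed} missed)")
```

Lining such gaps up against the timestamps of the ovs timeouts would be one
way to see how often the two coincide; it would not by itself tell us
whether paging or host load is the cause.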
BUT, even if that's a suitable approach for all those problems, which could
very well be inter-related, it still means that we should keep looking for
the culprit of our memory footprint growth and take countermeasures where
reasonable.
Sometimes more RAM is just the cost of progress (new features, ability to
do online upgrades, better synchronisation patterns based on caching,
etc.); sometimes we'd be able to slash the memory usage by converting some
of our small, repeated services into other things (I'm thinking of the
neutron-ns-metadata proxy being converted to haproxy or nginx + a neat
piece of config).
So, would it be realistic to bump the flavors' RAM to favor our stability in
the short term? (Considering that our clouds will be able to take fewer
workloads, but the failure rate will also be lower, so rechecks will be
reduced.)
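
To make that trade-off concrete, here is a back-of-the-envelope sketch. It
assumes a deliberately crude model: every job holds one slot for the same
amount of time, and a failed job is simply rechecked. The function names
and all the numbers are hypothetical; none of them are measurements from
the gate.

```python
def passing_jobs_per_hour(
    concurrent_jobs: int,
    failure_rate: float,
    job_hours: float = 1.0,
) -> float:
    """Expected number of jobs that pass per hour under the crude model."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return concurrent_jobs * (1.0 - failure_rate) / job_hours


def break_even_failure_rate(
    current_jobs: int,
    current_failure_rate: float,
    reduced_jobs: int,
) -> float:
    """Failure rate at which `reduced_jobs` slots match today's throughput.

    A result <= 0 means the smaller pool cannot catch up even with no
    failures at all.
    """
    current_passes = current_jobs * (1.0 - current_failure_rate)
    return 1.0 - current_passes / reduced_jobs


if __name__ == "__main__":
    # Hypothetical: 100 slots at a 25% failure rate today, 80 slots after
    # bumping the flavor RAM.
    print(passing_jobs_per_hour(100, 0.25))
    print(break_even_failure_rate(100, 0.25, 80))
```

With these made-up numbers, bumping the RAM pays off only if the failure
rate drops below the break-even value printed by the second call. The
model ignores queueing and any change in job duration, so it is a
rule of thumb at best.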
>  http://logstash.openstack.org/#dashboard/file/logstash.json?
>  https://bugs.launchpad.net/neutron/+bug/1627106/comments/14
> IWAMOTO Toshihiro