[openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

Andrea Frittoli andrea.frittoli at gmail.com
Wed Feb 15 13:21:16 UTC 2017


Some (new?) data on the oom kill issue in the gate.

I filed a new bug and a new E-R query for the issue [1][2], since it looks to
me like the issue is not specific to mysqld - the oom-killer simply picks the
best candidate, which in most cases happens to be mysqld. The next most likely
candidate to show errors in the logs is keystone, since token requests are
rather frequent - probably more frequent than any other API call.
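
For reference, the query in [2] boils down to matching the oom-killer line in
syslog (the same search as [3]). A minimal sketch of running that search
programmatically with elasticsearch-py follows - the endpoint, index pattern
and hit field names are assumptions for illustration, not the actual
logstash.openstack.org setup:

    # Sketch: run the same "Out of memory" search as [3] against an
    # Elasticsearch endpoint. Host, index pattern and field names are
    # illustrative assumptions.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])
    result = es.search(
        index="logstash-*",
        q='message:"Out of memory: Kill process" AND tags:"syslog.txt"',
        size=100,
    )
    for hit in result["hits"]["hits"]:
        src = hit["_source"]
        # Show which provider the node came from, if that field is indexed.
        print(src.get("node_provider", "unknown"), src.get("message", "")[:80])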

According to logstash [3], all failures identified by [2] happen on RAX
nodes, which I hadn't realised before.

Comparing dstat data between the failed run [0] and a successful one on an
OVH node [4], the main difference I can spot is free memory.
For the same test job, free memory tends to be much lower on the RAX node -
quite close to zero for the majority of the run. My guess is that an unlucky
scheduling of tests may cause a slightly higher peak in memory usage and
trigger the oom-kill.
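
Just to make the comparison reproducible, here is a minimal sketch (not the
exact tooling used here) for pulling the free-memory column out of a dstat
CSV log like [4]; it assumes the usual dstat CSV layout, with a "memory
usage" group whose column names include "free" and values in bytes:

    # Sketch: report minimum and average free memory from a dstat CSV log.
    # Assumes the usual dstat CSV layout: a few banner rows, a row of group
    # headers ("memory usage", ...), then a row of column names ("used",
    # "buff", "cach", "free", ...) with values in bytes. If several groups
    # expose a "free" column, the first one is assumed to be memory.
    import csv
    import gzip
    import sys

    def free_memory_series(path):
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", newline="") as f:
            rows = list(csv.reader(f))
        for i, row in enumerate(rows):
            if "free" in row:                  # the column-name row
                free_idx = row.index("free")
                data_rows = rows[i + 1:]
                break
        else:
            raise ValueError("no 'free' column found in %s" % path)
        series = []
        for row in data_rows:
            try:
                series.append(float(row[free_idx]))
            except (IndexError, ValueError):
                continue                       # skip repeated headers / stubs
        return series

    if __name__ == "__main__":
        series = free_memory_series(sys.argv[1])
        mib = 1024.0 * 1024.0
        print("min free: %.0f MiB" % (min(series) / mib))
        print("avg free: %.0f MiB" % (sum(series) / len(series) / mib))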

I find it hard to relate lower free memory to a specific cloud provider /
underlying virtualisation technology, but maybe someone has an idea of how
that could happen?

Andrea

[0]
http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/6f31320/logs/syslog.txt.gz#_Feb_14_00_32_28

[1] https://bugs.launchpad.net/tempest/+bug/1664953
[2] https://review.openstack.org/434238
[3]
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Out%20of%20memory%3A%20Kill%20process%5C%22%20AND%20tags%3A%5C%22syslog.txt%5C%22

[4]
http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/1dfb4b7/logs/dstat-csv_log.txt.gz


On Mon, Feb 6, 2017 at 10:13 AM Miguel Angel Ajo Pelayo <majopela at redhat.com>
wrote:

Jeremy Stanley wrote:


> It's an option of last resort, I think. The next consistent flavor
> up in most of the providers donating resources is double the one
> we're using (which is a fairly typical pattern in public clouds). As
> aggregate memory constraints are our primary quota limit, this would
> effectively halve our current job capacity.

If properly coordinated with all the cloud providers, they could create
flavours which are private but available to our tenants, where 25-50% more
RAM would be just enough.

I agree that should probably be a last resort tool, and we should keep
looking for proper ways to find where we consume unnecessary RAM and make
sure that's properly freed up.

It could be interesting to coordinate such flavour creation in the meantime:
even if we don't use it now, we could eventually test it or put it to work if
we find ourselves trapped later.
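
(To make the idea concrete: a rough sketch of what creating such a private
flavour and granting one tenant access to it could look like with
keystoneauth1 + python-novaclient is below. Every name, size, credential and
project ID in it is a made-up placeholder, not an actual infra value.)

    # Sketch: create a private flavour with ~25% more RAM and grant a single
    # project access to it. All names, sizes, credentials and IDs below are
    # illustrative placeholders.
    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from novaclient import client

    auth = v3.Password(
        auth_url="https://example.cloud:5000/v3",
        username="admin", password="secret", project_name="admin",
        user_domain_name="Default", project_domain_name="Default",
    )
    nova = client.Client("2.1", session=session.Session(auth=auth))

    flavor = nova.flavors.create(
        name="ci-10gb",      # hypothetical flavour name
        ram=10240,           # ~25% more than the usual 8GB flavour
        vcpus=8,
        disk=80,
        is_public=False,     # private: visible only to projects granted access
    )
    nova.flavor_access.add_tenant_access(flavor, "ci-project-id")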


On Sun, Feb 5, 2017 at 8:37 PM, Matt Riedemann <mriedemos at gmail.com> wrote:

On 2/5/2017 1:19 PM, Clint Byrum wrote:


Also I wonder if there's ever been any serious consideration given to
switching to protobuf? Feels like one could make oslo.versionedobjects
a wrapper around protobuf relatively easily, but perhaps that's already
been explored in a forum that I wasn't paying attention to.


I've never heard of anyone attempting that.

-- 

Thanks,

Matt Riedemann


__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



