[openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

Ihar Hrachyshka ihrachys at redhat.com
Wed Feb 15 21:00:08 UTC 2017


Another potentially interesting devstack service that may help us to
understand our memory usage is peakmem_tracker. At this point, it's
not enabled anywhere. I proposed a devstack-gate patch to enable it at:
https://review.openstack.org/#/c/434511/
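
For reference, the rough idea behind that kind of tracker is just to poll
/proc/meminfo and record the low-water mark of available memory. A minimal
sketch of such a sampler (not the actual peakmem_tracker code, which may
differ) could look like:

    # sketch: track the low-water mark of MemAvailable over time
    import time

    def meminfo():
        """Return /proc/meminfo as a dict of field name -> kB value."""
        fields = {}
        with open('/proc/meminfo') as f:
            for line in f:
                name, value = line.split(':', 1)
                fields[name.strip()] = int(value.split()[0])
        return fields

    low_water = None
    while True:
        avail = meminfo().get('MemAvailable', 0)
        if low_water is None or avail < low_water:
            low_water = avail
            print('new low-water mark: %d kB available' % low_water)
        time.sleep(10)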

On Wed, Feb 15, 2017 at 12:38 PM, Ihar Hrachyshka <ihrachys at redhat.com> wrote:
> Another potentially relevant piece of information: we have seen before
> that the oom-killer is triggered while the 8 GB of swap are barely used.
> This behavior is hard to explain, since we set the kernel swappiness
> sysctl knob to 30:
>
> https://github.com/openstack-infra/devstack-gate/blob/master/functions.sh#L432
>
> (any value above 0 means that if memory is requested and there is swap
> available to fulfill it, the allocation will not fail; swappiness only
> controls the kernel's willingness to swap process pages instead of
> dropping disk cache entries, so it may affect performance, but it should
> not affect malloc behavior).
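>
> To double-check that on a live node, it is enough to dump the swappiness
> knob together with the swap counters from procfs; a quick sketch
> (standard procfs paths, nothing devstack-specific):
>
>     def swap_snapshot():
>         """Return the swappiness knob and current swap counters (kB)."""
>         with open('/proc/sys/vm/swappiness') as f:
>             swappiness = int(f.read())
>         swap = {}
>         with open('/proc/meminfo') as f:
>             for line in f:
>                 if line.startswith(('SwapTotal:', 'SwapFree:')):
>                     name, value = line.split(':', 1)
>                     swap[name] = int(value.split()[0])
>         return swappiness, swap
>
>     print(swap_snapshot())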
>
> The only reason I can think of for a memory allocation request to
> trigger the oom-killer while swap is free is when the request is for a
> RAM-locked page (either memory locked with mlock(2), or mapped with
> mmap(2) using MAP_LOCKED). To understand whether that's the case in the
> gate, I am adding a new mlock_tracker service to devstack:
> https://review.openstack.org/#/c/434470/
>
> The patch that enables the service in Pike+ gate is:
> https://review.openstack.org/#/c/434474/
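>
> Until that service is in place, a simple way to spot locked pages is to
> walk /proc and look at the VmLck field of each process; a rough sketch
> of that idea (not the actual mlock_tracker code):
>
>     import os
>
>     def locked_memory():
>         """Yield (pid, kB) for processes holding mlock()ed memory."""
>         for pid in filter(str.isdigit, os.listdir('/proc')):
>             try:
>                 with open('/proc/%s/status' % pid) as f:
>                     for line in f:
>                         if line.startswith('VmLck:'):
>                             kb = int(line.split()[1])
>                             if kb:
>                                 yield pid, kb
>             except IOError:  # the process exited while we were reading
>                 continue
>
>     for pid, kb in locked_memory():
>         print('pid %s holds %d kB of locked memory' % (pid, kb))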
>
> Thanks,
> Ihar
>
> On Wed, Feb 15, 2017 at 5:21 AM, Andrea Frittoli
> <andrea.frittoli at gmail.com> wrote:
>> Some (new?) data on the oom kill issue in the gate.
>>
>> I filed a new bug / E-R query for the issue [1][2], since it looks to me
>> like the issue is not specific to mysqld - the oom-killer just picks the
>> best candidate, which in most cases happens to be mysqld. The next most
>> likely candidate to show errors in the logs is keystone, since token
>> requests are probably more frequent than any other API call.
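>>
>> One way to see which process the kernel would pick next is to dump the
>> per-process badness scores on a node under memory pressure; a quick
>> sketch (plain procfs, nothing job-specific):
>>
>>     import os
>>
>>     def oom_candidates(top=5):
>>         """Return the top oom-kill candidates by current oom_score."""
>>         scores = []
>>         for pid in filter(str.isdigit, os.listdir('/proc')):
>>             try:
>>                 with open('/proc/%s/oom_score' % pid) as f:
>>                     score = int(f.read())
>>                 with open('/proc/%s/comm' % pid) as f:
>>                     comm = f.read().strip()
>>             except IOError:
>>                 continue
>>             scores.append((score, pid, comm))
>>         return sorted(scores, reverse=True)[:top]
>>
>>     for score, pid, comm in oom_candidates():
>>         print('%s (pid %s): oom_score %d' % (comm, pid, score))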
>>
>> According to logstash [3], all failures identified by [2] happen on RAX
>> nodes, which I hadn't realised before.
>>
>> Comparing dstat data between the failed run [0] and a successful one on
>> an OVH node [4], the main difference I can spot is free memory.
>> For the same test job, free memory tends to be much lower on the RAX
>> node, quite close to zero for the majority of the time. My guess is that
>> an unlucky scheduling of tests may cause a slightly higher peak in
>> memory usage and trigger the oom-kill.
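>>
>> For anyone who wants to repeat the comparison, the free-memory minimum
>> can be pulled straight out of the dstat CSV logs; a rough sketch (the
>> column layout depends on the dstat flags the job uses, so the "free"
>> lookup below is an assumption):
>>
>>     import csv
>>
>>     def min_free_memory(path):
>>         """Return the lowest value in the 'free' column of a dstat CSV."""
>>         with open(path) as f:
>>             rows = list(csv.reader(f))
>>         # assumes one header row names the per-column fields, incl. 'free'
>>         header = next(r for r in rows if 'free' in r)
>>         idx = header.index('free')
>>         values = []
>>         for row in rows[rows.index(header) + 1:]:
>>             try:
>>                 values.append(float(row[idx]))
>>             except (ValueError, IndexError):
>>                 continue
>>         return min(values) if values else None
>>
>>     print(min_free_memory('dstat-csv_log.txt'))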
>>
>> I find it hard to relate lower free memory to a specific cloud provider /
>> underlying virtualisation technology, but maybe someone has an idea about
>> how that could be?
>>
>> Andrea
>>
>> [0]
>> http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/6f31320/logs/syslog.txt.gz#_Feb_14_00_32_28
>> [1] https://bugs.launchpad.net/tempest/+bug/1664953
>> [2] https://review.openstack.org/434238
>> [3]
>> http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Out%20of%20memory%3A%20Kill%20process%5C%22%20AND%20tags%3A%5C%22syslog.txt%5C%22
>> [4]
>> http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/1dfb4b7/logs/dstat-csv_log.txt.gz
>>
>> On Mon, Feb 6, 2017 at 10:13 AM Miguel Angel Ajo Pelayo
>> <majopela at redhat.com> wrote:
>>>
>>> Jeremy Stanley wrote:
>>>
>>>
>>> > It's an option of last resort, I think. The next consistent flavor
>>> > up in most of the providers donating resources is double the one
>>> > we're using (which is a fairly typical pattern in public clouds). As
>>> > aggregate memory constraints are our primary quota limit, this would
>>> > effectively halve our current job capacity.
>>>
>>> Properly coordinated with all the cloud providers, they could create
>>> flavours which are private but available to our tenants, where 25-50%
>>> more RAM would be just enough.
>>>
>>> I agree that should probably be a last resort tool, and we should keep
>>> looking for proper ways to find where we consume unnecessary RAM and make
>>> sure that's properly freed up.
>>>
>>> It could be interesting to coordinate such flavour creation in the
>>> meantime; even if we don't use it now, we could eventually test it or
>>> put it to work if we find ourselves trapped anytime later.
>>>
>>>
>>> On Sun, Feb 5, 2017 at 8:37 PM, Matt Riedemann <mriedemos at gmail.com>
>>> wrote:
>>>>
>>>> On 2/5/2017 1:19 PM, Clint Byrum wrote:
>>>>>
>>>>>
>>>>> Also I wonder if there's ever been any serious consideration given to
>>>>> switching to protobuf? Feels like one could make oslo.versionedobjects
>>>>> a wrapper around protobuf relatively easily, but perhaps that's already
>>>>> been explored in a forum that I wasn't paying attention to.
>>>>
>>>>
>>>> I've never heard of anyone attempting that.
>>>>
>>>> --
>>>>
>>>> Thanks,
>>>>
>>>> Matt Riedemann
>>>>
>>>>
>>>>


