[OpenStack-Infra] Zuul memory leak

Mikhail Medvedev mihailmed at gmail.com
Mon Mar 7 23:17:44 UTC 2016


On Wed, Feb 10, 2016 at 10:57 AM, James E. Blair <corvus at inaugust.com> wrote:
> Michael Still <mikal at stillhq.com> writes:
>
>> On Tue, Feb 9, 2016 at 4:59 AM, Joshua Hesketh <joshua.hesketh at gmail.com>
>> wrote:
>>
>>> On Thu, Feb 4, 2016 at 2:44 AM, James E. Blair <corvus at inaugust.com>
>>> wrote:
>>>>
>>>> On the subject of clearing the cache more often, I think we may not want
>>>> to wipe out the cache more often than we do now -- in fact, I think we
>>>> may want to look into ways to keep from doing even that, because
>>>> whenever we reload now, Zuul slows down considerably as it has to query
>>>> Gerrit again for all of the data previously in its cache.
>>>>
>>>
>>> I can see a lot of 3rd parties or simpler CI's not needing to reload zuul
>>> very often so this cache would never get cleared. Perhaps cached objects
>>> should have an expiry time (of a day or so) and can be cleaned up
>>> periodically? Additionally if clearing the cache on a reload is causing
>>> pain maybe we should move the cache into the scheduler and keep it between
>>> reloads?
>>>
>>
>> Do you guys use oslo at all? I ask because the oslo memcache stuff does
>> exactly this, so it should be trivial to implement if you don't mind
>> depending on oslo.
>
> One of the main things we use the cache for is to ensure that every
> change is represented by a single Change object in Zuul's memory.  The
> graph of enqueued Items link to their respective Changes which may link
> to each other due to dependencies.  When something changes in Gerrit, we
> want that reflected immediately and consistently in all of the objects
> in that graph.  Using the cache means that every time we add a new
> Change object to that graph, we use the same object for a given change.
>
> This is why we can't use time-based expiry -- we must not drop objects
> from the cache if they are still in the graph.  Otherwise we will create
> new duplicative objects and the ones still in the graph will not be
> updated.
>
> Perhaps we should change these objects to something more ephemeral that
> can proxy for some other mechanism that can operate more like a
> traditional cache (with time-based expiry).  But I think changes to this
> system should happen in Zuulv3 -- it works well enough for Zuulv2 for
> now.
>
> -Jim
>
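
The constraint Jim describes (one shared object per change, never dropped
while the graph still points at it) can be illustrated with a small Python
sketch. This is not Zuul's code: the Change and ChangeCache classes below are
hypothetical stand-ins, and weak references are a swapped-in technique used
only to show the idea, relying on CPython's reference counting.

```python
import gc
import weakref


class Change:
    """Hypothetical stand-in for a Zuul Change; not Zuul's real class."""

    def __init__(self, number):
        self.number = number
        self.status = "NEW"
        self.needs = []  # other Change objects this one depends on


class ChangeCache:
    """Hands out exactly one Change object per change number.

    Entries are held weakly, so an entry goes away only once nothing
    else (for example the graph of enqueued items) still references it.
    """

    def __init__(self):
        self._changes = weakref.WeakValueDictionary()

    def get(self, number):
        change = self._changes.get(number)
        if change is None:
            change = Change(number)
            self._changes[number] = change
        return change

    def __len__(self):
        return len(self._changes)


cache = ChangeCache()
queue = [cache.get(101), cache.get(102)]  # the "graph" of enqueued items
queue[1].needs.append(cache.get(101))     # 102 depends on 101

# An update is visible through every reference, because all of them
# point at the same object.
cache.get(101).status = "MERGED"
same_object = queue[1].needs[0] is queue[0]
status_seen_by_dependent = queue[1].needs[0].status

# Once the graph lets go of the changes, the cache entries disappear.
queue.clear()
gc.collect()
remaining = len(cache)
```

A time-based expiry would break the first property: if change 101 were
evicted while still enqueued, the next lookup would build a second object and
the one in the graph would stop receiving updates.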

We are one of the third-party CIs and are running "Zuul version:
2.1.1.dev123", which is one commit after [1]. That extra commit is not in
tree; I am applying [2] on top.

The VM has 8GB of RAM. The zuul-server memory footprint goes up consistently
over the course of a week. Normally it takes about 3-4 days to get over 3GB.
About a week ago I witnessed zuul-server get to 95% of RAM, at which point
the kernel started killing other processes. The graph [3] shows memory usage,
and it reflects zuul-server consumption. The daily bumps on the graph are from
the daily cron jobs doing log rotation etc., possibly flushing caches.

I cannot say with 100% certainty that it is still the leak. It could simply
be that zuul-server requires more RAM now.
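
One way to tell a leak from a higher steady-state footprint would be to log
the resident set size of the zuul-server process at regular intervals and see
whether it ever levels off. A minimal sketch, assuming a Linux host with
/proc mounted; the helper names are made up for illustration and the sample
text is invented, not taken from our server:

```python
import os


def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like: "VmRSS:     123456 kB"
            return int(line.split()[1])
    return None


def read_rss_kb(pid):
    """Read the resident set size of a running process (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        return parse_vm_rss_kb(f.read())


# Invented sample in the format of /proc/<pid>/status, not real data.
sample = "Name:\tzuul-server\nVmPeak:\t 4000000 kB\nVmRSS:\t 3145728 kB\n"
sample_rss_kb = parse_vm_rss_kb(sample)

# On Linux, the same helper works on a live process; here, this script itself.
if os.path.exists("/proc/self/status"):
    own_rss_kb = read_rss_kb(os.getpid())
```

Calling read_rss_kb() from cron with the zuul-server pid and appending the
result to a file would give a per-process curve to compare against the
whole-VM graph.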

[1] https://review.openstack.org/#q,I81ee47524cda71a500c55a95a2280f491b1b63d9,n,z
[2] https://review.openstack.org/#q,If3a418fa2d4993a149d454e02a9b26529e4b6825,n,z
[3] http://imgur.com/SzqSA1H

Mikhail Medvedev (mmedvede)


