[OpenStack-Infra] Zuul memory leak

Mikhail Medvedev mihailmed at gmail.com
Mon Mar 7 23:53:39 UTC 2016


Hi Josh,

On Mon, Mar 7, 2016 at 5:25 PM, Joshua Hesketh <joshua.hesketh at gmail.com> wrote:
> Hi Mikhail,
>
> Thank you for the extra details. I'll continue to look into this.
>
> With the daily bumps when you do the log rotation, I assume you aren't
> reloading zuul at that point and the freed memory is likely due to another
> process?

I was puzzled by the bumps, so I checked the syslog. They are definitely due
to "run-parts --report /etc/cron.daily" being triggered at 06:25, and not to
zuul reloads. Any of the daily cron jobs could be responsible for the freed
memory; logrotate seems the most likely.
For the record:

root@zuul:~# ls /etc/cron.daily
apache2  apport  apt  aptitude  bsdmainutils  dpkg  exim4-base
logrotate  man-db  mlocate  ntp  passwd  update-notifier-common
upstart

I have also confirmed there were no changes to the zuul layout over the
interval shown in the graph.
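
In case anyone wants to collect comparable numbers on their own zuul, a
trivial sampler along these lines is enough to chart the zuul-server resident
set size over time. This is only a rough sketch (it assumes psutil is
installed; the process match and the one-minute interval are arbitrary, and it
is not necessarily how the graph was produced):

import time

import psutil

# Sample the RSS of the zuul-server process once a minute and print
# "timestamp megabytes" lines that any graphing tool can consume.
while True:
    for proc in psutil.process_iter():
        try:
            if 'zuul-server' not in ' '.join(proc.cmdline()):
                continue
            rss_mb = proc.memory_info().rss // (1024 * 1024)
        except psutil.Error:
            continue
        print('%s %d' % (time.strftime('%Y-%m-%dT%H:%M:%S'), rss_mb))
    time.sleep(60)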

>
> Cheers,
> Josh
>
> On Tue, Mar 8, 2016 at 10:17 AM, Mikhail Medvedev <mihailmed at gmail.com>
> wrote:
>>
>> On Wed, Feb 10, 2016 at 10:57 AM, James E. Blair <corvus at inaugust.com>
>> wrote:
>> > Michael Still <mikal at stillhq.com> writes:
>> >
>> >> On Tue, Feb 9, 2016 at 4:59 AM, Joshua Hesketh
>> >> <joshua.hesketh at gmail.com>
>> >> wrote:
>> >>
>> >>> On Thu, Feb 4, 2016 at 2:44 AM, James E. Blair <corvus at inaugust.com>
>> >>> wrote:
>> >>>>
>> >>>> On the subject of clearing the cache more often, I think we may not
>> >>>> want
>> >>>> to wipe out the cache more often than we do now -- in fact, I think
>> >>>> we
>> >>>> may want to look into ways to keep from doing even that, because
>> >>>> whenever we reload now, Zuul slows down considerably as it has to
>> >>>> query
>> >>>> Gerrit again for all of the data previously in its cache.
>> >>>>
>> >>>
>> >>> I can see a lot of 3rd parties or simpler CI's not needing to reload
>> >>> zuul
>> >>> very often so this cache would never get cleared. Perhaps cached
>> >>> objects
>> >>> should have an expiry time (of a day or so) and can be cleaned up
>> >>> periodically? Additionally if clearing the cache on a reload is
>> >>> causing
>> >>> pain maybe we should move the cache into the scheduler and keep it
>> >>> between
>> >>> reloads?
>> >>>
>> >>
>> >> Do you guys use oslo at all? I ask because the olso memcache stuff does
>> >> exactly this, so it should be trivial to implement if you don't mind
>> >> depending on oslo.
>> >
>> > One of the main things we use the cache for is to ensure that every
>> > change is represented by a single Change object in Zuul's memory.  The
>> > graph of enqueued Items link to their respective Changes which may link
>> > to each other due to dependencies.  When something changes in Gerrit, we
>> > want that reflected immediately and consistently in all of the objects
>> > in that graph.  Using the cache means that every time we add a new
>> > Change object to that graph, we use the same object for a given change.
>> >
>> > This is why we can't use time-based expiry -- we must not drop objects
>> > from the cache if they are still in the graph.  Otherwise we will create
>> > new duplicative objects and the ones still in the graph will not be
>> > updated.
>> >
>> > Perhaps we should change these objects to something more ephemeral that
>> > can proxy for some other mechanism that can operate more like a
>> > traditional cache (with time-based expiry).  But I think changes to this
>> > system should happen in Zuulv3 -- it works well enough for Zuulv2 for
>> > now.
>> >
>> > -Jim
>> >
>>
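
As an aside, the single-object requirement Jim describes above, combined with
Josh's wish for a cache that cleans itself up without reloads, sounds close to
what a weak-reference keyed mapping gives for free: an entry stays alive
exactly as long as something in the graph of enqueued items still references
it, and silently disappears afterwards, so no time-based expiry is needed. A
minimal sketch of that idea in Python -- not actual Zuul code, the class and
key format here are made up:

import weakref


class Change(object):
    def __init__(self, key):
        self.key = key
        self.data = {}   # whatever Gerrit last told us about the change


class ChangeCache(object):
    """Hand out a single shared Change object per key, without pinning it.

    While any enqueued item still holds a strong reference to a Change,
    get() keeps returning that same object, so Gerrit updates are seen
    consistently by everything in the graph.  Once nothing references it,
    the entry drops out of the WeakValueDictionary on its own.
    """

    def __init__(self):
        self._cache = weakref.WeakValueDictionary()

    def get(self, key):
        change = self._cache.get(key)
        if change is None:
            change = Change(key)
            self._cache[key] = change
        return change


cache = ChangeCache()
a = cache.get('12345,1')
b = cache.get('12345,1')
assert a is b    # one object per change, as the identity requirement asks
del a, b         # once the graph lets go, the entry can vanish on its own

Whether something like that fits the Zuulv3 plans better than a plain expiry
time, I do not know, but it would avoid both the duplicate-object problem and
unbounded growth between reloads.
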
>> We are one of the third-party CIs, running "Zuul version: 2.1.1.dev123",
>> which is one commit past [1]. That extra commit is not in the tree - I am
>> applying [2] on top.
>>
>> The VM has 8 GB of RAM. The zuul-server memory footprint grows consistently
>> over the course of a week; it normally takes about 3-4 days to exceed 3 GB.
>> About a week ago I watched zuul-server reach 95% of RAM, at which point the
>> kernel started killing other processes. The graph at [3] shows memory usage,
>> and it reflects zuul-server consumption. The daily bumps on the graph are the
>> daily cron doing log rotation etc., possibly flushing caches.
>>
>> I cannot say with 100% certainty that this is still the leak; it could
>> simply be that zuul-server requires more RAM now.
>>
>> [1]
>> https://review.openstack.org/#q,I81ee47524cda71a500c55a95a2280f491b1b63d9,n,z
>> [2]
>> https://review.openstack.org/#q,If3a418fa2d4993a149d454e02a9b26529e4b6825,n,z
>> [3] http://imgur.com/SzqSA1H
>>
>> Mikhail Medvedev (mmedvede)
>>
>> _______________________________________________
>> OpenStack-Infra mailing list
>> OpenStack-Infra at lists.openstack.org
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra
>
>


