[OpenStack-Infra] Zuul memory leak
Joshua Hesketh
joshua.hesketh at gmail.com
Tue Mar 8 00:04:19 UTC 2016
Hi Mikhail,
Okay thanks, that's helpful.
You mentioned that you might try restarting zuul periodically to see if
that helps. Perhaps instead you could do a reload (or HUP) first to see if
that clears the cache and alleviates the issue for you?
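For reference, a minimal sketch of what that reload amounts to, assuming zuul-server was started with a PID file (the path below is an assumption; adjust it to your deployment):

import os
import signal

# Assumed PID file location -- check your init script or deployment for the
# actual path used when zuul-server was started.
PIDFILE = "/var/run/zuul/zuul.pid"

with open(PIDFILE) as f:
    pid = int(f.read().strip())

# zuul-server treats SIGHUP as a reload request, re-reading the layout
# without a full restart.
os.kill(pid, signal.SIGHUP)

If the footprint drops back down after the HUP, that would point at the cache rather than a leak somewhere else.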
Cheers,
Josh
On Tue, Mar 8, 2016 at 10:53 AM, Mikhail Medvedev <mihailmed at gmail.com>
wrote:
> Hi Josh,
>
> On Mon, Mar 7, 2016 at 5:25 PM, Joshua Hesketh <joshua.hesketh at gmail.com>
> wrote:
> > Hi Mikhail,
> >
> > Thank you for the extra details. I'll continue to look into this.
> >
> > With the daily bumps when you do the log rotation, I assume you aren't
> > reloading zuul at that point and the freed memory is likely due to
> > another process?
>
> I was puzzled by the bumps, and checked the syslog. They are definitely due
> to "run-parts --report /etc/cron.daily" being triggered at 06:25, and not
> zuul reloads. The memory bumps could be due to any of the cron jobs.
> logrotate seemed likely.
> For the record:
>
> root@zuul:~# ls /etc/cron.daily
> apache2 apport apt aptitude bsdmainutils dpkg exim4-base
> logrotate man-db mlocate ntp passwd update-notifier-common
> upstart
>
> I have also confirmed there were no changes to zuul layout for the
> interval that the graph shows.
>
> >
> > Cheers,
> > Josh
> >
> > On Tue, Mar 8, 2016 at 10:17 AM, Mikhail Medvedev <mihailmed at gmail.com>
> > wrote:
> >>
> >> On Wed, Feb 10, 2016 at 10:57 AM, James E. Blair <corvus at inaugust.com>
> >> wrote:
> >> > Michael Still <mikal at stillhq.com> writes:
> >> >
> >> >> On Tue, Feb 9, 2016 at 4:59 AM, Joshua Hesketh
> >> >> <joshua.hesketh at gmail.com>
> >> >> wrote:
> >> >>
> >> >>> On Thu, Feb 4, 2016 at 2:44 AM, James E. Blair <corvus at inaugust.com>
> >> >>> wrote:
> >> >>>>
> >> >>>> On the subject of clearing the cache more often, I think we may not
> >> >>>> want to wipe out the cache more often than we do now -- in fact, I
> >> >>>> think we may want to look into ways to keep from doing even that,
> >> >>>> because whenever we reload now, Zuul slows down considerably as it
> >> >>>> has to query Gerrit again for all of the data previously in its cache.
> >> >>>>
> >> >>>
> >> >>> I can see a lot of 3rd parties or simpler CIs not needing to reload
> >> >>> zuul very often, so this cache would never get cleared. Perhaps cached
> >> >>> objects should have an expiry time (of a day or so) and can be cleaned
> >> >>> up periodically? Additionally, if clearing the cache on a reload is
> >> >>> causing pain, maybe we should move the cache into the scheduler and
> >> >>> keep it between reloads?
> >> >>>
> >> >>
> >> >> Do you guys use oslo at all? I ask because the oslo memcache stuff does
> >> >> exactly this, so it should be trivial to implement if you don't mind
> >> >> exactly this, so it should be trivial to implement if you don't mind
> >> >> depending on oslo.
> >> >
> >> > One of the main things we use the cache for is to ensure that every
> >> > change is represented by a single Change object in Zuul's memory. The
> >> > graph of enqueued Items link to their respective Changes which may link
> >> > to each other due to dependencies. When something changes in Gerrit, we
> >> > want that reflected immediately and consistently in all of the objects
> >> > in that graph. Using the cache means that every time we add a new
> >> > Change object to that graph, we use the same object for a given change.
> >> >
> >> > This is why we can't use time-based expiry -- we must not drop objects
> >> > from the cache if they are still in the graph. Otherwise we will create
> >> > new duplicative objects and the ones still in the graph will not be
> >> > new duplicative objects and the ones still in the graph will not be
> >> > updated.
> >> >
> >> > Perhaps we should change these objects to something more ephemeral that
> >> > can proxy for some other mechanism that can operate more like a
> >> > traditional cache (with time-based expiry). But I think changes to this
> >> > system should happen in Zuulv3 -- it works well enough for Zuulv2 for
> >> > now.
> >> >
> >> > -Jim
> >> >
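To make Jim's point above concrete, here is a minimal, hypothetical sketch of an identity-style cache (the class and method names are illustrative, not Zuul's actual code): every lookup for the same change returns the same object, which is exactly what a time-based expiry would break while that object is still linked from the enqueued-item graph.

class Change(object):
    def __init__(self, number, patchset):
        self.number = number
        self.patchset = patchset
        self.needed_by = []   # dependency links to other Changes in the graph


class ChangeCache(object):
    """Identity cache: one Change object per (number, patchset)."""

    def __init__(self):
        self._changes = {}

    def get_change(self, number, patchset):
        # Return the single shared object for this change, creating it on
        # first use so every queue item references the same instance.
        key = (number, patchset)
        if key not in self._changes:
            self._changes[key] = Change(number, patchset)
        return self._changes[key]


cache = ChangeCache()
a = cache.get_change(1234, 1)
b = cache.get_change(1234, 1)
assert a is b  # updates applied to one are visible everywhere in the graph

# A time-based expiry would evict the entry while "a" is still linked from an
# enqueued item; the next Gerrit event would then create a duplicate Change,
# and the stale object still in the graph would never be updated.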
> >>
> >> We are one of the third-party CIs and using "Zuul version: 2.1.1.dev123",
> >> which is one commit after [1]. That one extra commit is not in tree - I am
> >> applying [2] on top.
> >>
> >> The VM has 8GB of RAM. The zuul-server memory footprint goes up
> >> consistently over the course of a week; normally it takes about 3-4 days
> >> to get over 3GB. About a week ago I witnessed zuul-server get to 95% of
> >> RAM, at which point the kernel started killing other processes. The graph
> >> [3] shows memory usage, and it reflects zuul-server consumption. The daily
> >> bumps on the graph are the daily cron doing log rotation etc., possibly
> >> flushing caches.
> >>
> >> I cannot say 100% that it is still the leak. It could simply be that
> >> zuul-server requires more RAM now.
> >>
> >> [1] https://review.openstack.org/#q,I81ee47524cda71a500c55a95a2280f491b1b63d9,n,z
> >> [2] https://review.openstack.org/#q,If3a418fa2d4993a149d454e02a9b26529e4b6825,n,z
> >> [3] http://imgur.com/SzqSA1H
> >>
> >> Mikhail Medvedev (mmedvede)
> >>
> >
> >
>
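On the memory growth described above, one way to confirm that the footprint belongs to zuul-server itself (rather than another process freed by the daily cron) is to sample its resident set size over time. Below is a minimal, Linux-only sketch reading /proc; the process-name pattern is an assumption, so adjust it to how zuul-server appears in your process list.

import os
import re


def rss_kib(pid):
    # Read VmRSS (resident set size) from /proc/<pid>/status, in KiB.
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return None


def find_pids(pattern):
    # Return PIDs whose command line matches the given pattern.
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % entry, "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode("utf-8", "replace")
        except IOError:
            continue
        if re.search(pattern, cmdline):
            pids.append(int(entry))
    return pids


for pid in find_pids(r"zuul-server"):
    print(pid, rss_kib(pid), "KiB")

Running this from cron at the same interval as the graph would show whether the daily drops come from zuul-server or from something else on the box.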