[openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate
Armando M.
armamig at gmail.com
Thu Feb 2 21:35:32 UTC 2017
On 2 February 2017 at 12:50, Sean Dague <sean at dague.net> wrote:
> On 02/02/2017 03:32 PM, Armando M. wrote:
> >
> >
> > On 2 February 2017 at 12:19, Sean Dague <sean at dague.net> wrote:
> >
> > On 02/02/2017 02:28 PM, Armando M. wrote:
> > >
> > >
> > > On 2 February 2017 at 10:08, Sean Dague <sean at dague.net> wrote:
> > >
> > > On 02/02/2017 12:49 PM, Armando M. wrote:
> > > >
> > > >
> > > > On 2 February 2017 at 08:40, Sean Dague <sean at dague.net> wrote:
> > > >
> > > > On 02/02/2017 11:16 AM, Matthew Treinish wrote:
> > > > <snip>
> > > > > <oops, forgot to finish my thought>
> > > > >
> > > > > We definitely aren't saying that running a single worker is how we
> > > > > recommend people run OpenStack. But it just adds to the differences
> > > > > between the gate and what we expect things to actually look like.
> > > >
> > > > I'm all for actually getting to the bottom of this, but honestly real
> > > > memory profiling is needed here. The growth across projects probably
> > > > means that some common libraries are part of this. The ever-growing
> > > > requirements list is demonstrative of that. Code reuse is good, but if
> > > > we are importing much of a library to get access to a couple of
> > > > functions, we're going to take a bunch of memory weight on that
> > > > (especially if that library has friendly auto-imports in its top-level
> > > > __init__.py, so we can't pull in only the parts we want).
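As a quick aside on the import-weight theory: the stdlib tracemalloc (Python
3, or the pytracemalloc backport on 2.7) can give a first-order answer to how
much memory a single import drags in. A rough sketch, one module name per
fresh interpreter run (the module name below is only an example):

    # import_cost.py -- run one module per fresh interpreter, since anything
    # already in sys.modules (or pulled in by a shared dependency) won't show
    # up on a second import; tracemalloc also only sees allocations that go
    # through the Python allocators, so C-extension usage is undercounted.
    import importlib
    import sys
    import tracemalloc

    tracemalloc.start()
    importlib.import_module(sys.argv[1])    # e.g. "oslo_config"
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print('%s: %.1f MiB retained, %.1f MiB peak'
          % (sys.argv[1], current / 2.0 ** 20, peak / 2.0 ** 20))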
> > > >
> > > > Changing the worker count is just shuffling around deck chairs.
> > > >
> > > > I'm not familiar enough with memory profiling tools in Python to know
> > > > the right approach we should take to get this down to the individual
> > > > libraries / objects that are holding all our memory. Anyone more
> > > > skilled here able to help lead the way?
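The same tool can also get at the libraries/objects question at runtime: take
one snapshot at startup, another after the service has handled some load, and
diff them grouped by file. A minimal sketch, with the hook point into the
service left open (where exactly to wire this in is the part I haven't worked
out):

    import tracemalloc

    tracemalloc.start(25)                   # keep up to 25 frames per allocation
    baseline = tracemalloc.take_snapshot()  # e.g. right after service startup

    # ... let the service run / serve a batch of tempest requests ...

    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.compare_to(baseline, 'filename')[:20]:
        print(stat)                         # biggest growth first, per file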
> > > >
> > > >
> > > > From what I hear, the overall consensus on this matter is to determine
> > > > what actually caused the memory consumption bump and how to address it,
> > > > but that's more of a medium- to long-term action. In fact, to me this
> > > > is one of the top-priority matters we should talk about at the
> > > > imminent PTG.
> > > >
> > > > For the time being, and to provide relief to the gate, should we lock
> > > > API_WORKERS to 1? I'll post something for review and see how many
> > > > people shoot it down :)
> > >
> > > I don't think we want to do that. It's going to force the eventlet API
> > > workers down to a single process, and it's not super clear that eventlet
> > > handles backups on the inbound socket well. I honestly would expect that
> > > to create different, hard-to-debug issues, especially with high chatter
> > > rates between services.
> > >
> > >
> > > I must admit I share your fear, but in the tests that I have executed so
> > > far in [1,2,3], the house didn't burn down. I am looking for other ways
> > > to get a substantial memory saving with a relatively quick and dirty fix,
> > > but I am coming up empty-handed thus far.
> > >
> > > [1] https://review.openstack.org/#/c/428303/
> > > [2] https://review.openstack.org/#/c/427919/
> > > [3] https://review.openstack.org/#/c/427921/
> >
> > This failure in the first patch -
> > http://logs.openstack.org/03/428303/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/71f42ea/logs/screen-n-api.txt.gz?level=TRACE#_2017-02-02_19_14_11_751
> > - looks exactly like what I would expect from API worker starvation.
> >
> >
> > Not sure I agree on this one; this has been observed multiple times in
> > the gate already [1] (though I am not sure there's a bug for it), and I
> > don't believe it has anything to do with the number of API workers,
> > unless not even two workers are enough.
>
> There is no guarantee that 2 workers are enough. I'm not surprised if we
> see some of that failure today. This was all guesswork on trimming worker
> counts to deal with the memory issue in the past. But we're running
> tests in parallel, and the services are making calls back to other
> services all the time.
>
> This is one of the reasons to get the WSGI stack off of eventlet and
> into a real web server, as those handle HTTP request backups much, much
> better.
>
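For context on what that move looks like in practice: the service ends up
exposing a plain WSGI callable, and the web server (uwsgi, mod_wsgi, ...)
owns the listening socket, the worker processes and the request backlog,
instead of the eventlet loop inside the service process. A generic sketch,
not the actual nova/neutron entry points:

    # wsgi_sketch.py -- illustrative only, not a real service entry point.
    def application(environ, start_response):
        # stand-in for the real paste pipeline / API router
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [b'ok\n']

    if __name__ == '__main__':
        # roughly what the services do today: eventlet serves the app
        # from inside the service process itself.
        import eventlet
        import eventlet.wsgi
        eventlet.wsgi.server(eventlet.listen(('127.0.0.1', 8080)), application)

    # Under a real web server the same callable is pointed at directly, e.g.
    # (flags quoted from memory, treat as illustrative):
    #   uwsgi --http-socket 127.0.0.1:8080 --wsgi-file wsgi_sketch.py \
    #         --processes 2 --threads 4 --listen 1024
    # so the connection backlog and queueing live in uwsgi, not in our code.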
> I do understand that people want a quick fix here, but I'm not convinced
> that it exists.
>
Fair enough. The main intent of this conversation, for me, was to spur debate
and gather opinions. So long as we agree that fixing memory hunger is a
concerted effort, and that we cannot let one service go on a diet while
another binge eats, I am OK with limping along for as long as it takes to
bring things back into shape.
>
> -Sean
>
> --
> Sean Dague
> http://dague.net
>