[openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

Sean Dague sean at dague.net
Thu Feb 2 20:50:22 UTC 2017


On 02/02/2017 03:32 PM, Armando M. wrote:
> 
> 
> On 2 February 2017 at 12:19, Sean Dague <sean at dague.net> wrote:
> 
>     On 02/02/2017 02:28 PM, Armando M. wrote:
>     >
>     >
>     > On 2 February 2017 at 10:08, Sean Dague <sean at dague.net> wrote:
>     >
>     >     On 02/02/2017 12:49 PM, Armando M. wrote:
>     >     >
>     >     >
>     >     > On 2 February 2017 at 08:40, Sean Dague <sean at dague.net> wrote:
>     >     >
>     >     >     On 02/02/2017 11:16 AM, Matthew Treinish wrote:
>     >     >     <snip>
>     >     >     > <oops, forgot to finish my though>
>     >     >     >
>     >     >     > We definitely aren't saying running a single worker is how
>     >     we recommend people
>     >     >     > run OpenStack by doing this. But it just adds on to the
>     >     differences between the
>     >     >     > gate and what we expect things actually look like.
>     >     >
>     >     >     I'm all for actually getting to the bottom of this, but
>     >     honestly real
>     >     >     memory profiling is needed here. The growth across projects
>     >     probably
>     >     >     means that some common libraries are some part of this. The
>     >     ever growing
>     >     >     requirements list is demonstrative of that. Code reuse is
>     >     good, but if
>     >     >     we are importing much of a library to get access to a
>     couple of
>     >     >     functions, we're going to take a bunch of memory weight
>     on that
>     >     >     (especially if that library has friendly auto imports in
>     top level
>     >     >     __init__.py so we can't get only the parts we want).
>     >     >
>     >     >     Changing the worker count is just shuffling around deck
>     chairs.
>     >     >
>     >     >     I'm not familiar enough with memory profiling tools in
>     python
>     >     to know
>     >     >     the right approach we should take there to get this down to
>     >     individual
>     >     >     libraries / objects that are containing all our memory.
>     Anyone
>     >     more
>     >     >     skilled here able to help lead the way?
>     >     >
>     >     >
>     >     > From what I hear, the overall consensus on this matter is to
>     determine
>     >     > what actually caused the memory consumption bump and how to
>     >     address it,
>     >     > but that's more of a medium to long term action. In fact, to me
>     >     this is
>     >     > one of the top priority matters we should talk about at the
>     >     imminent PTG.
>     >     >
>     >     > For the time being, and to provide relief to the gate, should we
>     >     want to
>     >     > lock the API_WORKERS to 1? I'll post something for review
>     and see how
>     >     > many people shoot it down :)
>     >
>     >     I don't think we want to do that. It's going to force down the
>     eventlet
>     >     API workers to being a single process, and it's not super
>     clear that
>     >     eventlet handles backups on the inbound socket well. I
>     honestly would
>     >     expect that creates different hard to debug issues, especially
>     with high
>     >     chatter rates between services.
>     >
>     >
>     > I must admit I share your fear, but out of the tests that I have
>     > executed so far in [1,2,3], the house didn't burn in a fire. I am
>     > looking for other ways to have a substantial memory saving with a
>     > relatively quick and dirty fix, but coming up empty handed thus far.
>     >
>     > [1] https://review.openstack.org/#/c/428303/
>     > [2] https://review.openstack.org/#/c/427919/
>     > [3] https://review.openstack.org/#/c/427921/
> 
>     This failure in the first patch -
>     http://logs.openstack.org/03/428303/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/71f42ea/logs/screen-n-api.txt.gz?level=TRACE#_2017-02-02_19_14_11_751
>     looks exactly like what I would expect from API worker starvation.
> 
> 
> Not sure I agree on this one, this has been observed multiple times in
> the gate already [1] (though I am not sure there's a bug for it), and I
> don't believe it has anything to do with the number of API workers,
> unless not even two workers are enough.

There is no guarantee that 2 workers are enough. I wouldn't be surprised
if we see some of that failure today as well. Trimming worker counts to
deal with the memory issue was all guesswork in the past. But we're
running tests in parallel, and the services are making calls back to
other services all the time.

This is one of the reasons to get the WSGI stack off of eventlet and into
a real web server: real web servers handle backups of HTTP requests much,
much better.
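
As a rough sketch of what that could look like (purely illustrative --
the WSGI entry point and port differ per project), running an API under
standalone uWSGI instead of the eventlet-based workers is roughly:

    [uwsgi]
    # Hypothetical WSGI entry point for an API service.
    wsgi-file = /path/to/service-api-wsgi
    master = true
    processes = 2
    threads = 2
    http-socket = 127.0.0.1:8774
    # The web server owns the listen socket and queues pending requests,
    # instead of relying on eventlet's accept loop inside the service.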

I do understand that people want a quick fix here, but I'm not convinced
that it exists.
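
On the profiling question upthread: a minimal sketch of attributing
memory to individual libraries with tracemalloc (stdlib on Python 3;
there is a pytracemalloc backport for 2.7), run inside the service
under test, would be something like:

    import tracemalloc

    tracemalloc.start(25)  # keep 25 frames so we can also group by full traceback

    # ... start the service / exercise the API as usual ...

    snapshot = tracemalloc.take_snapshot()
    # Group live allocations by the file that made them; this is usually
    # enough to point the finger at a specific library.
    for stat in snapshot.statistics('filename')[:20]:
        print(stat)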

	-Sean

-- 
Sean Dague
http://dague.net


