[openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate

Sean Dague sean at dague.net
Thu Feb 2 20:19:25 UTC 2017


On 02/02/2017 02:28 PM, Armando M. wrote:
> 
> 
> On 2 February 2017 at 10:08, Sean Dague <sean at dague.net> wrote:
> 
>     On 02/02/2017 12:49 PM, Armando M. wrote:
>     >
>     >
>     > On 2 February 2017 at 08:40, Sean Dague <sean at dague.net> wrote:
>     >
>     >     On 02/02/2017 11:16 AM, Matthew Treinish wrote:
>     >     <snip>
>     >     > <oops, forgot to finish my thought>
>     >     >
>     >     > We definitely aren't saying running a single worker is how we
>     >     > recommend people run OpenStack by doing this. But it just adds on
>     >     > to the differences between the gate and what we expect things
>     >     > actually look like.
>     >
>     >     I'm all for actually getting to the bottom of this, but honestly
>     >     real memory profiling is needed here. The growth across projects
>     >     probably means that some common libraries are some part of this.
>     >     The ever growing requirements list is demonstrative of that. Code
>     >     reuse is good, but if we are importing much of a library to get
>     >     access to a couple of functions, we're going to take a bunch of
>     >     memory weight on that (especially if that library has friendly
>     >     auto imports in top level __init__.py so we can't get only the
>     >     parts we want).
>     >
>     >     Changing the worker count is just shuffling around deck chairs.
>     >
>     >     I'm not familiar enough with memory profiling tools in python to
>     >     know the right approach we should take there to get this down to
>     >     individual libraries / objects that are containing all our memory.
>     >     Anyone more skilled here able to help lead the way?
>     >
>     >
>     > From what I hear, the overall consensus on this matter is to determine
>     > what actually caused the memory consumption bump and how to address
>     > it, but that's more of a medium to long term action. In fact, to me
>     > this is one of the top priority matters we should talk about at the
>     > imminent PTG.
>     >
>     > For the time being, and to provide relief to the gate, should we want
>     > to lock the API_WORKERS to 1? I'll post something for review and see
>     > how many people shoot it down :)
> 
>     I don't think we want to do that. It's going to force down the eventlet
>     API workers to being a single process, and it's not super clear that
>     eventlet handles backups on the inbound socket well. I honestly would
>     expect that creates different hard to debug issues, especially with high
>     chatter rates between services.
> 
> 
> I must admit I share your fear, but out of the tests that I have
> executed so far in [1,2,3], the house didn't burn in a fire. I am
> looking for other ways to have a substantial memory saving with a
> relatively quick and dirty fix, but coming up empty handed thus far.
> 
> [1] https://review.openstack.org/#/c/428303/
> [2] https://review.openstack.org/#/c/427919/
> [3] https://review.openstack.org/#/c/427921/

This failure in the first patch -
http://logs.openstack.org/03/428303/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/71f42ea/logs/screen-n-api.txt.gz?level=TRACE#_2017-02-02_19_14_11_751
looks exactly like what I would expect from API worker starvation.
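
On the memory profiling question quoted above (getting the footprint
down to individual libraries), here is a minimal sketch of one possible
starting point, not a vetted methodology. It assumes Python 3, where
tracemalloc is in the standard library. The helper name import_cost and
the "json" stand-in module are hypothetical, chosen only for
illustration. tracemalloc only sees allocations made through Python's
allocators, so memory taken by C extensions that call malloc directly
would not show up here.

```python
import importlib
import tracemalloc


def import_cost(module_name, top_n=5):
    """Import module_name and report the Python-level memory it allocated.

    Returns (bytes_delta, top_stats), where top_stats lists the source files
    responsible for the largest allocations. If the module was already
    imported, the delta will be close to zero, so run this in a fresh
    interpreter for meaningful numbers.
    """
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    importlib.import_module(module_name)
    after, _peak = tracemalloc.get_traced_memory()
    snapshot = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Group live allocations by the file that made them, largest first.
    top_stats = snapshot.statistics("filename")[:top_n]
    return after - before, top_stats


if __name__ == "__main__":
    # "json" is only a stand-in; substitute the library under suspicion.
    delta, stats = import_cost("json")
    print("import cost: %d bytes" % delta)
    for stat in stats:
        print(stat)
```

Running this once per suspect library, each in a fresh interpreter,
would give a rough per-library import weight to compare, including
whatever a top-level __init__.py pulls in automatically.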

	-Sean

-- 
Sean Dague
http://dague.net


