[openstack-dev] [infra] [gate] [all] openstack services footprint lead to oom-kill in the gate
Kevin Benton
kevin at benton.pub
Thu Feb 2 21:48:05 UTC 2017
Note the HTTPS in the traceback in the bug report. Also note the mention
of adjusting the Apache MPM settings to fix it. That seems to point to an
issue with Apache sitting in the middle rather than with eventlet and
API_WORKERS.
On Feb 2, 2017 14:36, "Ihar Hrachyshka" <ihrachys at redhat.com> wrote:
> The BadStatusLine error is well known:
> https://bugs.launchpad.net/nova/+bug/1630664
>
> Now, it doesn't mean that the root cause of the error message is the
> same, and it may as well be that lowering the number of workers
> triggered it. All I am saying is we saw that error in the past.
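For anyone who has not dug into that exception before: BadStatusLine is what
the stdlib HTTP client raises when the far end closes the connection, or
answers with something that is not an HTTP status line, before the response
starts. A minimal, self-contained repro sketch, purely illustrative and not
the gate code (on current Python 3 the empty-reply case actually surfaces as
RemoteDisconnected, which subclasses BadStatusLine):

    import http.client
    import socket
    import threading

    def rude_server(listener):
        conn, _ = listener.accept()
        conn.recv(65536)   # read the request...
        conn.close()       # ...then hang up without ever sending a status line

    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    threading.Thread(target=rude_server, args=(listener,), daemon=True).start()

    client = http.client.HTTPConnection("127.0.0.1", listener.getsockname()[1])
    client.request("GET", "/")
    try:
        client.getresponse()
    except http.client.BadStatusLine as exc:
        print("BadStatusLine:", repr(exc))

That is the shape of failure you see when whatever sits in front of the API
drops the connection before responding.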
>
> Ihar
>
> On Thu, Feb 2, 2017 at 1:07 PM, Kevin Benton <kevin at benton.pub> wrote:
> > This error seems to be new in the Ocata cycle. It's either related to a
> > dependency change or the fact that we put Apache in between the services
> > now. Handling more concurrent requests than workers wasn't an issue
> > before.
> >
> > It seems that you are suggesting that eventlet can't handle concurrent
> > connections, which is the entire purpose of the library, no?
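For reference, a single eventlet API worker handles concurrency roughly like
this: one process, a green thread per accepted connection, all multiplexed
over one listen socket. A minimal sketch only; the port and pool size are
made up for illustration and are not devstack defaults:

    import eventlet
    eventlet.monkey_patch()

    from eventlet import wsgi

    def app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok\n"]

    # Each accepted connection is served by a green thread from this pool;
    # once the pool is exhausted, new connections wait in the kernel listen
    # backlog, which is where the "backups on the inbound socket" concern
    # quoted below comes from.
    pool = eventlet.GreenPool(size=100)
    sock = eventlet.listen(("127.0.0.1", 8080))
    wsgi.server(sock, app, custom_pool=pool)

So a single worker can juggle many requests, as long as none of them pins
the CPU or blocks outside of eventlet's control.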
> >
> > On Feb 2, 2017 13:53, "Sean Dague" <sean at dague.net> wrote:
> >>
> >> On 02/02/2017 03:32 PM, Armando M. wrote:
> >> >
> >> >
> >> > On 2 February 2017 at 12:19, Sean Dague <sean at dague.net> wrote:
> >> >
> >> > On 02/02/2017 02:28 PM, Armando M. wrote:
> >> > >
> >> > >
> >> > > On 2 February 2017 at 10:08, Sean Dague <sean at dague.net> wrote:
> >> > >
> >> > > On 02/02/2017 12:49 PM, Armando M. wrote:
> >> > > >
> >> > > >
> >> > > > On 2 February 2017 at 08:40, Sean Dague <sean at dague.net> wrote:
> >> > > >
> >> > > > On 02/02/2017 11:16 AM, Matthew Treinish wrote:
> >> > > > <snip>
> >> > > > > <oops, forgot to finish my thought>
> >> > > > >
> >> > > > > We definitely aren't saying that running a single worker is how
> >> > > > > we recommend people run OpenStack by doing this. But it just adds
> >> > > > > to the differences between the gate and what we expect things to
> >> > > > > actually look like.
> >> > > >
> >> > > > I'm all for actually getting to the bottom of this, but honestly
> >> > > > real memory profiling is needed here. The growth across projects
> >> > > > probably means that some common libraries are part of this. The
> >> > > > ever-growing requirements list is demonstrative of that. Code reuse
> >> > > > is good, but if we are importing much of a library to get access to
> >> > > > a couple of functions, we're going to take a bunch of memory weight
> >> > > > on that (especially if that library has friendly auto imports in
> >> > > > its top level __init__.py so we can't get only the parts we want).
> >> > > >
> >> > > > Changing the worker count is just shuffling around deck chairs.
> >> > > >
> >> > > > I'm not familiar enough with memory profiling tools in Python to
> >> > > > know the right approach we should take there to get this down to
> >> > > > individual libraries / objects that are containing all our memory.
> >> > > > Anyone more skilled here able to help lead the way?
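As a concrete starting point, the stdlib tracemalloc module (Python 3 only)
can attribute allocations to source files, which is usually enough to see
which libraries own the memory. A rough sketch; the frame depth and the
top-20 cut are arbitrary choices for illustration, not a vetted approach:

    import tracemalloc

    tracemalloc.start(25)   # keep deep tracebacks so allocations stay attributable

    # ... start the service / drive the test workload here ...

    snapshot = tracemalloc.take_snapshot()
    # Group allocations by the file that made them, so heavy libraries
    # stand out; switch to "traceback" for full call chains on the top hits.
    for stat in snapshot.statistics("filename")[:20]:
        print(stat)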
> >> > > >
> >> > > >
> >> > > > From what I hear, the overall consensus on this matter is to
> >> > > > determine what actually caused the memory consumption bump and how
> >> > > > to address it, but that's more of a medium to long term action. In
> >> > > > fact, to me this is one of the top priority matters we should talk
> >> > > > about at the imminent PTG.
> >> > > >
> >> > > > For the time being, and to provide relief to the gate, should we
> >> > > > want to lock the API_WORKERS to 1? I'll post something for review
> >> > > > and see how many people shoot it down :)
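Roughly speaking, API_WORKERS is the devstack knob for how many copies of
each API service get forked; all of the workers accept from the same listen
socket. A toy sketch of the pattern, using plain sockets and os.fork purely
for illustration (the real services wire this up through oslo.service, not
like this):

    import os
    import socket

    API_WORKERS = 1   # the value the proposed gate change would pin

    # One listen socket, created before forking, shared by every worker.
    sock = socket.socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("127.0.0.1", 8080))
    sock.listen(128)

    for _ in range(API_WORKERS):
        if os.fork() == 0:
            # Child: serve requests forever; with API_WORKERS=1 this single
            # process is the only thing draining the listen backlog.
            while True:
                conn, _ = sock.accept()
                conn.recv(65536)
                conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
                conn.close()

    # Parent just waits on its children, much as oslo.service's
    # ProcessLauncher does for the real API workers.
    for _ in range(API_WORKERS):
        os.wait()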
> >> > >
> >> > > I don't think we want to do that. It's going to force the eventlet
> >> > > API workers down to a single process, and it's not super clear that
> >> > > eventlet handles backups on the inbound socket well. I would honestly
> >> > > expect that to create different, hard-to-debug issues, especially
> >> > > with high chatter rates between services.
> >> > >
> >> > >
> >> > > I must admit I share your fear, but in the tests that I have
> >> > > executed so far in [1,2,3], the house didn't burn down. I am looking
> >> > > for other ways to get a substantial memory saving with a relatively
> >> > > quick and dirty fix, but I am coming up empty-handed thus far.
> >> > >
> >> > > [1] https://review.openstack.org/#/c/428303/
> >> > > [2] https://review.openstack.org/#/c/427919/
> >> > > [3] https://review.openstack.org/#/c/427921/
> >> >
> >> > This failure in the first patch -
> >> >
> >> > http://logs.openstack.org/03/428303/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/71f42ea/logs/screen-n-api.txt.gz?level=TRACE#_2017-02-02_19_14_11_751
> >> >
> >> > looks exactly like what I would expect from API worker starvation.
> >> >
> >> >
> >> > Not sure I agree on this one: this has been observed multiple times
> >> > in the gate already [1] (though I am not sure there's a bug for it),
> >> > and I don't believe it has anything to do with the number of API
> >> > workers, unless not even two workers are enough.
> >>
> >> There is no guarantee that 2 workers are enough. I'm not surprised if
> >> we see some of that failure today. This was all guesswork when we
> >> trimmed worker counts to deal with the memory issue in the past. But
> >> we're running tests in parallel, and the services are making calls back
> >> to other services all the time.
> >>
> >> This is one of the reasons to get the wsgi stack off of eventlet and
> >> into a real webserver, as they handle HTTP request backups much, much
> >> better.
> >>
> >> I do understand that people want a quick fix here, but I'm not
> >> convinced that it exists.
> >>
> >> -Sean
> >>
> >> --
> >> Sean Dague
> >> http://dague.net
> >>