Open Stack

Wed Jan 6 14:11:35 UTC 2016

Hi Mike,

Thank you for this brilliant analysis! We've been seeing such timeout
errors in downstream periodically and this is the first time someone
has analysed the root cause thoroughly.

On Fri, Dec 18, 2015 at 10:33 PM, Mike Bayer <mbayer at redhat.com> wrote:
> Hi all -
>
> Let me start out with the assumptions I'm going from for what I want to
> talk about.
>
> 1. I'm looking at Nova right now, but I think similar things are going
> on in other Openstack apps.
>
> 2. Settings that we see in nova.conf, including:
>
> #wsgi_default_pool_size = 1000
> #max_pool_size = <None>
> #max_overflow = <None>
> #osapi_compute_workers = <None>
> #metadata_workers = <None>
>
>
> are often not understood by deployers, and/or are left unchanged in a
> wide variety of scenarios.    If you are in fact working for deployers
> that *do* change these values to something totally different, then you
> might not be impacted here, and if it turns out that everyone changes
> all these settings in real-world scenarios and zzzeek you are just being
> silly thinking nobody sets these appropriately, then fooey for me, I guess.

My understanding is that DB connection pool / workers number options
are usually changed, while the number of eventlet greenlets is not:

http://codesearch.openstack.org/?q=wsgi_default_pool_size&i=nope&files=&repos=
http://codesearch.openstack.org/?q=max_pool_size&i=nope&files=&repos=

I think it's for "historical" reasons when MySQL-Python was considered
to be the default DB API driver and we had to work around its
concurrency issues with eventlet by using multiple forks of services.

But as you point out even with a non-blocking DB API driver like
pymysql we are still having problems with timeouts due to pool vs
greenlets number settings.

> 3. There's talk about more Openstack services, at least Nova from what I
> heard the other day, moving to be based on a real webserver deployment
> in any case, the same way Keystone is.   To the degree this is true
> would also mitigate what I'm seeing but still, there's good changes that
> can be made here.

I think, ideally we'd like to have "wsgi container agnostic" apps not
coupled to eventlet or anything else - so that it will be up to a
deployer to choose the application server.

> But if we only have a super low number of greenlets and only a few dozen
> workers, what happens if we have more than 240 requests come in at once,
> aren't those connections going to get rejected?  No way!  eventlet's
> networking system is better than that, those connection requests just
> get queued up in any case, waiting for a greenlet to be available.  Play
> with the script and its settings to see.

Right, it must be controlled by the backlog argument value here:

https://github.com/openstack/oslo.service/blob/master/oslo_service/wsgi.py#L80

> But if we're blocking any connection attempts based on what's available
> at the database level, aren't we under-utilizing for API calls that need
> to do a lot of other things besides DB access?  The answer is that may
> very well be true!   Which makes the guidance more complicated based on
> what service we are talking about.   So here, my guidance is oriented
> towards those Openstack services that are primarily doing database
> access as their primary work.

I believe, all our APIs are pretty much DB oriented.

> Given the above caveat, I'm hoping people can look at this and verify my
> assumptions and the results.    Assuming I am not just drunk on eggnog,
> what would my recommendations be?  Basically:
>
> 1. at least for DB-oriented services, the number of 1000 greenlets
> should be *way* *way* lower, and we most likely should allow for a lot
> more connections to be used temporarily within a particular worker,
> which means I'd take the max_overflow setting and default it to like 50,
> or 100.   The Greenlet number should then be very similar to the
> max_overflow number, and maybe even a little less, as Nova API calls
> right now often will use more than one connection concurrently.

I suggest we tweak the config options values in both oslo.service and
oslo.db to provide reasonable production defaults and document the
"correlation" between DB connection pool / greenlet workers number
settings.

> 2. longer term, let's please drop the eventlet pool thing and just use a
> real web server!  (but still tune the connection pool appropriately).  A
> real web server will at least know how to efficiently direct requests to
> worker processes.   If all Openstack workers were configurable under a
> single web server config, that would also be a nice way to centralize
> tuning and profiling overall.

I'd rather we simply not couple to eventlet unconditionally and allow
deployers to choose the WSGI container they want to use.

Thanks,
Roman

Open Stack

[openstack-dev] [nova] [all] Excessively high greenlet default + excessively low connection pool defaults leads to connection pool latency, timeout errors, idle database connections / workers

OpenStack

Community

Documentation

Branding & Legal