[openstack-dev] [nova] [all] Excessively high greenlet default + excessively low connection pool defaults leads to connection pool latency, timeout errors, idle database connections / workers
mbayer at redhat.com
Fri Dec 18 20:33:57 UTC 2015
Hi all -
Let me start out with the assumptions I'm going from for what I want to
talk about:
1. I'm looking at Nova right now, but I think similar things are going
on in other OpenStack apps.
2. Settings that we see in nova.conf, including:
#wsgi_default_pool_size = 1000
#max_pool_size = <None>
#max_overflow = <None>
#osapi_compute_workers = <None>
#metadata_workers = <None>
are often not understood by deployers, and/or are left unchanged in a
wide variety of scenarios. If you are in fact working for deployers
that *do* change these values to something totally different, then you
might not be impacted here, and if it turns out that everyone changes
all these settings in real-world scenarios and zzzeek you are just being
silly thinking nobody sets these appropriately, then fooey for me, I guess.
3. There's talk about more OpenStack services, at least Nova from what I
heard the other day, moving to be based on a real webserver deployment
in any case, the same way Keystone is. To the degree this is true, it
would also mitigate what I'm seeing, but still, there are good changes
that can be made here.
Basically, the syndrome I want to talk about can be mostly mitigated
just by changing the numbers around in #2, but I don't really know that
people know any of this, and also I think some of the defaults here
should just be changed completely as their current values are useless in
pretty much all cases.
Suppose we run on a 24-core machine, and therefore have 24 API worker
processes. Each worker represents a WSGI server, which will use an
eventlet greenlet pool with 1000 greenlets.
Then, assuming neither max_pool_size nor max_overflow is changed, this
indicates that for a single SQLAlchemy Engine, the most database
connections that are allowed by this Engine at one time is *15*:
pool_size defaults to 5 and max_overflow defaults to 10. We get our
engine from oslo.db; however, oslo.db does not change these defaults,
which ultimately come from SQLAlchemy itself.
The problem is then basically that 1000 greenlets is way, way more than
15, meaning hundreds of requests can all pile up on a process and all be
blocked, waiting for a connection pool that's been configured to only
allow 15 database connections at most.
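To put numbers on that mismatch, here is a minimal sketch of the arithmetic, using only the defaults quoted above and the 24-worker example. The variable names are made up for illustration and are not Nova or SQLAlchemy settings.

```python
# Defaults quoted in this post; the names below are illustrative only.
POOL_SIZE = 5                 # SQLAlchemy default pool_size
MAX_OVERFLOW = 10             # SQLAlchemy default max_overflow
GREENLETS_PER_WORKER = 1000   # wsgi_default_pool_size
WORKERS = 24                  # one API worker per core in the example

# Most connections a single Engine (one per worker process) will hand out.
max_conns_per_worker = POOL_SIZE + MAX_OVERFLOW

# Greenlets in one worker that can be running with no connection available.
greenlets_without_conn = GREENLETS_PER_WORKER - max_conns_per_worker

# Connections available across all workers, if load were spread perfectly.
max_conns_total = max_conns_per_worker * WORKERS

print(max_conns_per_worker)    # 15
print(greenlets_without_conn)  # 985
print(max_conns_total)         # 360
```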
But wait! You say. We have twenty-four worker processes. So if we
had 100 concurrent requests, these requests would not all pile up on
just one process, they'd be distributed among the workers. Any
additional requests beyond the 15 * 24 == 360 that we can handle
(assuming a simplistic relationship between API requests and database
connections, which it is not) would just queue up as they do anyway, so
it makes no difference. Right? *Right???*
It does make a difference! Because show me in nova source code where
exactly this algorithm is that knows how to distribute requests evenly
among the workers...There is no such logic! Some months ago, I began
thinking and fretting, how the heck does this work? There's 24
workers, one socket.accept(), requests come in and sockets are somehow
divvyed up to child forks, but *how*? I asked some of the deep unix
gurus locally here, and the best answer we could come up with is: it's
random!
Cue the mythbusters music. "Nova receives WSGI requests and sends them
to workers with a random distribution, meaning that under load, some
workers will have too many requests and be waiting on DB access which
can in fact cause pool timeout issues under very latent circumstances,
others will be more idle than they should be".
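A rough way to get a feel for that claim is to simulate it. The sketch below assumes each incoming request lands on a uniformly random worker, which is only a stand-in for whatever the kernel really does with accept(), and it assumes one database connection per request. The function name and the request count of 300 are hypothetical.

```python
import random
from collections import Counter

WORKERS = 24
POOL_LIMIT = 15  # pool_size 5 + max_overflow 10, per worker


def distribute(num_requests, workers=WORKERS, seed=0):
    """Assign each request to a uniformly random worker (an assumption).

    Returns a list with the number of requests each worker received.
    """
    rng = random.Random(seed)
    counts = Counter(rng.randrange(workers) for _ in range(num_requests))
    return [counts.get(w, 0) for w in range(workers)]


# 300 concurrent requests is below the 360 connections available in total.
loads = distribute(300)

overloaded = [n for n in loads if n > POOL_LIMIT]
spare_slots = sum(POOL_LIMIT - n for n in loads if n < POOL_LIMIT)

print("requests per worker:", loads)
print("workers over their pool limit:", len(overloaded))
print("unused connection slots on other workers:", spare_slots)
```

Try different seeds and request counts. If the random model is anywhere near right, it should be common to see a few workers over their limit while other workers still have unused slots.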
As we begin the show, we cut into a background segment where we show
that in fact, Mike and some other folks doing load testing actually
*see* connection pool timeout errors in the logs already, on a 24 core
machine, even though we see hundreds of idle connections at the same
time (just to note, the error we are talking about is "QueuePool limit
of size 5 overflow 5 reached, connection timed out, timeout 5"). So
that we actually see this happening in an actual test situation is what
led me to finally just write a test suite for the whole thing.
Here's the test suite!
https://gist.github.com/zzzeek/c69138fd0d0b3e553a1f I've tried to make
this as super-simple as possible to read, use, and understand. It uses
Nova's nova.wsgi.Server directly with a simple "hello-world" style app,
as well as oslo_service.service.Service and service.launch() the same
way I see in nova/service.py (please confirm I'm using all the right
code and things here just like in Nova, thanks!). The "hello world"
app gets a connection from the pool, does nothing with it, waits a few
seconds then returns it. All the while counting everything going on
and reporting on its metrics every 10 requests.
The "hello world" app uses a SQLAlchemy connection pool with a little
bit lower number of connections, and a timeout of only ten seconds
instead of thirty by default (but feel free to change it on the command
line), and a "work" operation that takes a random amount of time between
zero and five seconds, just to make the problem more obviously
reproducible on any hardware. When we leave the default greenlets at
1000 and hit the server with Apache ab and concurrency of at least 75,
there are connection pool timeouts galore, and the metrics also show
workers waiting anywhere from a full second to 5 seconds (before timing
out) for a database connection:
INFO:loadtest:Status for pid 32625: avg wait time for connection
worst wait time 3.9267 sec; connection failures 5;
num requests over the limit: 29; max concurrency seen: 25
ERROR:loadtest:error in pid 32630: QueuePool limit of size 5 overflow 5
reached, connection timed out, timeout 5
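The mechanism behind that error can be shown in miniature with nothing but the standard library. This is a toy sketch, not the gist above: threads stand in for greenlets, a semaphore stands in for the connection pool, and all the numbers are made up so that it runs in well under a second.

```python
import threading
import time

POOL_SLOTS = 2      # stand-in for pool_size + max_overflow
HANDLERS = 6        # stand-in for greenlets running at the same time
WORK_SECONDS = 0.3  # how long a handler holds its "connection"
POOL_TIMEOUT = 0.1  # how long a handler waits for a free slot

slots = threading.BoundedSemaphore(POOL_SLOTS)
outcomes = []  # list.append is thread-safe in CPython


def handler():
    # Wait for a slot, giving up after POOL_TIMEOUT like a pool timeout.
    if not slots.acquire(timeout=POOL_TIMEOUT):
        outcomes.append("timeout")
        return
    try:
        time.sleep(WORK_SECONDS)  # pretend to do database work
        outcomes.append("ok")
    finally:
        slots.release()


threads = [threading.Thread(target=handler) for _ in range(HANDLERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(outcomes.count("ok"), "served,", outcomes.count("timeout"), "timed out")
```

Because the work takes longer than the timeout, handlers that do not get a slot right away give up, even though nothing is actually broken.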
Bring the number of greenlets down to *ten* (yes, only ten) and the
errors go to zero, and the ab test completes the given number of
requests *faster* than it does with the 1000-greenlet version. The
average time a worker spends waiting for a database connection drops an
order of magnitude:
INFO:loadtest:Status for pid 460: avg wait time for connection
worst wait time 0.0540 sec; connection failures 0;
num requests over the limit: 0; max concurrency seen: 11
That's even though our workers' "fake" work requests are still taking as
long as 5 seconds per request to complete.
But if we only have a super low number of greenlets and only a few dozen
workers, what happens if more than 240 requests (10 greenlets x 24
workers) come in at once? Aren't those connections going to get
rejected? No way! eventlet's
networking system is better than that, those connection requests just
get queued up in any case, waiting for a greenlet to be available. Play
with the script and its settings to see.
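If you just want to see the "queue up instead of reject" behavior without running the full script, here is a small stand-in. It uses a standard library thread pool rather than eventlet, so it only illustrates the idea of a small fixed pool with a queue in front of it. It is not eventlet's actual code path, and the Tracker class is a hypothetical helper.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 10  # stand-in for a small greenlet pool
REQUESTS = 50   # more requests than the pool can run at once


class Tracker:
    """Counts how many handlers are running at the same moment."""

    def __init__(self):
        self.lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def enter(self):
        with self.lock:
            self.active += 1
            self.peak = max(self.peak, self.active)

    def leave(self):
        with self.lock:
            self.active -= 1


tracker = Tracker()


def handle(request_id):
    tracker.enter()
    try:
        time.sleep(0.01)  # pretend to do some work
        return request_id
    finally:
        tracker.leave()


# Requests beyond POOL_SIZE wait in the executor's queue for a free slot.
with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    served = list(pool.map(handle, range(REQUESTS)))

print("served:", len(served), "peak concurrency:", tracker.peak)
```

Every request gets served, and the peak concurrency never goes above the pool size.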
But if we're blocking any connection attempts based on what's available
at the database level, aren't we under-utilizing for API calls that need
to do a lot of other things besides DB access? The answer is that may
very well be true! Which makes the guidance more complicated based on
what service we are talking about. So here, my guidance is oriented
towards those OpenStack services whose primary work is database
access.
Given the above caveat, I'm hoping people can look at this and verify my
assumptions and the results. Assuming I am not just drunk on eggnog,
what would my recommendations be? Basically:
1. at least for DB-oriented services, the number of 1000 greenlets
should be *way* *way* lower, and we most likely should allow for a lot
more connections to be used temporarily within a particular worker,
which means I'd take the max_overflow setting and default it to like 50,
or 100. The greenlet number should then be very similar to the
max_overflow number, and maybe even a little less, as Nova API calls
right now often will use more than one connection concurrently.
2. longer term, let's please drop the eventlet pool thing and just use a
real web server! (but still tune the connection pool appropriately). A
real web server will at least know how to efficiently direct requests to
worker processes. If all OpenStack workers were configurable under a
single web server config, that would also be a nice way to centralize
tuning and profiling overall.
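As a sanity check on recommendation 1, the sizing rule can be written down as a couple of lines of Python. This is only a sketch of the rule of thumb: the function names and the conns_per_request parameter are hypothetical, and the 50 and 100 figures are the ballpark values suggested above, not tested defaults.

```python
def pool_capacity(pool_size, max_overflow):
    """Most connections one worker's Engine will hand out at once."""
    return pool_size + max_overflow


def greenlets_fit(greenlets, pool_size, max_overflow, conns_per_request=1):
    """True if every greenlet could hold its connections at the same time."""
    needed = greenlets * conns_per_request
    return needed <= pool_capacity(pool_size, max_overflow)


# Current defaults: 1000 greenlets against 5 + 10 connections.
print(greenlets_fit(1000, 5, 10))                      # False

# Suggested ballpark: max_overflow of 50, with a similar greenlet count.
print(greenlets_fit(50, 5, 50))                        # True

# If a request uses two connections at once, 50 greenlets is too many.
print(greenlets_fit(50, 5, 50, conns_per_request=2))   # False
```

The last case is why the greenlet number may need to be a little lower than max_overflow.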
Thanks for reading!