[openstack-dev] [nova] [all] Excessively high greenlet default + excessively low connection pool defaults lead to connection pool latency, timeout errors, idle database connections / workers

Roman Podoliaka rpodolyaka at mirantis.com
Thu Jan 7 07:49:26 UTC 2016


Actually, we already do that in the parent process. The parent process
does the following (a minimal sketch follows the list):

1) starts and creates a socket

2) binds the socket and calls listen() on it passing the backlog value
(http://linux.die.net/man/2/listen)

3) passes the socket to the eventlet WSGI server
(https://github.com/openstack/oslo.service/blob/master/oslo_service/wsgi.py#L177-L192)

4) forks $*_workers times (child processes inherit all open file
descriptors including the socket one)

5) child processes call accept() in a loop
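
Here is a minimal sketch of that pattern (the port, worker count, pool
size and the trivial WSGI app are made up for illustration; the real
code is in the oslo.service link above):

    import os
    import eventlet
    from eventlet import wsgi

    def app(environ, start_response):
        # stand-in for a real OpenStack API application
        start_response('200 OK', [('Content-Type', 'text/plain')])
        return [b'ok\n']

    # steps 1-2: socket() + bind() + listen($backlog)
    sock = eventlet.listen(('0.0.0.0', 8080), backlog=128)

    # step 4: fork workers; children inherit the listening FD
    for _ in range(4):
        if os.fork() == 0:
            # steps 3 and 5: eventlet's WSGI server accept()s in a loop
            wsgi.server(sock, app, max_size=100)
            os._exit(0)

    for _ in range(4):
        os.wait()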

Linux gurus, please correct me here, but my understanding is that the
Linux kernel queues up to $backlog pending connections *per listening
socket*. In our case the child processes inherited the FD of the
socket, so they all accept() connections from the same queue in the
kernel, i.e. the backlog value applies to *all* child processes
combined, not *per* process.
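
A quick way to observe that (illustrative only; plain sockets, no
eventlet): several forked children accept() from the same inherited FD
and drain a single shared kernel queue:

    import os
    import socket

    sock = socket.socket()
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(('127.0.0.1', 9000))
    sock.listen(5)  # one queue of pending connections for this socket

    for _ in range(3):
        if os.fork() == 0:
            # all children block in accept() on the same kernel queue;
            # each incoming connection goes to exactly one of them
            conn, _addr = sock.accept()
            conn.sendall(b'handled by pid %d\n' % os.getpid())
            conn.close()
            os._exit(0)

    for _ in range(3):
        os.wait()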

>>>  E.g. all workers are saturated, it will place a waiting connection onto a random greenlet which then has to wait?

In each child process the eventlet WSGI server calls accept() in a
loop to get a client socket from the kernel and then hands it off to a
greenlet from a pool for processing:

https://github.com/eventlet/eventlet/blob/master/eventlet/wsgi.py#L846-L853
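
Schematically, it is something like this (a simplified paraphrase, not
the upstream source; handle_request stands in for eventlet's protocol
handler):

    import eventlet

    def handle_request(client_socket):
        client_socket.sendall(b'HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n')
        client_socket.close()

    sock = eventlet.listen(('127.0.0.1', 8081))
    pool = eventlet.GreenPool(100)  # max_size greenlets per child process

    while True:
        client_socket, address = sock.accept()       # take one queued connection
        pool.spawn_n(handle_request, client_socket)  # blocks when the pool is full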

The "saturation" point for a child process in our case will be when we
run out of available greenlets in the pool, so that pool.spawn_n()
call will block and it won't call accept() anymore, until one or more
greenlets finishes processing of previous requests.
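
You can see that blocking behavior with a toy pool (the sizes and the
sleep are made up):

    import eventlet

    pool = eventlet.GreenPool(2)    # pretend max_size is 2

    def slow_request():
        eventlet.sleep(60)          # simulate a long-running request

    pool.spawn_n(slow_request)
    pool.spawn_n(slow_request)
    pool.spawn_n(slow_request)      # pool is full: this call blocks, just
                                    # like the server loop stops accept()ing
    pool.waitall()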

Alternatively, a particular greenlet can make a blocking call that
never yields execution back to the event loop, so the eventlet WSGI
server's green thread won't get a chance to run and call accept()
(e.g. a call into MySQL-Python without tpool).
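
The usual workaround is to offload such calls to eventlet's OS thread
pool, e.g. (a sketch; the connection parameters and query are
hypothetical):

    from eventlet import tpool
    import MySQLdb  # C extension: its I/O does not yield to the eventlet hub

    conn = MySQLdb.connect(host='localhost', user='nova', db='nova')
    cursor = conn.cursor()

    # Calling cursor.execute() directly would block the whole process.
    # tpool.execute() runs it in a real OS thread, so only the calling
    # greenlet waits and the WSGI loop can keep accept()ing:
    tpool.execute(cursor.execute, 'SELECT SLEEP(5)')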

The kernel will queue up to $backlog connections for us until we call
accept() in one of the child processes.

On Thu, Jan 7, 2016 at 12:02 AM, Mike Bayer <mbayer at redhat.com> wrote:
>
>
> On 01/06/2016 09:11 AM, Roman Podoliaka wrote:
>> Hi Mike,
>>
>> Thank you for this brilliant analysis! We've been seeing such timeout
>> errors in downstream periodically and this is the first time someone
>> has analysed the root cause thoroughly.
>>
>> On Fri, Dec 18, 2015 at 10:33 PM, Mike Bayer <mbayer at redhat.com> wrote:
>>
>>> But if we only have a super low number of greenlets and only a few dozen
>>> workers, what happens if we have more than 240 requests come in at once,
>>> aren't those connections going to get rejected?  No way!  eventlet's
>>> networking system is better than that, those connection requests just
>>> get queued up in any case, waiting for a greenlet to be available.  Play
>>> with the script and its settings to see.
>>
>> Right, it must be controlled by the backlog argument value here:
>>
>> https://github.com/openstack/oslo.service/blob/master/oslo_service/wsgi.py#L80
>
> oh wow, totally missed that!  But, how does backlog here interact with
> multiple processes?   E.g. all workers are saturated, it will place a
> waiting connection onto a random greenlet which then has to wait?  It
> would be better if the "backlog" were pushed up to the parent process,
> not sure if that's possible?


