[Openstack] [Keystone] performance issues after havana upgrade

Jay Pipes jaypipes at gmail.com
Sat Jan 11 19:46:44 UTC 2014


Hi Jonathan! I have not yet deployed Havana to production; however, I'll
add some comments and suggestions below that you may want to try in
order to isolate the root cause...

On Sat, 2014-01-11 at 13:34 -0500, Jonathan Proulx wrote:
> Hi All,
> 
> recently upgraded my 1-controller, 60-compute-node Ubuntu 12.04 (+cloud
> archive) system from Grizzly to Havana.  Now, even before I let my
> users back onto the API, I'm barely able to do anything due to
> authentication timeouts.  I am using Neutron, which likes to
> authenticate *a lot*; I'm not entirely convinced the real problem
> isn't Neutron reauthenticating a bajillion services a second, but
> right now it looks like Keystone...
> 
> I'm using UUID tokens with the memcached token backend.

Two requests:

1) Could you paste your keystone.conf, please? I'm particularly curious
which backends you are using for the catalog driver and the identity
driver.

2) How are you configuring memcache? Please let us know whether you've
given memcache a decent amount of memory to support the token cache.
IIRC, our memcache-per-controller was 2GB, but we may have increased
that to 4GB... I can't remember. In any case, it should be a decent
amount of memory if you have lots of token validation and/or
revocation calls happening.
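
For reference, on Ubuntu the memcached memory limit is set with the -m
flag in /etc/memcached.conf (the stock package default is only 64MB,
which is nowhere near enough for a busy token cache). Something along
these lines -- the size and bind address here are purely illustrative,
not a recommendation for your environment:

    # /etc/memcached.conf (excerpt)
    -m 2048            # cache size in MB; Ubuntu's default is 64
    -p 11211
    -l 192.168.128.15  # bind to your controller's address (adjust)

followed by a "service memcached restart".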

Do a little bit of memcache performance investigation for us, as well.
It will tell us if memcache is having trouble keeping up with constant
cache flushing...

Telnet to the memcached instance:

telnet <HOST> 11211

then type "stats" a few times over the course of a minute, keeping an
eye on the number returned for "evictions". If that number increases
substantially during this mini-profile, it's an indication that
memcached is having a tough time keeping up with the rate of cache
insertions and is evicting least-used blocks to make room for new ones.
That can be a sign that you need to increase the amount of memory given
to memcached.
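
If telnetting repeatedly gets tedious, a tiny script can do the sampling
for you. This is just a sketch using python-memcached (which I believe
is the same library the memcache token backend uses); the host:port and
the sampling interval are assumptions, so adjust them for your
controller:

    import time
    import memcache   # python-memcached

    # Point this at your memcached instance (host below is an assumption).
    mc = memcache.Client(['192.168.128.15:11211'])

    previous = None
    for _ in range(7):                        # sample for roughly a minute
        server, stats = mc.get_stats()[0]     # (server_id, stats_dict)
        evictions = int(stats.get('evictions', 0))
        if previous is not None:
            print('%s: %d evictions (+%d in the last 10s)'
                  % (server, evictions, evictions - previous))
        previous = evictions
        time.sleep(10)

A steadily climbing delta is the same signal as the telnet loop above.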

> Under Grizzly I had been using Peter Feiner's multi-worker patches,
> though these were only needed for peak loads (when starting hundreds
> of instances).  With just the background noise of the running compute
> nodes (60) and instances (281), in the default single-worker (eventlet)
> mode keystone runs at 100% and many client requests (from the dashboard
> or CLI) time out.  Nova-compute nodes also frequently log timeouts when
> trying to authenticate to the neutron service.
> 
> Verbose keystone logs show very little activity even when running at
> 100% load, just a few "INFO access" type entries every few *minutes*.
> Debug logging shows many sqlalchemy actions per second for the neutron
> user id and admin tenant id.

OK, so clearly Keystone is configured to use the sql driver for at least
*something*... likely the identity and/or catalog driver.

If you are using the sql driver for the catalog, I would highly advise
switching to the templated catalog driver. It's essentially a static
file that is cached by the driver and used as a string-interpolation
template for the service catalog (which is built on EVERY single request
for a new token, since the token includes the service catalog attached
to the domain/project/user).
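
For the record, the switch is just a couple of lines in keystone.conf
plus a static template file. The driver path and URLs below are from
memory for a Havana-era install, so double-check them against your
packaged tree before relying on them:

    [catalog]
    driver = keystone.catalog.backends.templated.TemplatedCatalog
    template_file = /etc/keystone/default_catalog.templates

and the template file itself is just blocks of lines like:

    catalog.RegionOne.identity.publicURL = http://192.168.128.15:$(public_port)s/v2.0
    catalog.RegionOne.identity.adminURL = http://192.168.128.15:$(admin_port)s/v2.0
    catalog.RegionOne.identity.internalURL = http://192.168.128.15:$(public_port)s/v2.0
    catalog.RegionOne.identity.name = Identity Service

(one block per region/service, with your endpoint hosts substituted in).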

> I took the next obvious step and put keystone behind apache; while
> that does get more servers running, performance is, if anything, even
> worse, while using virtually all of the 12-core controller node's CPUs
> rather than just one of them.
> 
> The logs quickly fill with data read timeouts:
> 
> 2014-01-11 12:31:26.606 3054 INFO access [-] 192.168.128.43 - - [11/Jan/2014:17:31:26 +0000] "POST http://192.168.128.15:35357/v2.0/tokens HTTP/1.1" 500 167
> 2014-01-11 12:31:26.621 3054 ERROR keystone.common.wsgi [-] request data read error
> 2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi Traceback (most recent call last):
> 2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi   File "/usr/lib/python2.7/dist-packages/keystone/common/wsgi.py", line 371, in __call__
> 2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi     response = self.process_request(request)
> 2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi   File "/usr/lib/python2.7/dist-packages/keystone/middleware/core.py", line 110, in process_request
> 2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi     params_json = request.body
> 2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi   File "/usr/lib/python2.7/dist-packages/webob/request.py", line 677, in _body__get
> 2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi     self.make_body_seekable() # we need this to have content_length
> 2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi   File "/usr/lib/python2.7/dist-packages/webob/request.py", line 922, in make_body_seekable
> 2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi     self.copy_body()
> 2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi   File "/usr/lib/python2.7/dist-packages/webob/request.py", line 945, in copy_body
> 2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi     self.body = self.body_file.read(self.content_length)
> 2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi   File "/usr/lib/python2.7/dist-packages/webob/request.py", line 1521, in readinto
> 2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi     data = self.file.read(sz0)
> 2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi IOError: request data read error
> 2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi
> 
> I've played around with different process and thread count numbers in
> the WSGIDaemonProcess apache directive, but while some seem to make it
> worse, none have made it better.

Hmm. The traceback above really doesn't seem to indict either the token
or the catalog backend -- the IOError indicates that there is an issue
in *reading the incoming WSGI request body* :(

That request body is clearly being constructed by one of the
python-xxxclients -- probably python-neutronclient -- so perhaps it is
worth doing some python-neutronclient CLI calls with --debug to see if
you can get an idea of why on Earth webob.Request is having problems
constructing a request body from the client call. Very odd indeed.
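
Something as simple as the following, run from one of the compute nodes
so it takes the same path as the failing service traffic, should do (the
exact command is just an example; any read-only call with --debug will
show the token request):

    neutron --debug net-list

The --debug output should include the POST to /v2.0/tokens and its body,
which would tell us whether the truncated/stalled body originates on the
client side or somewhere in between (proxy settings, MTU, keepalives...).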

Best,
-jay

> Clearly I must be doing something wrong, since the single-process
> eventlet mode has significantly better performance than the
> multiprocess wsgi mode.
> 
> I've also fiddled a bit with the dogpile cache settings.  When running
> a single standalone process, the 'memory' backend seemed to make things
> actually go; after getting the pylibmc backend set up (or at least I
> think it's set up -- there could well be more I'm missing), which didn't
> make a noticeable difference, I wasn't able to get back to the 'success'
> of the memory backend, though for obvious reasons I wouldn't have wanted
> to keep that one in production anyway.
> 
> How can I either make Keystone go faster or Neutron authenticate less?
> 
> -Jon
> 