<html><head><style>body{font-family:Helvetica,Arial;font-size:13px}</style></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">Hi Jonathan,</div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br></div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">This might be related to your issue.</div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br></div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">I think there are two problems here. The first has to do with the limited page size in memcache. If you have a very large number of tokens issued (and, as you said, neutron is making a ton of requests for new tokens), you can fill up the user’s token index (because of the way tokens are stored, we need to keep track of which user each token belongs to, so we can invalidate/revoke them as appropriate). By default, the memcache page size is 1MB. Each token takes up approximately 32 bytes (plus some overhead for separators, etc.), meaning we can hold about 32k active tokens for a single user. The second issue is that as this list grows, there is a higher penalty from the round-trips to memcache needed to verify that the tokens are still valid (we try to keep this list at a “sane” size, but in some cases this performs poorly). 
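To make the arithmetic concrete, here is a rough sketch of where the ~32k figure comes from (the exact sizes are assumptions for illustration: a 1MB memcached page, 32-byte token IDs, one byte of separator per entry):

```python
# Rough capacity estimate for the per-user token index in a single
# memcached page. The byte counts below are illustrative assumptions.
PAGE_SIZE = 1024 * 1024   # default memcached page size: 1MB
TOKEN_ID_BYTES = 32       # a UUID token ID stored as 32 hex characters
SEPARATOR_BYTES = 1       # delimiter between IDs in the index list

max_tokens_per_user = PAGE_SIZE // (TOKEN_ID_BYTES + SEPARATOR_BYTES)
print(max_tokens_per_user)  # 31775, i.e. roughly the "about 32k" above
```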
So, to prevent filling up the page and being completely unable to authenticate until the page falls out of memcache, we check every token for a given user for validity whenever a new token is issued.</div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br></div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">As long as you are not relying on long-lived tokens, you can decrease the life of the tokens by setting the “expiration” option in the “[token]” section of the configuration to a much smaller number of seconds (e.g. 3600 or so) rather than relying on the default one-day (86400) expiration. This means a given token will expire sooner, which helps keep that user-token index list smaller. If reducing the token expiration time helps significantly, then we know this is related to this bug: <a href="https://bugs.launchpad.net/keystone/+bug/1251123">https://bugs.launchpad.net/keystone/+bug/1251123</a></div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br></div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">I’ll see if I can get you a test patch to try that uses some of the same mechanisms I’m working on for Icehouse (specifically to address this issue). However, it is likely that you will need to reduce the token TTL in either case. 
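For reference, the expiration change would look something like this in keystone.conf (3600 is just an example value; pick whatever fits your workload):

```ini
# keystone.conf: shorten token lifetime from the 86400s (1 day) default
[token]
# e.g. one hour; short enough to keep the per-user token index small
expiration = 3600
```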
</div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br></div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">An alternative is to use the dogpile settings for tokens along with the SQL token backend (and add a cron job to flush expired tokens out on a regular basis). If you go that route, I still recommend setting the token expiration down to something in the ~3 hour range (if possible) to help keep bloat down.</div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br></div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">--Morgan</div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br></div><p style="color:#A0A0A8;">On January 11, 2014 at 10:43:12, Jonathan Proulx (<a href="mailto:jon@jonproulx.com">jon@jonproulx.com</a>) wrote:</p> <blockquote type="cite" class="clean_bq"><span><div><div>Hi All,
<br>
<br>I recently upgraded my 1-controller, 60-compute-node Ubuntu 12.04 (+cloud
<br>archive) system from Grizzly to Havana. Now, even before I let my
<br>users back onto the API, I'm barely able to do anything due to
<br>authentication timeouts. I am using neutron, which likes to
<br>authenticate *a lot*. I'm not entirely convinced the real problem
<br>isn't neutron reauthenticating a bajillion times a second, but
<br>right now it looks like Keystone...
<br>
<br>I'm using UUID Tokens with memcached backend.
<br>
<br>Under Grizzly I had been using Peter Feiner's multi-worker patches,
<br>though this was only needed for peak loads (when starting 100s of
<br>instances). With just the background noise of the running compute
<br>nodes (60) and instances (281), in the default single-worker (eventlet)
<br>mode keystone runs at 100% and many client requests (from the dashboard
<br>or CLI) time out. Nova-compute nodes also frequently log timeouts when
<br>trying to authenticate to the neutron service.
<br>
<br>Verbose keystone logs show very little activity even when running at
<br>100% load, just a few "INFO access" type entries every few *minutes*.
<br>Debug logging shows many sqlalchemy actions per second for the neutron
<br>user id and admin tenant id.
<br>
<br>I took the next obvious step and put keystone behind apache. While
<br>that does get more server processes running, performance is, if
<br>anything, even worse, while using virtually all of the 12-core
<br>controller node's CPUs rather than just one of them.
<br>
<br>The logs quickly fill with data read timeouts:
<br>
<br>2014-01-11 12:31:26.606 3054 INFO access [-] 192.168.128.43 - -
<br>[11/Jan/2014:17:31:26 +0000] "POST
<br>http://192.168.128.15:35357/v2.0/tokens HTTP/1.1" 500 167
<br>2014-01-11 12:31:26.621 3054 ERROR keystone.common.wsgi [-] request
<br>data read error
<br>2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi Traceback
<br>(most recent call last):
<br>2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi File
<br>"/usr/lib/python2.7/dist-packages/keystone/common/wsgi.py", line 371,
<br>in __call__
<br>2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi response =
<br>self.process_request(request)
<br>2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi File
<br>"/usr/lib/python2.7/dist-packages/keystone/middleware/core.py", line
<br>110, in process_request
<br>2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi
<br>params_json = request.body
<br>2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi File
<br>"/usr/lib/python2.7/dist-packages/webob/request.py", line 677, in
<br>_body__get
<br>2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi
<br>self.make_body_seekable() # we need this to have content_length
<br>2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi File
<br>"/usr/lib/python2.7/dist-packages/webob/request.py", line 922, in
<br>make_body_seekable
<br>2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi self.copy_body()
<br>2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi File
<br>"/usr/lib/python2.7/dist-packages/webob/request.py", line 945, in
<br>copy_body
<br>2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi self.body
<br>= self.body_file.read(self.content_length)
<br>2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi File
<br>"/usr/lib/python2.7/dist-packages/webob/request.py", line 1521, in
<br>readinto
<br>2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi data =
<br>self.file.read(sz0)
<br>2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi IOError:
<br>request data read error
<br>2014-01-11 12:31:26.621 3054 TRACE keystone.common.wsgi
<br>
<br>I've played around with different process and thread counts in
<br>the WSGIDaemonProcess apache directive, but while some seem to make it
<br>worse, none have made it better.
<br>
<br>Clearly I must be doing something wrong since the single process
<br>eventlet mode has significantly better performance than the
<br>multiprocess wsgi mode.
<br>
<br>I've also fiddled a bit with the dogpile cache settings. When running
<br>a single stand-alone process, the 'memory' backend seemed to make
<br>things actually go. After getting the pylibmc backend set up (or so I
<br>think; there could well be more I'm missing), which didn't make a
<br>noticeable difference, I wasn't able to revert to the 'success' of
<br>the memory backend, though for obvious reasons I wouldn't have wanted
<br>to keep that one in production anyway.
<br>
<br>How can I either make Keystone go faster or Neutron authenticate less?
<br>
<br>-Jon
<br>
<br>_______________________________________________
<br>Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
<br>Post to : openstack@lists.openstack.org
<br>Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
<br></div></div></span></blockquote></body></html>