<html><head><style>body{font-family:Helvetica,Arial;font-size:13px}</style></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">Hi Jon,</div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br></div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">I have published a patch set that I hope will help to address this issue: <a href="https://review.openstack.org/#/c/66149/">https://review.openstack.org/#/c/66149/</a> . If you need this in another format, please let me know.</div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br></div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">The only caveat is that you should expect the maximum number of tokens per-user to drop by as much as 50% due to storing extra data about the token (avoiding the need to ask memcache about each token on issuance of a new one).  I still recommend keeping the expires time down to ~3600 seconds or so.  I also recommend doing a flush (like you did before) when deploying this change.</div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br></div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">Let me know if this helps (or if you have any issues with it).  Feel free to respond via email or comment on the review.  Disclaimer: I have not performed functional performance tests on this code, just some initial cleanup and change of logic that should help minimize external calls.</div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br></div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">Cheers,</div><div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color: rgba(0,0,0,1.0); margin: 0px; line-height: auto;">Morgan</div> <div id="bloop_sign_1389489579818742016" class="bloop_sign"><p><strong>—</strong><br>

<strong>Morgan Fainberg</strong><br>

Principal Software Engineer<br>

Core Developer, Keystone<br>

<a href="mailto:m@metacloud.com">m@metacloud.com</a></p></div> <br><p style="color:#A0A0A8;">On January 11, 2014 at 12:50:50, Jonathan Proulx (<a href="mailto://jon@jonproulx.com">jon@jonproulx.com</a>) wrote:</p> <blockquote type="cite" class="clean_bq"><span><div><div>Note about 30min after flushing memcache things started gumming up

<br>again (maybe 15min long tokens instead of 60min?).  Details in this

<br>response reflect the state of my world at about that point.  Recent

flush with things just starting to grind to a halt.

<br>

<br>On Sat, Jan 11, 2014 at 2:46 PM, Jay Pipes <jaypipes@gmail.com> wrote:

<br>> Hi Jonathan! I have not yet deployed Havana to production, however I'll

<br>> add some comments and suggestions below that you may want to try in

> order to isolate root cause...

<br>>

<br>> On Sat, 2014-01-11 at 13:34 -0500, Jonathan Proulx wrote:

<br>>> Hi All,

<br>>>

<br>>> recently upgraded my 1 controller 60 compute node Ubuntu 12.04(+cloud

<br>>> archive) system from Grizlzy to Havana.  Now even before I let my

<br>>> users back to the API I'm barely able to do anything due to

<br>>> authentication time outs.  I am using neutron which like to

<br>>> authenticate *a lot*, I'm not entierly convinced the real problem

<br>>> isn't neutron reauthenticating a bajillion services a second, but

>> right now it looks like Keystone...

<br>>>

>> I'm using UUID Tokens with memcached backend.

<br>>

<br>> Two requests:

<br>>

<br>> 1) Could you paste your keystone.conf pls? I'm curious, particularly,

<br>> what backend you are using for the catalog driver and the identity

<br>> driver.

<br>

<br>Incomplete but I think all the relevant bits,  I can sanitize and past

up the whole thing if you really think that would be helpful.

<br>

<br>[sql]

<br>connection    =

<br>mysql://<keystone_db_user>:<keystone_db_password>@192.168.128.15/keystone

<br>idle_timeout = 200

<br>[memcache]

<br>servers = localhost:11211

<br>[identity]

<br>driver = keystone.identity.backends.sql.Identity

<br>[trust]

<br>driver = keystone.trust.backends.sql.Trust

<br>[catalog]

<br>driver = keystone.catalog.backends.sql.Catalog

<br>[token]

<br>driver = keystone.token.backends.memcache.Token

<br>expiration = 3600

<br>provider=keystone.token.providers.uuid.Provider

<br>caching = False

<br>[cache]

<br>enabled = True

<br>backend = dogpile.cache.pylibmc

<br>backend_argument = url:127.0.0.1

<br>[policy]

<br>driver = keystone.policy.backends.sql.Policy

<br>[ec2]

<br>driver = keystone.contrib.ec2.backends.sql.Ec2

<br>

<br>

<br>> 2) How are you configuring memcache? Please let us know if you've given

<br>> memcache a decent amount of memory to support the token cache. I

<br>> believe, IIRC, our memcache-per-controller was 2GB, but we may have

<br>> increased that to 4GB... can't remember. In any case, it should be a

<br>> decent amount of memory if you have lots of token validation and/or

> revocation calls happening.

<br>

# Use <num> MB memory max to use for object storage.

<br>-m 45876

# Limit the number of simultaneous incoming connections.

<br>-c 8192

# Number of threads to use to process incoming requests.

<br>-t 24

<br>

> Do a little bit of memcache performance investigation for us, as well.

<br>> It will tell us if memcache is having trouble keeping up with constant

<br>> cache flushing...

<br>>

<br>> If you telnet to the memcached instance:

<br>>

<br>> telnet <HOST> 11211

<br>>

<br>> and then type "stats" a few times in a minute. Keep an eye on the number

<br>> returned for "evictions". If that number increases substantially over

<br>> the time you do this mini-profile, it's an indication that memcached is

<br>> having a tough time keeping up with the rate of cache insertions, and is

<br>> looking for least-used blocks to evict in order to find room to put new

<br>> blocks in. This can sometimes be an indication you need to increase the

> amount of memory for memcached.

<br>

<br>Note from previous email I recently flushed the cache, but for now

<br>evictions are holding steady at 0. For other maybe relevant memcache

<br>stats:

<br>

<br>STAT uptime 2173

<br>STAT time 1389471465

<br>STAT version 1.4.13

<br>STAT libevent 2.0.16-stable

<br>STAT pointer_size 64

<br>STAT rusage_user 48.771048

<br>STAT rusage_system 278.953433

<br>STAT curr_connections 51

<br>STAT total_connections 2274

<br>STAT connection_structures 769

<br>STAT reserved_fds 120

<br>STAT cmd_get 4719079

<br>STAT cmd_set 15122

<br>STAT cmd_flush 1

<br>STAT cmd_touch 0

<br>STAT get_hits 4716287

<br>STAT get_misses 2792

<br>...

<br>STAT bytes_read 376435142

<br>STAT bytes_written 14583390698

<br>STAT limit_maxbytes 48104472576

<br>...

<br>STAT bytes 4886925

<br>STAT curr_items 1572

<br>STAT total_items 2394

<br>STAT evictions 0

<br>STAT reclaimed 0

<br>

<br>> OK, so clearly Keystone is configured to use the sql driver for at least

> *something*... likely the identity and/or catalog driver.

<br>>

<br>> If you are using the sql driver for the catalog, I would highly advise

<br>> switching to the templated catalog driver. It's essentially a static

<br>> file that is cached by the driver and used as a string interpolation

<br>> template for the service catalog (used on EVERY single request for a new

<br>> token, since the token includes the service catalog attached to the

<br>> domain/project/user).

<br>

Yes, as you can see both identity and catalog are using sql back end.

<br>I'll take a look at the .templated.TemplatedCatalog, which may help

some, but I don't think it's the whole story (though every bit helps).

<br>

<br>>> I took the next obvious step and put keystone behind apache, while

<br>>> that does get more servers running, performance if anything is even

<br>>> worse while using virtually all of the 12 core controller node's CPUs

>> rather than just one of them.

<br>>>

<br>>> The logs quickly fill with data read timeouts:

<br><snip>

<br>> Hmm. The traceback above really doesn't seem to indict either the token

<br>> nor the catalog backend -- as the IOError is indicating that there is an

<br>> issue in *reading the incoming WSGI request body* :(

<br>>

<br>> That request body is being constructed clearly by one of the

<br>> python-xxxclients -- probably python-neutronclient -- so perhaps it is

<br>> worth doing some python-neutronclient CLI calls using --debug to see if

<br>> you can get an idea on why on Earth webob.Request is having problems

> constructing a request body from the client call. Very odd indeed.

<br>

<br>I missed a very relevant bit in my log fragment:

<br>

<br>Script timed out before returning headers: admin

<br>

<br>This is the root cause of the webob trace back, it is the neutron

<br>client as called form the compute nodes the auth failure is loged in

<br>nova-compute.log as:

<br>

<br>ERROR nova.network.neutronv2 [-] Neutron client authentication failed:

<br>Connection to neutron failed: timed out

<br>

<br>I did manage to catch a CLI debug from a similar failure, long hang followed by:

<br>

<br>jon@x1:~$ neutron --debug net-list

<br>

<br>Traceback (most recent call last):

<br>  File "/usr/lib/python2.7/dist-packages/neutronclient/shell.py", line

<br>517, in run_subcommand

<br>    return run_command(cmd, cmd_parser, sub_argv)

<br>  File "/usr/lib/python2.7/dist-packages/neutronclient/shell.py", line

<br>78, in run_command

<br>    return cmd.run(known_args)

<br>  File "/usr/lib/python2.7/dist-packages/neutronclient/common/command.py",

<br>line 35, in run

<br>    return super(OpenStackCommand, self).run(parsed_args)

<br>  File "/usr/lib/python2.7/dist-packages/cliff/display.py", line 84, in run

<br>    column_names, data = self.take_action(parsed_args)

<br>  File "/usr/lib/python2.7/dist-packages/neutronclient/common/command.py",

<br>line 41, in take_action

<br>    return self.get_data(parsed_args)

<br>  File "/usr/lib/python2.7/dist-packages/neutronclient/neutron/v2_0/__init__.py",

<br>line 586, in get_data

<br>    data = self.retrieve_list(parsed_args)

<br>  File "/usr/lib/python2.7/dist-packages/neutronclient/neutron/v2_0/__init__.py",

<br>line 532, in retrieve_list

<br>    neutron_client = self.get_client()

<br>  File "/usr/lib/python2.7/dist-packages/neutronclient/neutron/v2_0/__init__.py",

<br>line 330, in get_client

<br>    return self.app.client_manager.neutron

<br>  File "/usr/lib/python2.7/dist-packages/neutronclient/common/clientmanager.py",

<br>line 41, in __get__

<br>    self._handle = self.factory(instance)

<br>  File "/usr/lib/python2.7/dist-packages/neutronclient/neutron/client.py",

<br>line 36, in make_client

<br>    instance.initialize()

<br>  File "/usr/lib/python2.7/dist-packages/neutronclient/common/clientmanager.py",

<br>line 92, in initialize

<br>    httpclient.authenticate()

<br>  File "/usr/lib/python2.7/dist-packages/neutronclient/client.py",

<br>line 237, in authenticate

<br>    raise exceptions.Unauthorized(message=body)

<br>Unauthorized:

<br>

<br>Thanks,

<br>-Jon

<br>

<br>_______________________________________________

<br>Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack

<br>Post to     : openstack@lists.openstack.org

<br>Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack

<br></div></div></span></blockquote></body></html>