[Openstack] [Keystone] performance issues after havana upgrade

Morgan Fainberg m at metacloud.com
Sun Jan 12 01:24:10 UTC 2014


Hi Jon,

I have published a patch set that I hope will help to address this issue: https://review.openstack.org/#/c/66149/ . If you need this in another format, please let me know.

The only caveat is that you should expect the maximum number of tokens per user to drop by as much as 50%, because extra data is now stored with each token (avoiding the need to ask memcache about every existing token when a new one is issued).  I still recommend keeping the token expiration down to ~3600 seconds or so.  I also recommend doing a flush (like you did before) when deploying this change.
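
For reference, a minimal sketch of the expiration setting and the deploy-time flush, assuming memcached is listening on localhost:11211 and the nc utility is available (adjust host, port, and tool to your setup):

[token]
driver = keystone.token.backends.memcache.Token
expiration = 3600

# one-off flush of the token cache when rolling out the change
# (some netcat variants need flags other than -q)
echo flush_all | nc -q 1 localhost 11211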

Let me know if this helps (or if you have any issues with it).  Feel free to respond via email or comment on the review.  Disclaimer: I have not performed functional performance tests on this code, just some initial cleanup and change of logic that should help minimize external calls.

Cheers,
Morgan
—
Morgan Fainberg
Principal Software Engineer
Core Developer, Keystone
m at metacloud.com


On January 11, 2014 at 12:50:50, Jonathan Proulx (jon at jonproulx.com) wrote:

Note: about 30 min after flushing memcache, things started gumming up
again (maybe 15-minute tokens instead of 60-minute?). Details in this
response reflect the state of my world at about that point: a recent
flush, with things just starting to grind to a halt.

On Sat, Jan 11, 2014 at 2:46 PM, Jay Pipes <jaypipes at gmail.com> wrote:  
> Hi Jonathan! I have not yet deployed Havana to production, however I'll  
> add some comments and suggestions below that you may want to try in  
> order to isolate root cause...  
>  
> On Sat, 2014-01-11 at 13:34 -0500, Jonathan Proulx wrote:  
>> Hi All,  
>>  
>> recently upgraded my 1-controller, 60-compute-node Ubuntu 12.04 (+cloud
>> archive) system from Grizzly to Havana. Now, even before I let my
>> users back to the API, I'm barely able to do anything due to
>> authentication timeouts. I am using neutron, which likes to
>> authenticate *a lot*; I'm not entirely convinced the real problem
>> isn't neutron re-authenticating a bajillion times a second, but
>> right now it looks like Keystone...
>>  
>> I'm using UUID Tokens with memcached backend.  
>  
> Two requests:  
>  
> 1) Could you paste your keystone.conf pls? I'm curious, particularly,  
> what backend you are using for the catalog driver and the identity  
> driver.  

This is incomplete, but I think it has all the relevant bits; I can sanitize and paste
up the whole thing if you really think that would be helpful.

[sql]  
connection = mysql://<keystone_db_user>:<keystone_db_password>@192.168.128.15/keystone
idle_timeout = 200  
[memcache]  
servers = localhost:11211  
[identity]  
driver = keystone.identity.backends.sql.Identity  
[trust]  
driver = keystone.trust.backends.sql.Trust  
[catalog]  
driver = keystone.catalog.backends.sql.Catalog  
[token]  
driver = keystone.token.backends.memcache.Token  
expiration = 3600  
provider=keystone.token.providers.uuid.Provider  
caching = False  
[cache]  
enabled = True  
backend = dogpile.cache.pylibmc  
backend_argument = url:127.0.0.1  
[policy]  
driver = keystone.policy.backends.sql.Policy  
[ec2]  
driver = keystone.contrib.ec2.backends.sql.Ec2  


> 2) How are you configuring memcache? Please let us know if you've given  
> memcache a decent amount of memory to support the token cache. I  
> believe, IIRC, our memcache-per-controller was 2GB, but we may have  
> increased that to 4GB... can't remember. In any case, it should be a  
> decent amount of memory if you have lots of token validation and/or  
> revocation calls happening.  

# Use <num> MB memory max to use for object storage.  
-m 45876  
# Limit the number of simultaneous incoming connections.  
-c 8192  
# Number of threads to use to process incoming requests.  
-t 24  

> Do a little bit of memcache performance investigation for us, as well.  
> It will tell us if memcache is having trouble keeping up with constant  
> cache flushing...  
>  
> If you telnet to the memcached instance:  
>  
> telnet <HOST> 11211  
>  
> and then type "stats" a few times in a minute. Keep an eye on the number  
> returned for "evictions". If that number increases substantially over  
> the time you do this mini-profile, it's an indication that memcached is  
> having a tough time keeping up with the rate of cache insertions, and is  
> looking for least-used blocks to evict in order to find room to put new  
> blocks in. This can sometimes be an indication you need to increase the  
> amount of memory for memcached.  

Note, per my previous email, that I recently flushed the cache, but for now
evictions are holding steady at 0. Other possibly relevant memcache
stats:

STAT uptime 2173  
STAT time 1389471465  
STAT version 1.4.13  
STAT libevent 2.0.16-stable  
STAT pointer_size 64  
STAT rusage_user 48.771048  
STAT rusage_system 278.953433  
STAT curr_connections 51  
STAT total_connections 2274  
STAT connection_structures 769  
STAT reserved_fds 120  
STAT cmd_get 4719079  
STAT cmd_set 15122  
STAT cmd_flush 1  
STAT cmd_touch 0  
STAT get_hits 4716287  
STAT get_misses 2792  
...  
STAT bytes_read 376435142  
STAT bytes_written 14583390698  
STAT limit_maxbytes 48104472576  
...  
STAT bytes 4886925  
STAT curr_items 1572  
STAT total_items 2394  
STAT evictions 0  
STAT reclaimed 0  
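
For ongoing monitoring, the "stats" check described above can be put in a loop; a rough sketch, again assuming the nc utility and memcached on localhost:11211:

# print the eviction and item counters every 10 seconds
while true; do
    echo stats | nc -q 1 localhost 11211 | grep -E 'evictions|curr_items|bytes_read'
    sleep 10
done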

> OK, so clearly Keystone is configured to use the sql driver for at least  
> *something*... likely the identity and/or catalog driver.  
>  
> If you are using the sql driver for the catalog, I would highly advise  
> switching to the templated catalog driver. It's essentially a static  
> file that is cached by the driver and used as a string interpolation  
> template for the service catalog (used on EVERY single request for a new  
> token, since the token includes the service catalog attached to the  
> domain/project/user).  

Yes, as you can see, both identity and catalog are using the sql back end.
I'll take a look at the .templated.TemplatedCatalog, which may help
some, but I don't think it's the whole story (though every bit helps).
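
For reference, the switch would look something like this; a sketch only, using the stock Havana class name and template path, with the controller address from the [sql] section above purely as a placeholder for the real endpoint URLs:

[catalog]
# static, file-backed catalog; avoids SQL queries on every token grant
driver = keystone.catalog.backends.templated.TemplatedCatalog
template_file = /etc/keystone/default_catalog.templates

The template file is a flat list of catalog.<region>.<service>.<field> entries, for example:

catalog.RegionOne.identity.publicURL = http://192.168.128.15:5000/v2.0
catalog.RegionOne.identity.adminURL = http://192.168.128.15:35357/v2.0
catalog.RegionOne.identity.internalURL = http://192.168.128.15:5000/v2.0
catalog.RegionOne.identity.name = Identity Service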

>> I took the next obvious step and put keystone behind apache. While
>> that does get more server processes running, performance is, if anything,
>> even worse, while using virtually all of the 12-core controller node's
>> CPUs rather than just one of them.
>>  
>> The logs quickly fill with data read timeouts:  
<snip>  
> Hmm. The traceback above really doesn't seem to indict either the token
> or the catalog backend -- as the IOError indicates that there is an
> issue in *reading the incoming WSGI request body* :(
>  
> That request body is being constructed clearly by one of the  
> python-xxxclients -- probably python-neutronclient -- so perhaps it is  
> worth doing some python-neutronclient CLI calls using --debug to see if  
> you can get an idea on why on Earth webob.Request is having problems  
> constructing a request body from the client call. Very odd indeed.  

I missed a very relevant bit in my log fragment:

Script timed out before returning headers: admin

This is the root cause of the webob traceback. The caller is the neutron
client as invoked from the compute nodes; the auth failure is logged in
nova-compute.log as:

ERROR nova.network.neutronv2 [-] Neutron client authentication failed:
Connection to neutron failed: timed out

I did manage to catch a CLI debug trace from a similar failure: a long hang followed by:

jon@x1:~$ neutron --debug net-list

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/neutronclient/shell.py", line 517, in run_subcommand
    return run_command(cmd, cmd_parser, sub_argv)
  File "/usr/lib/python2.7/dist-packages/neutronclient/shell.py", line 78, in run_command
    return cmd.run(known_args)
  File "/usr/lib/python2.7/dist-packages/neutronclient/common/command.py", line 35, in run
    return super(OpenStackCommand, self).run(parsed_args)
  File "/usr/lib/python2.7/dist-packages/cliff/display.py", line 84, in run
    column_names, data = self.take_action(parsed_args)
  File "/usr/lib/python2.7/dist-packages/neutronclient/common/command.py", line 41, in take_action
    return self.get_data(parsed_args)
  File "/usr/lib/python2.7/dist-packages/neutronclient/neutron/v2_0/__init__.py", line 586, in get_data
    data = self.retrieve_list(parsed_args)
  File "/usr/lib/python2.7/dist-packages/neutronclient/neutron/v2_0/__init__.py", line 532, in retrieve_list
    neutron_client = self.get_client()
  File "/usr/lib/python2.7/dist-packages/neutronclient/neutron/v2_0/__init__.py", line 330, in get_client
    return self.app.client_manager.neutron
  File "/usr/lib/python2.7/dist-packages/neutronclient/common/clientmanager.py", line 41, in __get__
    self._handle = self.factory(instance)
  File "/usr/lib/python2.7/dist-packages/neutronclient/neutron/client.py", line 36, in make_client
    instance.initialize()
  File "/usr/lib/python2.7/dist-packages/neutronclient/common/clientmanager.py", line 92, in initialize
    httpclient.authenticate()
  File "/usr/lib/python2.7/dist-packages/neutronclient/client.py", line 237, in authenticate
    raise exceptions.Unauthorized(message=body)
Unauthorized:

Thanks,  
-Jon  
