[Openstack] [Keystone] performance issues after havana upgrade

Jonathan Proulx jon at jonproulx.com
Sat Jan 11 20:39:46 UTC 2014


Note: about 30min after flushing memcache things started gumming up
again (maybe with 15min-long tokens instead of 60min?).  The details in
this response reflect the state of my world at about that point: a
recent flush, with things just starting to grind to a halt.

On Sat, Jan 11, 2014 at 2:46 PM, Jay Pipes <jaypipes at gmail.com> wrote:
> Hi Jonathan! I have not yet deployed Havana to production, however I'll
> add some comments and suggestions below that you may want to try in
> order to isolate root cause...
>
> On Sat, 2014-01-11 at 13:34 -0500, Jonathan Proulx wrote:
>> Hi All,
>>
>> I recently upgraded my 1-controller, 60-compute-node Ubuntu 12.04
>> (+cloud archive) system from Grizzly to Havana.  Now, even before I
>> let my users back to the API, I'm barely able to do anything due to
>> authentication timeouts.  I am using neutron, which likes to
>> authenticate *a lot*; I'm not entirely convinced the real problem
>> isn't neutron reauthenticating a bajillion times a second, but
>> right now it looks like Keystone...
>>
>> I'm using UUID Tokens with memcached backend.
>
> Two requests:
>
> 1) Could you paste your keystone.conf pls? I'm curious, particularly,
> what backend you are using for the catalog driver and the identity
> driver.

This is incomplete, but I think it covers all the relevant bits.  I can
sanitize and paste up the whole thing if you really think that would be
helpful.

[sql]
connection = mysql://<keystone_db_user>:<keystone_db_password>@192.168.128.15/keystone
idle_timeout = 200
[memcache]
servers = localhost:11211
[identity]
driver = keystone.identity.backends.sql.Identity
[trust]
driver = keystone.trust.backends.sql.Trust
[catalog]
driver = keystone.catalog.backends.sql.Catalog
[token]
driver = keystone.token.backends.memcache.Token
expiration = 3600
provider=keystone.token.providers.uuid.Provider
caching = False
[cache]
enabled = True
backend = dogpile.cache.pylibmc
backend_argument = url:127.0.0.1
[policy]
driver = keystone.policy.backends.sql.Policy
[ec2]
driver = keystone.contrib.ec2.backends.sql.Ec2


> 2) How are you configuring memcache? Please let us know if you've given
> memcache a decent amount of memory to support the token cache. I
> believe, IIRC, our memcache-per-controller was 2GB, but we may have
> increased that to 4GB... can't remember. In any case, it should be a
> decent amount of memory if you have lots of token validation and/or
> revocation calls happening.

# Use <num> MB memory max to use for object storage.
-m 45876
# Limit the number of simultaneous incoming connections.
-c 8192
# Number of threads to use to process incoming requests.
-t 24
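
(For what it's worth, 45876 MB * 1048576 bytes/MB = 48104472576 bytes,
which matches the limit_maxbytes in the stats below, so the -m limit is
definitely in effect and there's plenty of headroom.)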

> Do a little bit of memcache performance investigation for us, as well.
> It will tell us if memcache is having trouble keeping up with constant
> cache flushing...
>
> If you telnet to the memcached instance:
>
> telnet <HOST> 11211
>
> and then type "stats" a few times in a minute. Keep an eye on the number
> returned for "evictions". If that number increases substantially over
> the time you do this mini-profile, it's an indication that memcached is
> having a tough time keeping up with the rate of cache insertions, and is
> looking for least-used blocks to evict in order to find room to put new
> blocks in. This can sometimes be an indication you need to increase the
> amount of memory for memcached.

As noted in the previous email, I recently flushed the cache, so for
now evictions are holding steady at 0.  Some other possibly relevant
memcache stats:

STAT uptime 2173
STAT time 1389471465
STAT version 1.4.13
STAT libevent 2.0.16-stable
STAT pointer_size 64
STAT rusage_user 48.771048
STAT rusage_system 278.953433
STAT curr_connections 51
STAT total_connections 2274
STAT connection_structures 769
STAT reserved_fds 120
STAT cmd_get 4719079
STAT cmd_set 15122
STAT cmd_flush 1
STAT cmd_touch 0
STAT get_hits 4716287
STAT get_misses 2792
...
STAT bytes_read 376435142
STAT bytes_written 14583390698
STAT limit_maxbytes 48104472576
...
STAT bytes 4886925
STAT curr_items 1572
STAT total_items 2394
STAT evictions 0
STAT reclaimed 0
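
For anyone who wants to watch those counters over time without sitting
in a telnet session, something along these lines should do it (a rough
sketch only; it assumes memcached is on localhost:11211 and just tracks
the counters Jay called out):

#!/usr/bin/env python
# Rough sketch: poll memcached "stats" every 10s and print deltas for a
# few counters (evictions, items, hits/misses).  Host/port are
# assumptions -- adjust for your setup.
import socket
import time

HOST, PORT = 'localhost', 11211
WATCH = ('evictions', 'curr_items', 'get_hits', 'get_misses', 'bytes')

def get_stats(host=HOST, port=PORT):
    s = socket.create_connection((host, port), timeout=5)
    s.sendall(b'stats\r\n')
    buf = b''
    # The stats response is terminated by an END line.
    while not buf.endswith(b'END\r\n'):
        chunk = s.recv(4096)
        if not chunk:
            break
        buf += chunk
    s.close()
    return dict(line.split()[1:3] for line in buf.decode().splitlines()
                if line.startswith('STAT'))

prev = get_stats()
while True:
    time.sleep(10)
    cur = get_stats()
    print(', '.join('%s=%s (+%d)' % (k, cur[k], int(cur[k]) - int(prev[k]))
                    for k in WATCH if k in cur))
    prev = cur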

> OK, so clearly Keystone is configured to use the sql driver for at least
> *something*... likely the identity and/or catalog driver.
>
> If you are using the sql driver for the catalog, I would highly advise
> switching to the templated catalog driver. It's essentially a static
> file that is cached by the driver and used as a string interpolation
> template for the service catalog (used on EVERY single request for a new
> token, since the token includes the service catalog attached to the
> domain/project/user).

Yes, as you can see, both identity and catalog are using the sql
backend.  I'll take a look at the .templated.TemplatedCatalog driver,
which may help some, but I don't think it's the whole story (though
every bit helps).
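
For reference, my understanding is the switch would look roughly like
this (a sketch only; I haven't verified these option names against
Havana yet, and the endpoints are placeholders rather than my real
ones):

[catalog]
driver = keystone.catalog.backends.templated.TemplatedCatalog
template_file = /etc/keystone/default_catalog.templates

with /etc/keystone/default_catalog.templates containing one block of
string-interpolation entries per service, e.g.:

catalog.RegionOne.identity.publicURL = http://192.168.128.15:5000/v2.0
catalog.RegionOne.identity.adminURL = http://192.168.128.15:35357/v2.0
catalog.RegionOne.identity.internalURL = http://192.168.128.15:5000/v2.0
catalog.RegionOne.identity.name = Identity Service

catalog.RegionOne.network.publicURL = http://192.168.128.15:9696
catalog.RegionOne.network.adminURL = http://192.168.128.15:9696
catalog.RegionOne.network.internalURL = http://192.168.128.15:9696
catalog.RegionOne.network.name = Network Service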

>> I took the next obvious step and put keystone behind apache, while
>> that does get more servers running, performance if anything is even
>> worse while using virtually all of the 12 core controller node's CPUs
>> rather than just one of them.
>>
>> The logs quickly fill with data read timeouts:
<snip>
> Hmm. The traceback above really doesn't seem to indict either the token
> nor the catalog backend -- as the IOError is indicating that there is an
> issue in *reading the incoming WSGI request body* :(
>
> That request body is being constructed clearly by one of the
> python-xxxclients -- probably python-neutronclient -- so perhaps it is
> worth doing some python-neutronclient CLI calls using --debug to see if
> you can get an idea on why on Earth webob.Request is having problems
> constructing a request body from the client call. Very odd indeed.

I missed a very relevant bit in my log fragment:

Script timed out before returning headers: admin

This is the root cause of the webob traceback.  It comes from the
neutron client as called from the compute nodes; the auth failure is
logged in nova-compute.log as:

ERROR nova.network.neutronv2 [-] Neutron client authentication failed:
Connection to neutron failed: timed out

I did manage to catch a CLI debug from a similar failure: a long hang
followed by:

jon at x1:~$ neutron --debug net-list

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/neutronclient/shell.py", line 517, in run_subcommand
    return run_command(cmd, cmd_parser, sub_argv)
  File "/usr/lib/python2.7/dist-packages/neutronclient/shell.py", line 78, in run_command
    return cmd.run(known_args)
  File "/usr/lib/python2.7/dist-packages/neutronclient/common/command.py", line 35, in run
    return super(OpenStackCommand, self).run(parsed_args)
  File "/usr/lib/python2.7/dist-packages/cliff/display.py", line 84, in run
    column_names, data = self.take_action(parsed_args)
  File "/usr/lib/python2.7/dist-packages/neutronclient/common/command.py", line 41, in take_action
    return self.get_data(parsed_args)
  File "/usr/lib/python2.7/dist-packages/neutronclient/neutron/v2_0/__init__.py", line 586, in get_data
    data = self.retrieve_list(parsed_args)
  File "/usr/lib/python2.7/dist-packages/neutronclient/neutron/v2_0/__init__.py", line 532, in retrieve_list
    neutron_client = self.get_client()
  File "/usr/lib/python2.7/dist-packages/neutronclient/neutron/v2_0/__init__.py", line 330, in get_client
    return self.app.client_manager.neutron
  File "/usr/lib/python2.7/dist-packages/neutronclient/common/clientmanager.py", line 41, in __get__
    self._handle = self.factory(instance)
  File "/usr/lib/python2.7/dist-packages/neutronclient/neutron/client.py", line 36, in make_client
    instance.initialize()
  File "/usr/lib/python2.7/dist-packages/neutronclient/common/clientmanager.py", line 92, in initialize
    httpclient.authenticate()
  File "/usr/lib/python2.7/dist-packages/neutronclient/client.py", line 237, in authenticate
    raise exceptions.Unauthorized(message=body)
Unauthorized:
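
In case it helps anyone reproduce this outside of the neutron client,
here is roughly what I'd use to time a raw Keystone token request
against the controller (a sketch only; the endpoint URL and the
credentials are placeholders, not my real ones):

# Rough sketch: time a Keystone v2.0 token request directly, to
# separate Keystone latency from neutron client behaviour.
# URL and credentials below are placeholders.
import json
import time
import urllib2

KEYSTONE_URL = 'http://192.168.128.15:5000/v2.0/tokens'
body = json.dumps({'auth': {
    'tenantName': '<tenant>',
    'passwordCredentials': {'username': '<user>',
                            'password': '<password>'}}})

req = urllib2.Request(KEYSTONE_URL, body,
                      {'Content-Type': 'application/json'})
start = time.time()
token = json.load(urllib2.urlopen(req, timeout=120))
print('got token %s... in %.2fs'
      % (token['access']['token']['id'][:8], time.time() - start))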

Thanks,
-Jon



