[openstack-dev] [all] [clients] [keystone] lack of retrying tokens leads to overall OpenStack fragility

Flavio Percoco flavio at redhat.com
Fri Sep 12 09:10:24 UTC 2014


On 09/11/2014 01:44 PM, Sean Dague wrote:
> On 09/10/2014 08:46 PM, Jamie Lennox wrote:
>>
>> ----- Original Message -----
>>> From: "Steven Hardy" <shardy at redhat.com>
>>> To: "OpenStack Development Mailing List (not for usage questions)" <openstack-dev at lists.openstack.org>
>>> Sent: Thursday, September 11, 2014 1:55:49 AM
>>> Subject: Re: [openstack-dev] [all] [clients] [keystone] lack of retrying tokens leads to overall OpenStack fragility
>>>
>>> On Wed, Sep 10, 2014 at 10:14:32AM -0400, Sean Dague wrote:
>>>> Going through the untriaged Nova bugs, and there are a few on a similar
>>>> pattern:
>>>>
>>>> Nova operation in progress.... takes a while
>>>> Crosses keystone token expiration time
>>>> Timeout thrown
>>>> Operation fails
>>>> Terrible 500 error sent back to user
>>>
>>> We actually have this exact problem in Heat, which I'm currently trying to
>>> solve:
>>>
>>> https://bugs.launchpad.net/heat/+bug/1306294
>>>
>>> Can you clarify, is the issue either:
>>>
>>> 1. Create novaclient object with username/password
>>> 2. Do a series of operations via the client object, which eventually fail
>>> after $n operations due to token expiry
>>>
>>> or:
>>>
>>> 1. Create novaclient object with username/password
>>> 2. Some really long operation during which the token expires while the
>>> service is handling the request, blowing up and 500-ing
>>>
>>> If the former, then it does sound like a client (or usage-of-client) bug,
>>> although note that if you pass a *token* instead of username/password (as is
>>> currently done for glance and heat in tempest, because we lack the code to
>>> get the token outside of the shell.py code...), there's nothing the client
>>> can do, because you can't request a new token with a longer expiry using
>>> only a token...
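
(For illustration, a rough sketch of that difference using the keystoneclient
session API Steve describes further down; OS_AUTH_URL, USERNAME, PASSWORD and
PROJECT are the same placeholders as in his example, and EXISTING_TOKEN stands
in for a token obtained elsewhere:)

from keystoneclient.auth.identity import v3
from keystoneclient import session

# A password plugin keeps the credentials around, so it can fetch a
# fresh token whenever the cached one expires (or is about to expire).
password_auth = v3.Password(auth_url=OS_AUTH_URL,
                            username=USERNAME,
                            password=PASSWORD,
                            project_id=PROJECT,
                            user_domain_name='default')

# A token plugin only holds the token itself; once that token expires
# there is nothing left to reauthenticate with.
token_auth = v3.Token(auth_url=OS_AUTH_URL, token=EXISTING_TOKEN)

sess = session.Session(auth=password_auth)   # can recover from expiry
# sess = session.Session(auth=token_auth)    # cannot recover from expiry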
>>>
>>> However, if the latter, then it doesn't really seem like a problem for the
>>> client to solve, as it's hard to know what action to take if a request
>>> failed part-way through and left things in an unknown state.
>>>
>>> This is a hard problem. It can possibly be solved by switching to a
>>> trust-scoped token (the service impersonates the user), but then you're
>>> effectively bypassing token expiry via delegation, which sits uncomfortably
>>> with me (despite the fact that we may have to do this in heat to solve the
>>> aforementioned bug).
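
(For reference, my rough understanding of what the trust approach would look
like with keystoneclient's v3 trusts API; USER_ID, SERVICE_USER_ID,
SERVICE_USERNAME, SERVICE_PASSWORD and the 'Member' role below are just
placeholders:)

from keystoneclient.auth.identity import v3
from keystoneclient import session
from keystoneclient.v3 import client

# The user (trustor) delegates a role on the project to the service
# user (trustee), optionally allowing impersonation.
user_auth = v3.Password(auth_url=OS_AUTH_URL,
                        username=USERNAME,
                        password=PASSWORD,
                        project_id=PROJECT,
                        user_domain_name='default')
ks = client.Client(session=session.Session(auth=user_auth))
trust = ks.trusts.create(trustor_user=USER_ID,
                         trustee_user=SERVICE_USER_ID,
                         project=PROJECT,
                         role_names=['Member'],
                         impersonation=True)

# Later, the service authenticates as itself but scoped to the trust,
# so it can keep getting fresh tokens on the user's behalf even after
# the user's own token has expired.
service_auth = v3.Password(auth_url=OS_AUTH_URL,
                           username=SERVICE_USERNAME,
                           password=SERVICE_PASSWORD,
                           user_domain_name='default',
                           trust_id=trust.id)
service_sess = session.Session(auth=service_auth)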
>>>
>>>> It seems like we should have a standard pattern where, on token expiration,
>>>> the underlying code at least retries once to establish a new token and
>>>> complete the flow; however, as far as I can tell *no* clients do this.
>>>
>>> As has been mentioned, using sessions may be one solution to this, and
>>> AFAIK session support (where it doesn't already exist) is getting into
>>> various clients via the work being carried out to add support for v3
>>> keystone by David Hu:
>>>
>>> https://review.openstack.org/#/q/owner:david.hu%2540hp.com,n,z
>>>
>>> I see patches for Heat (currently gating), Nova and Ironic.
>>>
>>>> I know we had to add that into Tempest because tempest runs can exceed 1
>>>> hr, and we want to avoid random fails just because we cross a token
>>>> expiration boundary.
>>>
>>> I can't claim great experience with sessions yet, but AIUI you could do
>>> something like:
>>>
>>> from keystoneclient.auth.identity import v3
>>> from keystoneclient import session
>>> from keystoneclient.v3 import client
>>>
>>> auth = v3.Password(auth_url=OS_AUTH_URL,
>>>                    username=USERNAME,
>>>                    password=PASSWORD,
>>>                    project_id=PROJECT,
>>>                    user_domain_name='default')
>>> sess = session.Session(auth=auth)
>>> ks = client.Client(session=sess)
>>>
>>> And if you can pass the same session into the various clients tempest
>>> creates, then the Password auth-plugin code takes care of reauthenticating
>>> if the token cached in the auth plugin object is expired, or nearly
>>> expired:
>>>
>>> https://github.com/openstack/python-keystoneclient/blob/master/keystoneclient/auth/identity/base.py#L120
>>>
>>> So in the tempest case, it seems like it may just be a matter of migrating
>>> the code that creates the clients to use sessions, instead of passing a
>>> token or username/password into the client object?
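
(If so, the nice part seems to be that sharing a session is just a matter of
passing it into each client, roughly as in the sketch below; how far the
session support has landed of course varies per client:)

from keystoneclient.auth.identity import v3
from keystoneclient import session
from keystoneclient.v3 import client as ks_client
from novaclient import client as nova_client

auth = v3.Password(auth_url=OS_AUTH_URL,
                   username=USERNAME,
                   password=PASSWORD,
                   project_id=PROJECT,
                   user_domain_name='default')
sess = session.Session(auth=auth)

# Both clients share the one session, so the Password plugin can fetch
# a new token transparently whenever the cached one expires.
keystone = ks_client.Client(session=sess)
nova = nova_client.Client('2', session=sess)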
>>>
>>> That's my understanding of it atm anyway, hopefully jamielennox will be along
>>> soon with more details :)
>>>
>>> Steve
>>
>>
>> By clients here are you referring to the CLIs or the python libraries? Implementation is at a different point for each. 
>>
>> Sessions will handle automatically reauthenticating and retrying a request; however, this relies on the service returning a 401 Unauthorized error. If a service is returning a 500 (or a timeout?), then there isn't much that a client can or should do, because we can't assume that trying again with a new token will solve anything. 
>>
>> At the moment we have keystoneclient, novaclient, cinderclient, neutronclient and then a number of the smaller projects with support for sessions. That obviously doesn't mean that existing users of that code have transitioned to the newer way yet, though. David Hu has been working on using this code within the existing CLIs. I have prototypes for at least nova talking to neutron and cinder, which I'm waiting for Kilo to push. From there it should be easier to do this for other services. 
>>
>> For service-to-service communication there are two types:
>> 1) Using the user's token, like nova->cinder. If this token expires, there is really nothing that nova can do except return a 401 and make the client try again. 
> 
> In this case it would be really good to do at least one retry, because
> it's completely silly for us to fail an action based on a token timeout.
> The workaround ops are resorting to is bumping their token expiration back
> up to some really large number.
> 
>> 2) Using a service user, like nova->neutron. This should allow automatic reauthentication and will be fixed/standardized by sessions. 
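
(A rough sketch of what the service-user case could look like with a
session-aware neutronclient; the 'nova' service credentials below are only
placeholders:)

from keystoneclient.auth.identity import v3
from keystoneclient import session
from neutronclient.v2_0 import client as neutron_client

# Nova authenticates as its own service user instead of reusing the
# (possibly expired) user token, so the auth plugin can always fetch a
# fresh token when needed.
service_auth = v3.Password(auth_url=OS_AUTH_URL,
                           username='nova',
                           password=NOVA_SERVICE_PASSWORD,
                           project_name='service',
                           user_domain_name='default',
                           project_domain_name='default')
sess = session.Session(auth=service_auth)
neutron = neutron_client.Client(session=sess)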
> 
> OK, glanceclient should be a high-priority target here, because it's often
> involved in long-running operations (snapshot manipulation is slow).

Agreed. I started looking at this a couple of weeks ago, but I'm still
not sure what the best way to do it is. The failure is common when
uploading huge images, and I also agree that at least one retry should be
attempted.
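
Something along these lines is what I have in mind, though it's only a sketch:
a retry-once wrapper that throws away the cached token and re-runs the
operation on an auth failure. The exact exception to catch may differ per
client, and whether blindly re-running a partially-completed upload is safe is
exactly the open question.

from keystoneclient import exceptions as ks_exceptions


def call_with_reauth(sess, operation, *args, **kwargs):
    """Run operation once, retrying a single time if the token expired.

    sess is a keystoneclient Session and operation is any callable that
    uses it (e.g. the image upload call). Only a sketch: for
    non-idempotent or partially-completed operations a blind retry may
    not be the right thing to do.
    """
    try:
        return operation(*args, **kwargs)
    except ks_exceptions.Unauthorized:
        # Drop the cached token so the auth plugin fetches a new one,
        # then try exactly once more.
        sess.invalidate()
        return operation(*args, **kwargs)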


Flavio


-- 
@flaper87
Flavio Percoco


