[openstack-dev] [neutron][oslo.db] db retry madness

Michael Bayer mbayer at redhat.com
Fri Sep 8 14:20:54 UTC 2017


On Fri, Sep 8, 2017 at 8:53 AM, Kevin Benton <kevin at benton.pub> wrote:
> Thanks for asking this. It's good to get this documented on the mailing list
> since most of it is spread across bug reports and commits over the last
> couple of years.
>
>>1. since the retry_if_session_inactive decorator assumes functions that
>> receive a "context", what does it mean that networking_odl uses functions
>> that receive a "session" and not a "context" ?   Should networking_odl be
>> modified such that it accepts the same "context" that Neutron passes around?
>>
>>2. Alternatively, should the retry_if_session_inactive decorator be
>> modified (or a new decorator added) that provides the identical behavior for
>> functions that receive a "session" argument rather than a "context" argument
>> ?  (or should this be done anyway even if networking_odl's pattern is
>> legacy?)
>
> So it really depends on how those particular functions are being called. For
> example, it seems like this function in the review is always going to be
> called within an active session:
> https://review.openstack.org/#/c/500584/3/networking_odl/db/db.py@113  If
> that's the case then the decorator is never going to trigger a retry since
> is_active will be True.

>
> Since the goal of that patch is to deal with deadlocks, the retry can't
> happen down at that level, it will need to be somewhere further up the call
> stack since the deadlock will put the session in a rollback state.


OK when I was talking to Michel about this, I first mentioned that to
get retry within this in-a-transaction block would require adding a
savepoint to the mix, but he seemed to indicate that the method can
also run by itself, but since there's no session.begin() there then
that matches with my initial impression that the retry has to be at
the level which the transaction starts, and that's not here.



>
> For the functions being called outside of an active session, they should
> switch to accepting a context to handle the enginefacade switch.
>
>>  3a.  Is this a temporary measure to work around legacy patterns?
>
> Yes. When that went in we had very little adoption of the enginefacade. Even
> now we haven't fully completed the transition so checking for is_active
> covers the old code paths.


OK, so this is where they are still calling session.begin()
themselves.    Are all these session.begin() calls in projects under
the "openstack" umbrella so I can use codesearch ?



>
> As we continue migrating to the enginefacade, the cases where you can get a
> reference to a session that actually has 'is_active==False' are going to
> keep dwindling. Once we complete the switch and remove the legacy session
> constructor, the decorator will just be adjusted to check if there is a
> session attached to the context to determine if retries are safe since
> is_active will always be True.

neat

>
>> 3b.  If 3a., is the pattern in use by networking_odl, receiving a
>> "session" and not "context", the specific legacy pattern in question?
>
> If those functions are called outside of an active session, then yes. Based
> on the session.begin() call without substransactions=True, I'm guessing
> that's the case for at least some of them:
> https://review.openstack.org/#/c/500584/3/networking_odl/db/db.py@85

Is there guidance written down somewhere I can point plugin teams towards ?


>
>>3c.  if 3a and 3b, are we expecting all the various plugins to start using
>> "context" at some point ?  (and when they come to me with breakage, I can
>> push them in that direction?)
>
> Yeah, for any functions that expect to start their own transaction they
> should altered to accept contexts and use the reader/writer context
> managers. If it's a function that is called from within an ongoing
> transaction, it can obviously still accept a session argument but the
> 'retry_if_session_inactive' decorator would obviously be irrelevant in that
> scenario since the session would always be active.
>
>> 4a.  Is it strictly that simple mutable lists and dicts of simple types
>> (ints, strings, etc.) are preserved across retries, assuming the receiving
>> functions might be modifying them?
>
> Yes, this was to deal with functions that would mutate collections passed
> in. In particular, we have functions that alter the parsed API request as
> they process them. If they hit a deadlock on commit and we retry at that
> point the API request would be mangled without the copy. (e.g.
> https://bugs.launchpad.net/neutron/+bug/1584920)
>
>>4b. Does this deepcopy() approach have any support for functions that are
>> being passed not just simple structures but ORM-mapped objects?   Was this
>> use case considered?
>
> No, it doesn't support ORM objects. Given that the target use case of the
> retry decorator was for functions not in an ongoing session, we didn't have
> a pattern to support in the main tree where an ORM object would be passed as
> an argument to the function that needed to be protected.

right, makes sense, and also that is still how it's going to be here
because you have to be at the top of the transaction (unless using
SAVEPOINT).


>
>>4c. Does Neutron's model base have any support for deepcopy()?  I notice
>> that the base classes in model_base do **not** seem to have a __deepcopy__
>> present.  This would mean that this deepcopy() will definitely break any ORM
>> mapped objects because you need to use an approach like what Nova uses for
>> copy():
>> https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/models.py#L46
>> . This would be why the developer's objects are all getting kicked
> out of the session.   (this is the "is this a bug?" part)
>
> That's definitely why the objects are being kicked out of the session. I'm
> not sure if it makes sense to attempt to support copying an ORM object when
> the decorator is meant for functions starting a completely new session. If
> there is a use case I'm missing that we do want to support, it probably
> makes more sense to use a different decorator than to adjust the existing
> one.

you can put a __copy__() and __deepcopy__() on the base that raises an
exception with an informative message, perhaps.    (maybe this should
be an oslo.db thing).


>
>>4d. How frequent is the problem of mutable lists and dicts, and should this
>> complex and actually pretty expensive behavior be optional, only for those
>> functions that are known to receieve mutable structures and change them in
>> place?
>
> It's difficult to tell. The issue was that the retry logic was added much
> later to the API level so we needed a generically safe solution to protect
> every possible API operation. It's copying things passed into the API
> request so they are usually dictionaries with fewer than 10 keys and values
> with small strings or lists of only a few entries, making the copy
> operations cheap.
>
> If we did want to get rid of the copy operation, it would probably be
> quicker to add a new decorator that doesn't copy and then slowly opt into
> for each function that we validate as 'side-effect free' on the inputs.

+1


>
>
> I hope that makes sense. Let me know if you have any other questions.

That covers it and also I should have put together the "retry is
outside the transaction == there are no ORM objects at that level"
thing.

>
> Cheers,
> Kevin Benton
>
> On Thu, Sep 7, 2017 at 8:11 AM, Michael Bayer <mbayer at redhat.com> wrote:
>>
>> I'm trying to get a handle on neutron's retry decorators.    I'm
>> assisting a developer in the networking_odl project who wishes to make
>> use of neutron's retry_if_session_inactive decorator.
>>
>> The developer has run into several problems, including that
>> networking_odl methods here seem to accept a "session" directly,
>> rather than the "context" as is standard throughout neutron, and
>> retry_if_session_inactive assumes the context calling pattern.
>> Additionally, the retry_db_errors() decorator which is also used by
>> retry_if_session_inactive is doing a deepcopy operation on arguments,
>> which when applied to ORM-mapped entities, makes a copy of them which
>> are then detached from the Session and they fail to function beyond
>> that.
>>
>> For context, the developer's patch is at:
>> https://review.openstack.org/#/c/500584/3
>>
>> and the neutron features I'm looking at can be seen when they were
>> reviewed at: https://review.openstack.org/#/c/356530/22
>>
>>
>> So I have a bunch of questions, many are asking the same things: "is
>> networking_odl's pattern legacy or not?" and "is this retry decorator
>> buggy?"   There's overlap in these questions.  But "yes" / "no" to
>> each would still help me figure out that this is a giraffe and not a
>> duck :) ("does it have a long neck?" "does it quack?" etc)
>>
>> 1. since the retry_if_session_inactive decorator assumes functions
>> that receive a "context", what does it mean that networking_odl uses
>> functions that receive a "session" and not a "context" ?   Should
>> networking_odl be modified such that it accepts the same "context"
>> that Neutron passes around?
>>
>> 2. Alternatively, should the retry_if_session_inactive decorator be
>> modified (or a new decorator added) that provides the identical
>> behavior for functions that receive a "session" argument rather than a
>> "context" argument ?  (or should this be done anyway even if
>> networking_odl's pattern is legacy?)
>>
>> 3. I notice that the session.is_active flag is considered to be
>> meaningful.   Normally, this flag is of no use, as the enginefacade
>> pattern only provides for a Session that has had the begin() method
>> called.  However, it looks like in neutron_lib/context.py, the Context
>> class is providing for a default session
>> "db_api.get_writer_session()", which I would assume is how we get a
>> Session with is_active is false:
>>
>>     3a.  Is this a temporary measure to work around legacy patterns?
>>     3b.  If 3a., is the pattern in use by networking_odl, receiving a
>> "session" and not "context", the specific legacy pattern in question?
>>     3c.  if 3a and 3b, are we expecting all the various plugins to
>> start using "context" at some point ?  (and when they come to me with
>> breakage, I can push them in that direction?)
>>
>> 4. What is the bug we are fixing by using deepcopy() with the
>> arguments passed to the decorated function:
>>
>>     4a.  Is it strictly that simple mutable lists and dicts of simple
>> types (ints, strings, etc.) are preserved across retries, assuming the
>> receiving functions might be modifying them?
>>
>>     4b. Does this deepcopy() approach have any support for functions
>> that are being passed not just simple structures but ORM-mapped
>> objects?   Was this use case considered?
>>
>>     4c. Does Neutron's model base have any support for deepcopy()?  I
>> notice that the base classes in model_base do **not** seem to have a
>> __deepcopy__ present.  This would mean that this deepcopy() will
>> definitely break any ORM mapped objects because you need to use an
>> approach like what Nova uses for copy():
>>
>> https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/models.py#L46
>> .  This would be why the developer's objects are all getting kicked
>> out of the session.   (this is the "is this a bug?" part)
>>
>>     4d. How frequent is the problem of mutable lists and dicts, and
>> should this complex and actually pretty expensive behavior be
>> optional, only for those functions that are known to receieve mutable
>> structures and change them in place?
>>
>> __________________________________________________________________________
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
>
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>



More information about the OpenStack-dev mailing list