[openstack-dev] [neutron][oslo.db] db retry madness

Kevin Benton kevin at benton.pub
Fri Sep 8 12:53:12 UTC 2017


Thanks for asking this. It's good to get this documented on the mailing
list since most of it is spread across bug reports and commits over the
last couple of years.

>1. since the retry_if_session_inactive decorator assumes functions that
receive a "context", what does it mean that networking_odl uses functions
that receive a "session" and not a "context" ?   Should networking_odl be
modified such that it accepts the same "context" that Neutron passes around?
>
>2. Alternatively, should the retry_if_session_inactive decorator be modified
(or a new decorator added) that provides the identical behavior for
functions that receive a "session" argument rather than a "context"
argument ?  (or should this be done anyway even if networking_odl's pattern
is legacy?)

So it really depends on how those particular functions are being called. For
example, it seems like this function in the review is always going to be
called within an active session:
https://review.openstack.org/#/c/500584/3/networking_odl/db/db.py@113  If
that's the case then the decorator is never going to trigger a retry since
is_active will be True.

Since the goal of that patch is to deal with deadlocks, the retry can't
happen down at that level, it will need to be somewhere further up the call
stack since the deadlock will put the session in a rollback state.

For the functions being called outside of an active session, they should
switch to accepting a context to handle the enginefacade switch.

>  3a.  Is this a temporary measure to work around legacy patterns?

Yes. When that went in we had very little adoption of the enginefacade.
Even now we haven't fully completed the transition so checking for
is_active covers the old code paths.

As we continue migrating to the enginefacade, the cases where you can get a
reference to a session that actually has 'is_active==False' are going to
keep dwindling. Once we complete the switch and remove the legacy session
constructor, the decorator will just be adjusted to check if there is a
session attached to the context to determine if retries are safe since
is_active will always be True.

> 3b.  If 3a., is the pattern in use by networking_odl, receiving a "session"
and not "context", the specific legacy pattern in question?

If those functions are called outside of an active session, then yes. Based
on the session.begin() call without substransactions=True, I'm guessing
that's the case for at least some of them:
https://review.openstack.org/#/c/500584/3/networking_odl/db/db.py@85

>3c.  if 3a and 3b, are we expecting all the various plugins to start using
"context" at some point ?  (and when they come to me with breakage, I can
push them in that direction?)

Yeah, for any functions that expect to start their own transaction they
should altered to accept contexts and use the reader/writer context
managers. If it's a function that is called from within an ongoing
transaction, it can obviously still accept a session argument but the
'retry_if_session_inactive' decorator would obviously be irrelevant in that
scenario since the session would always be active.

> 4a.  Is it strictly that simple mutable lists and dicts of simple types
(ints, strings, etc.) are preserved across retries, assuming the receiving
functions might be modifying them?

Yes, this was to deal with functions that would mutate collections passed
in. In particular, we have functions that alter the parsed API request as
they process them. If they hit a deadlock on commit and we retry at that
point the API request would be mangled without the copy. (e.g.
https://bugs.launchpad.net/neutron/+bug/1584920)

>4b. Does this deepcopy() approach have any support for functions that are
being passed not just simple structures but ORM-mapped objects?   Was this
use case considered?

No, it doesn't support ORM objects. Given that the target use case of the
retry decorator was for functions not in an ongoing session, we didn't have
a pattern to support in the main tree where an ORM object would be passed
as an argument to the function that needed to be protected.

>4c. Does Neutron's model base have any support for deepcopy()?  I notice
that the base classes in model_base do **not** seem to have a __deepcopy__
present.  This would mean that this deepcopy() will definitely break any
ORM mapped objects because you need to use an approach like what Nova uses
for copy(): https://github.com/openstack/nova/blob/master/nova/db/
sqlalchemy/models.py#L46 . This would be why the developer's objects are
all getting kicked
out of the session.   (this is the "is this a bug?" part)

That's definitely why the objects are being kicked out of the session. I'm
not sure if it makes sense to attempt to support copying an ORM object when
the decorator is meant for functions starting a completely new session. If
there is a use case I'm missing that we do want to support, it probably
makes more sense to use a different decorator than to adjust the existing
one.

>4d. How frequent is the problem of mutable lists and dicts, and should
this complex and actually pretty expensive behavior be optional, only for
those functions that are known to receieve mutable structures and change
them in place?

It's difficult to tell. The issue was that the retry logic was added much
later to the API level so we needed a generically safe solution to protect
every possible API operation. It's copying things passed into the API
request so they are usually dictionaries with fewer than 10 keys and values
with small strings or lists of only a few entries, making the copy
operations cheap.

If we did want to get rid of the copy operation, it would probably be
quicker to add a new decorator that doesn't copy and then slowly opt into
for each function that we validate as 'side-effect free' on the inputs.


I hope that makes sense. Let me know if you have any other questions.

Cheers,
Kevin Benton

On Thu, Sep 7, 2017 at 8:11 AM, Michael Bayer <mbayer at redhat.com> wrote:

> I'm trying to get a handle on neutron's retry decorators.    I'm
> assisting a developer in the networking_odl project who wishes to make
> use of neutron's retry_if_session_inactive decorator.
>
> The developer has run into several problems, including that
> networking_odl methods here seem to accept a "session" directly,
> rather than the "context" as is standard throughout neutron, and
> retry_if_session_inactive assumes the context calling pattern.
> Additionally, the retry_db_errors() decorator which is also used by
> retry_if_session_inactive is doing a deepcopy operation on arguments,
> which when applied to ORM-mapped entities, makes a copy of them which
> are then detached from the Session and they fail to function beyond
> that.
>
> For context, the developer's patch is at:
> https://review.openstack.org/#/c/500584/3
>
> and the neutron features I'm looking at can be seen when they were
> reviewed at: https://review.openstack.org/#/c/356530/22
>
>
> So I have a bunch of questions, many are asking the same things: "is
> networking_odl's pattern legacy or not?" and "is this retry decorator
> buggy?"   There's overlap in these questions.  But "yes" / "no" to
> each would still help me figure out that this is a giraffe and not a
> duck :) ("does it have a long neck?" "does it quack?" etc)
>
> 1. since the retry_if_session_inactive decorator assumes functions
> that receive a "context", what does it mean that networking_odl uses
> functions that receive a "session" and not a "context" ?   Should
> networking_odl be modified such that it accepts the same "context"
> that Neutron passes around?
>
> 2. Alternatively, should the retry_if_session_inactive decorator be
> modified (or a new decorator added) that provides the identical
> behavior for functions that receive a "session" argument rather than a
> "context" argument ?  (or should this be done anyway even if
> networking_odl's pattern is legacy?)
>
> 3. I notice that the session.is_active flag is considered to be
> meaningful.   Normally, this flag is of no use, as the enginefacade
> pattern only provides for a Session that has had the begin() method
> called.  However, it looks like in neutron_lib/context.py, the Context
> class is providing for a default session
> "db_api.get_writer_session()", which I would assume is how we get a
> Session with is_active is false:
>
>     3a.  Is this a temporary measure to work around legacy patterns?
>     3b.  If 3a., is the pattern in use by networking_odl, receiving a
> "session" and not "context", the specific legacy pattern in question?
>     3c.  if 3a and 3b, are we expecting all the various plugins to
> start using "context" at some point ?  (and when they come to me with
> breakage, I can push them in that direction?)
>
> 4. What is the bug we are fixing by using deepcopy() with the
> arguments passed to the decorated function:
>
>     4a.  Is it strictly that simple mutable lists and dicts of simple
> types (ints, strings, etc.) are preserved across retries, assuming the
> receiving functions might be modifying them?
>
>     4b. Does this deepcopy() approach have any support for functions
> that are being passed not just simple structures but ORM-mapped
> objects?   Was this use case considered?
>
>     4c. Does Neutron's model base have any support for deepcopy()?  I
> notice that the base classes in model_base do **not** seem to have a
> __deepcopy__ present.  This would mean that this deepcopy() will
> definitely break any ORM mapped objects because you need to use an
> approach like what Nova uses for copy():
> https://github.com/openstack/nova/blob/master/nova/db/
> sqlalchemy/models.py#L46
> .  This would be why the developer's objects are all getting kicked
> out of the session.   (this is the "is this a bug?" part)
>
>     4d. How frequent is the problem of mutable lists and dicts, and
> should this complex and actually pretty expensive behavior be
> optional, only for those functions that are known to receieve mutable
> structures and change them in place?
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openstack.org/pipermail/openstack-dev/attachments/20170908/a2ec831d/attachment.html>


More information about the OpenStack-dev mailing list