Re: queens heat db deadlock

2 Jan 2019

On 21/12/18 2:07 AM, Jay Pipes wrote:
...
On 12/20/2018 02:01 AM, Zane Bitter wrote:
...
On 19/12/18 6:49 AM, Jay Pipes wrote:
...
On 12/18/2018 11:06 AM, Mike Bayer wrote:
...
On Tue, Dec 18, 2018 at 12:36 AM Ignazio Cassano
<ignaziocassano@gmail.com> wrote:
...
Yes, I tried it yesterday and this workaround solved the problem.
Thanks
Ignazio
OK, so that means this "deadlock" is not really a deadlock but a
write conflict between two Galera masters. I have a long-term
goal to begin relaxing this common requirement that OpenStack apps
only refer to one Galera master at a time. If this is a particular
hotspot for Heat (no pun intended), can we pursue adding a transaction
retry decorator for this operation? This is the standard approach for
other applications that are subject to Galera multi-master writeset
conflicts, such as Neutron.
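
For readers unfamiliar with the pattern, here is a minimal sketch of what a
transaction retry decorator can look like. It is not Heat's or Neutron's
actual code: the WriteConflict exception, the decorator name, and the
backoff parameters are all hypothetical stand-ins. In a real OpenStack
service the retry would normally come from oslo.db (wrap_db_retry) and be
triggered by the deadlock error the DB driver reports.

```python
import functools
import random
import time


class WriteConflict(Exception):
    """Hypothetical stand-in for the error the DB layer raises when Galera
    rejects a writeset (reported to the client as a "deadlock")."""


def retry_on_write_conflict(max_retries=5, base_delay=0.05, sleep=time.sleep):
    """Re-run the whole decorated function when it hits a write conflict.

    The decorated function must contain the entire transaction, so that a
    retry starts again from a clean state.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except WriteConflict:
                    if attempt == max_retries:
                        raise  # out of retries: surface the error
                    # Exponential backoff with jitter, so that two
                    # conflicting writers do not retry in lockstep.
                    delay = base_delay * (2 ** attempt)
                    sleep(delay * random.uniform(0.5, 1.5))
        return wrapper
    return decorator
```

The decorator only helps if it wraps the complete transaction; retrying a
single statement inside an already-failed transaction would not work.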
The weird thing about this issue is that we actually have a retry 
decorator on the operation that I assume is the problem. It was added 
in Queens and largely fixed this issue in the gate:
https://review.openstack.org/#/c/521170/1/heat/db/sqlalchemy/api.py
...
Correct.
Heat doesn't use SELECT ... FOR UPDATE, does it? That's also a big 
cause of the aforementioned "deadlocks".
AFAIK, no. In fact we were quite careful to design stuff that is 
expected to be subject to write contention to use UPDATE ... WHERE (by 
doing query().filter_by().update() in sqlalchemy), but it turned out 
to be those very statements that were most prone to causing deadlocks 
in the gate (i.e. we added retry decorators in those two places and 
the failures went away), according to me in the commit message for 
that patch: https://review.openstack.org/521170
Are we Doing It Wrong(TM)?
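
To illustrate the UPDATE ... WHERE pattern mentioned above, here is a
minimal sketch using the standard-library sqlite3 module rather than
SQLAlchemy, so that it runs on its own. The table and column names
(resource, status, version) and the try_update helper are hypothetical and
are not Heat's schema or API. The idea is that the WHERE clause carries the
value the writer expects to find, and the affected row count says whether
the write won or lost the race; SQLAlchemy's
query().filter_by().update() likewise returns the number of matched rows.

```python
import sqlite3


def try_update(conn, res_id, expected_version, new_status):
    """Atomically update a row only if nobody else changed it first.

    Returns True if this writer won, False if the row had already moved on
    (in which case the caller should re-read and decide what to do).
    """
    cur = conn.execute(
        "UPDATE resource SET status = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_status, res_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical one-row table standing in for real resource data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)"
)
conn.execute("INSERT INTO resource VALUES (1, 'CREATE_COMPLETE', 0)")
conn.commit()
```

No row lock is held between reading the row and writing it, which is why
this style avoids SELECT ... FOR UPDATE. On a multi-master Galera cluster
the statement can still fail at commit time if another node certified a
conflicting write first, hence the retry decorators.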
No, it looks to me like you're doing things correctly. The OP mentioned 
that this only happens when deleting a Magnum cluster -- and that it 
doesn't occur in normal Heat template usage.
I wonder (as I really don't know anything about Magnum, unfortunately), 
is there something different about the Magnum cluster resource handling 
in Heat that might be causing the wonkiness?
There's no special-casing for Magnum within Heat. It's likely to be just 
that there are a lot of resources in a Magnum cluster - or more 
specifically, a lot of edges in the resource graph, which leads to more 
write contention (and, in a multi-master setup, more write conflicts). 
I'd assume that any similarly-complex template would have the same 
issues, and that Ignazio just didn't have anything else that complex to 
hand.

That gives me an idea, though. I wonder if this would help:

https://review.openstack.org/627914

Ignazio, could you possibly test with that ^ patch in multi-master mode 
to see if it resolves the issue?

cheers,
Zane.