queens heat db deadlock

Zane Bitter zbitter at redhat.com
Wed Jan 2 06:25:37 UTC 2019

On 21/12/18 2:07 AM, Jay Pipes wrote:
> On 12/20/2018 02:01 AM, Zane Bitter wrote:
>> On 19/12/18 6:49 AM, Jay Pipes wrote:
>>> On 12/18/2018 11:06 AM, Mike Bayer wrote:
>>>> On Tue, Dec 18, 2018 at 12:36 AM Ignazio Cassano
>>>> <ignaziocassano at gmail.com> wrote:
>>>>> Yes, I tried it yesterday and this workaround solved the problem.
>>>>> Thanks
>>>>> Ignazio
>>>> OK, so that means this "deadlock" is not really a deadlock but a
>>>> write conflict between two Galera masters. I have a long-term
>>>> goal to begin relaxing this common requirement that OpenStack apps
>>>> only refer to one Galera master at a time. If this is a particular
>>>> hotspot for Heat (no pun intended), can we pursue adding a transaction
>>>> retry decorator for this operation? This is the standard approach for
>>>> other applications that are subject to Galera multi-master writeset
>>>> conflicts, such as Neutron.
>> The weird thing about this issue is that we actually have a retry 
>> decorator on the operation that I assume is the problem. It was added 
>> in Queens and largely fixed this issue in the gate:
>> https://review.openstack.org/#/c/521170/1/heat/db/sqlalchemy/api.py
>>> Correct.
>>> Heat doesn't use SELECT .. FOR UPDATE does it? That's also a big 
>>> cause of the aforementioned "deadlocks".
>> AFAIK, no. In fact we were quite careful to design stuff that is 
>> expected to be subject to write contention to use UPDATE ... WHERE (by 
>> doing query().filter_by().update() in sqlalchemy), but it turned out 
>> to be those very statements that were most prone to causing deadlocks 
>> in the gate (i.e. we added retry decorators in those two places and 
>> the failures went away), according to me in the commit message for 
>> that patch: https://review.openstack.org/521170
>> Are we Doing It Wrong(TM)?
> No, it looks to me like you're doing things correctly. The OP mentioned 
> that this only happens when deleting a Magnum cluster -- and that it 
> doesn't occur in normal Heat template usage.
> I wonder (as I really don't know anything about Magnum, unfortunately), 
> is there something different about the Magnum cluster resource handling 
> in Heat that might be causing the wonkiness?

There's no special-casing for Magnum within Heat. It's likely to be just 
that there's a lot of resources in a Magnum cluster - or more 
specifically, a lot of edges in the resource graph, which leads to more 
write contention (and, in a multi-master setup, more write conflicts). 
I'd assume that any similarly-complex template would have the same 
issues, and that Ignazio just didn't have anything else that complex to
test with.

That gives me an idea, though. I wonder if this would help:


Ignazio, could you possibly test with that ^ patch in multi-master mode 
to see if it resolves the issue?
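
For anyone following the thread, here is a rough sketch of the pattern
discussed above: a conditional UPDATE ... WHERE wrapped in a retry
decorator. It is not the patch referred to above and not Heat's actual
code. Everything in it is an assumption made for illustration: the
DBDeadlock class, the retry_on_deadlock decorator, the "resource" table
and its "version" column are all hypothetical, sqlite3 stands in for
MySQL/Galera, and the conflict is simulated rather than real. Heat itself
relies on oslo.db and sqlalchemy for this rather than a hand-rolled
decorator.

```python
import functools
import sqlite3
import time


class DBDeadlock(Exception):
    """Hypothetical stand-in for the error raised on a writeset conflict."""


def retry_on_deadlock(max_retries=3, interval=0.01):
    """Re-run the decorated function when it raises DBDeadlock."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except DBDeadlock:
                    if attempt == max_retries:
                        raise  # out of retries, let the caller see it
                    time.sleep(interval * (2 ** attempt))  # back off
        return wrapper
    return decorator


@retry_on_deadlock(max_retries=3)
def update_if_unchanged(conn, res_id, expected, new, pending_conflicts):
    """Write only if 'version' still holds the value we read earlier.

    pending_conflicts is a list used purely to simulate conflicts: each
    entry makes one attempt fail before it touches the database.
    """
    if pending_conflicts:
        pending_conflicts.pop()
        raise DBDeadlock("simulated writeset conflict")
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE resource SET version = ? WHERE id = ? AND version = ?",
            (new, res_id, expected))
    return cur.rowcount == 1  # False means another writer got there first


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resource (id INTEGER PRIMARY KEY, version INTEGER)")
conn.execute("INSERT INTO resource VALUES (1, 0)")
conn.commit()

conflicts = [1, 1]  # the first two attempts will "deadlock"
first = update_if_unchanged(conn, 1, 0, 1, conflicts)
stale = update_if_unchanged(conn, 1, 0, 2, [])  # version is already 1
```

The point of the sketch is that the two failure modes are separate: a
conflict raises and is retried transparently, while a stale precondition
is not an error at all; the statement matches zero rows and the caller
has to decide what to do next.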


