queens heat db deadlock

Jay Pipes jaypipes at gmail.com
Thu Dec 20 13:07:34 UTC 2018


On 12/20/2018 02:01 AM, Zane Bitter wrote:
> On 19/12/18 6:49 AM, Jay Pipes wrote:
>> On 12/18/2018 11:06 AM, Mike Bayer wrote:
>>> On Tue, Dec 18, 2018 at 12:36 AM Ignazio Cassano
>>> <ignaziocassano at gmail.com> wrote:
>>>>
>>>> Yes, I tried it yesterday and this workaround solved the problem.
>>>> Thanks
>>>> Ignazio
>>>
>>> OK, so that means this "deadlock" is not really a deadlock but it is a
>>> write-conflict between two Galera masters. I have a long term
>>> goal to begin relaxing this common requirement that OpenStack apps
>>> only refer to one Galera master at a time. If this is a particular
>>> hotspot for Heat (no pun intended) can we pursue adding a transaction
>>> retry decorator for this operation? This is the standard approach for
>>> other applications that are subject to Galera multi-master writeset
>>> conflicts such as Neutron.
> 
> The weird thing about this issue is that we actually have a retry 
> decorator on the operation that I assume is the problem. It was added in 
> Queens and largely fixed this issue in the gate:
> 
> https://review.openstack.org/#/c/521170/1/heat/db/sqlalchemy/api.py
> 
>> Correct.
>>
>> Heat doesn't use SELECT .. FOR UPDATE does it? That's also a big cause 
>> of the aforementioned "deadlocks".
> 
> AFAIK, no. In fact we were quite careful to design stuff that is 
> expected to be subject to write contention to use UPDATE ... WHERE (by 
> doing query().filter_by().update() in sqlalchemy), but it turned out to 
> be those very statements that were most prone to causing deadlocks in 
> the gate (i.e. we added retry decorators in those two places and the 
> failures went away), according to me in the commit message for that 
> patch: https://review.openstack.org/521170
> 
> Are we Doing It Wrong(TM)?

No, it looks to me like you're doing things correctly. The OP mentioned 
that this only happens when deleting a Magnum cluster -- and that it 
doesn't occur in normal Heat template usage.
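
For anyone following along, here is a rough sketch of the pattern being
discussed: an atomic UPDATE ... WHERE wrapped in a retry decorator. It is
not Heat's code. DeadlockError, retry_on_deadlock, try_lock and the
stack_lock table are all hypothetical names, and sqlite3 stands in for
SQLAlchemy on top of Galera only so the example runs on its own. In a real
OpenStack service, oslo.db's wrap_db_retry decorator fills the role of
retry_on_deadlock.

```python
import functools
import sqlite3
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for the error a writeset conflict surfaces as."""


def retry_on_deadlock(max_retries=3, interval=0.0):
    """Re-run the decorated function when it raises DeadlockError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except DeadlockError:
                    if attempt == max_retries:
                        raise  # out of retries, let the caller see it
                    time.sleep(interval * (attempt + 1))  # linear backoff
        return wrapper
    return decorator


@retry_on_deadlock(max_retries=3)
def try_lock(conn, stack_id, expected, new):
    """UPDATE ... WHERE: only succeeds if the row still holds `expected`."""
    with conn:  # one transaction per attempt, rolled back on exception
        cur = conn.execute(
            "UPDATE stack_lock SET engine_id = ? "
            "WHERE stack_id = ? AND engine_id = ?",
            (new, stack_id, expected),
        )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stack_lock (stack_id TEXT PRIMARY KEY, engine_id TEXT)"
)
conn.execute("INSERT INTO stack_lock VALUES ('s1', 'engine-a')")
conn.commit()

print(try_lock(conn, "s1", "engine-a", "engine-b"))  # True: row matched
print(try_lock(conn, "s1", "engine-a", "engine-c"))  # False: already changed
```

The point of the WHERE clause is that a lost race shows up as zero rows
updated rather than as a lock wait, so the only thing left to retry is the
conflict error raised by the database itself.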

I wonder (as I really don't know anything about Magnum, unfortunately), 
is there something different about the Magnum cluster resource handling 
in Heat that might be causing the wonkiness?

Best,
-jay


