On 21/12/18 2:07 AM, Jay Pipes wrote:
> On 12/20/2018 02:01 AM, Zane Bitter wrote:
>> On 19/12/18 6:49 AM, Jay Pipes wrote:
>>> On 12/18/2018 11:06 AM, Mike Bayer wrote:
>>>> On Tue, Dec 18, 2018 at 12:36 AM Ignazio Cassano
>>>> <ignaziocassano@gmail.com> wrote:
>>>>>
>>>>> Yes, I tried it yesterday and this workaround solved the problem.
>>>>> Thanks
>>>>> Ignazio
>>>>
>>>> OK, so that means this "deadlock" is not really a deadlock but it is a
>>>> write-conflict between two Galera masters. I have a long term
>>>> goal to begin relaxing this common requirement that OpenStack apps
>>>> only refer to one Galera master at a time. If this is a particular
>>>> hotspot for Heat (no pun intended) can we pursue adding a transaction
>>>> retry decorator for this operation? This is the standard approach for
>>>> other applications that are subject to Galera multi-master writeset
>>>> conflicts such as Neutron.
>>
>> The weird thing about this issue is that we actually have a retry
>> decorator on the operation that I assume is the problem. It was added
>> in Queens and largely fixed this issue in the gate:
>>
>> https://review.openstack.org/#/c/521170/1/heat/db/sqlalchemy/api.py
>>
>>> Correct.
>>>
>>> Heat doesn't use SELECT .. FOR UPDATE does it? That's also a big
>>> cause of the aforementioned "deadlocks".
>>
>> AFAIK, no. In fact we were quite careful to design stuff that is
>> expected to be subject to write contention to use UPDATE ... WHERE (by
>> doing query().filter_by().update() in sqlalchemy), but it turned out
>> to be those very statements that were most prone to causing deadlocks
>> in the gate (i.e. we added retry decorators in those two places and
>> the failures went away), according to me in the commit message for
>> that patch: https://review.openstack.org/521170
>>
>> Are we Doing It Wrong(TM)?
>
> No, it looks to me like you're doing things correctly. The OP mentioned
> that this only happens when deleting a Magnum cluster -- and that it
> doesn't occur in normal Heat template usage.
>
> I wonder (as I really don't know anything about Magnum, unfortunately),
> is there something different about the Magnum cluster resource handling
> in Heat that might be causing the wonkiness?

There's no special-casing for Magnum within Heat. It's likely just that
there are a lot of resources in a Magnum cluster - or more specifically,
a lot of edges in the resource graph, which leads to more write
contention (and, in a multi-master setup, more write conflicts). I'd
assume that any similarly complex template would have the same issues,
and that Ignazio just didn't have anything else that complex to hand.

That gives me an idea, though. I wonder if this would help:

https://review.openstack.org/627914

Ignazio, could you possibly test with that ^ patch in multi-master mode
to see if it resolves the issue?

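
For anyone following along who hasn't seen the pattern discussed
upthread, here is a rough, self-contained sketch of a retry decorator
wrapped around an UPDATE ... WHERE. This is not Heat's code: it uses the
stdlib sqlite3 module instead of SQLAlchemy, and the table, the
WriteConflict exception, and all function names are made up for
illustration. The conflict is simulated, since sqlite can't reproduce a
Galera writeset conflict.

```python
import functools
import sqlite3
import time


class WriteConflict(Exception):
    """Hypothetical stand-in for the deadlock error a real driver raises."""


def retry_on_conflict(max_retries=3, delay=0.0):
    """Re-run the wrapped function when it raises WriteConflict."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except WriteConflict:
                    if attempt == max_retries:
                        raise  # out of retries, let the caller see it
                    time.sleep(delay)
        return wrapper
    return decorator


def make_db():
    """Create an in-memory table with one row at version 0."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE resource (id INTEGER PRIMARY KEY, version INTEGER)")
    conn.execute("INSERT INTO resource (id, version) VALUES (1, 0)")
    conn.commit()
    return conn


def try_bump(conn, resource_id, expected_version):
    """UPDATE ... WHERE: succeeds only if the row is still at
    expected_version, so no SELECT ... FOR UPDATE is needed."""
    cur = conn.execute(
        "UPDATE resource SET version = version + 1 "
        "WHERE id = ? AND version = ?",
        (resource_id, expected_version))
    conn.commit()
    return cur.rowcount == 1


calls = {"n": 0}  # counts attempts so the simulated conflict can stop


@retry_on_conflict(max_retries=3)
def bump_with_simulated_conflict(conn, resource_id, expected_version):
    calls["n"] += 1
    if calls["n"] < 3:
        # Pretend the first two attempts lost a multi-master conflict.
        raise WriteConflict("simulated conflict")
    return try_bump(conn, resource_id, expected_version)
```

Note that the decorator only retries on the conflict exception. When the
UPDATE matches zero rows, someone else legitimately got there first, and
that is reported to the caller as False rather than retried.
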
cheers,
Zane.