<div dir="ltr"><div>Hello Zane, we applied the patch and modified our HAProxy setup. Unfortunately, it does not solve the DB deadlock issue.</div><div>Ignazio & Gianpiero<br></div></div><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr">On Wed, 2 Jan 2019 at 07:28, Zane Bitter <<a href="mailto:zbitter@redhat.com" target="_blank">zbitter@redhat.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 21/12/18 2:07 AM, Jay Pipes wrote:<br>
> On 12/20/2018 02:01 AM, Zane Bitter wrote:<br>
>> On 19/12/18 6:49 AM, Jay Pipes wrote:<br>
>>> On 12/18/2018 11:06 AM, Mike Bayer wrote:<br>
>>>> On Tue, Dec 18, 2018 at 12:36 AM Ignazio Cassano<br>
>>>> <<a href="mailto:ignaziocassano@gmail.com" target="_blank">ignaziocassano@gmail.com</a>> wrote:<br>
>>>>><br>
>>>>> Yes, I tried it yesterday and this workaround solved the problem.<br>
>>>>> Thanks<br>
>>>>> Ignazio<br>
>>>><br>
>>>> OK, so that means this "deadlock" is not really a deadlock but it is a<br>
>>>> write-conflict between two Galera masters. I have a long term<br>
>>>> goal to begin relaxing this common requirement that OpenStack apps<br>
>>>> only refer to one Galera master at a time. If this is a particular<br>
>>>> hotspot for Heat (no pun intended) can we pursue adding a transaction<br>
>>>> retry decorator for this operation? This is the standard approach for<br>
>>>> other applications that are subject to galera multi-master writeset<br>
>>>> conflicts such as Neutron.<br>
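<div>For readers unfamiliar with the pattern, here is a minimal sketch of a transaction retry decorator of the kind mentioned above. It is not Heat's, Neutron's, or oslo.db's actual code: <code>WriteConflictError</code>, <code>retry_on_conflict</code>, and <code>flaky_update</code> are hypothetical names, and the simulated failure stands in for whatever deadlock or write-conflict error the database driver would raise.</div>
<pre>
```python
import functools
import random
import time


class WriteConflictError(Exception):
    """Hypothetical stand-in for the driver's deadlock / write-conflict error."""


def retry_on_conflict(max_retries=5, base_delay=0.01):
    """Re-run the wrapped transaction when it raises WriteConflictError."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except WriteConflictError:
                    if attempt == max_retries:
                        raise  # retries exhausted, surface the error
                    # Jittered exponential backoff so that competing
                    # writers do not collide again in lockstep.
                    time.sleep(random.uniform(0, base_delay * 2 ** attempt))
        return wrapper
    return decorator


calls = {"count": 0}


@retry_on_conflict(max_retries=3, base_delay=0.001)
def flaky_update():
    # Simulated transaction: conflicts twice, then commits.
    calls["count"] += 1
    if calls["count"] < 3:
        raise WriteConflictError("simulated writeset conflict")
    return "committed"


result = flaky_update()
```
</pre>
<div>The whole transaction has to be inside the decorated function, since a conflict on commit invalidates everything done in that transaction, not just the last statement.</div>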
>><br>
>> The weird thing about this issue is that we actually have a retry <br>
>> decorator on the operation that I assume is the problem. It was added <br>
>> in Queens and largely fixed this issue in the gate:<br>
>><br>
>> <a href="https://review.openstack.org/#/c/521170/1/heat/db/sqlalchemy/api.py" rel="noreferrer" target="_blank">https://review.openstack.org/#/c/521170/1/heat/db/sqlalchemy/api.py</a><br>
>><br>
>>> Correct.<br>
>>><br>
>>> Heat doesn't use SELECT .. FOR UPDATE does it? That's also a big <br>
>>> cause of the aforementioned "deadlocks".<br>
>><br>
>> AFAIK, no. In fact we were quite careful to design stuff that is <br>
>> expected to be subject to write contention to use UPDATE ... WHERE (by <br>
>> doing query().filter_by().update() in sqlalchemy), but it turned out <br>
>> to be those very statements that were most prone to causing deadlocks <br>
>> in the gate (i.e. we added retry decorators in those two places and <br>
>> the failures went away), according to me in the commit message for <br>
>> that patch: <a href="https://review.openstack.org/521170" rel="noreferrer" target="_blank">https://review.openstack.org/521170</a><br>
>><br>
>> Are we Doing It Wrong(TM)?<br>
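<div>As an illustration of the UPDATE ... WHERE approach described above, here is a small sketch using the standard-library <code>sqlite3</code> module rather than SQLAlchemy or MySQL/Galera. The <code>resource</code> table, its columns, and the <code>try_update</code> helper are assumptions made for the example and do not reflect Heat's schema.</div>
<pre>
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resource (id INTEGER PRIMARY KEY, status TEXT, version INTEGER)"
)
conn.execute("INSERT INTO resource VALUES (1, 'IN_PROGRESS', 0)")
conn.commit()


def try_update(conn, res_id, expected_version, new_status):
    """Update the row only if it still has the version we last read."""
    cur = conn.execute(
        "UPDATE resource SET status = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_status, res_id, expected_version),
    )
    conn.commit()
    # rowcount is 1 if the expectation held, 0 if another writer got there first.
    return cur.rowcount == 1


first = try_update(conn, 1, 0, "COMPLETE")  # version matches, row is updated
second = try_update(conn, 1, 0, "FAILED")   # stale version, nothing is updated
row = conn.execute("SELECT status, version FROM resource WHERE id = 1").fetchone()
```
</pre>
<div>The check and the write happen in a single statement, so no SELECT ... FOR UPDATE is needed. On a single node the loser simply sees zero rows updated; the write conflicts discussed in this thread arise when such statements reach different Galera masters at the same time.</div>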
> <br>
> No, it looks to me like you're doing things correctly. The OP mentioned <br>
> that this only happens when deleting a Magnum cluster -- and that it <br>
> doesn't occur in normal Heat template usage.<br>
> <br>
> I wonder (as I really don't know anything about Magnum, unfortunately), <br>
> is there something different about the Magnum cluster resource handling <br>
> in Heat that might be causing the wonkiness?<br>
<br>
There's no special-casing for Magnum within Heat. It's likely to be just <br>
that there's a lot of resources in a Magnum cluster - or more <br>
specifically, a lot of edges in the resource graph, which leads to more <br>
write contention (and, in a multi-master setup, more write conflicts). <br>
I'd assume that any similarly-complex template would have the same <br>
issues, and that Ignazio just didn't have anything else that complex to <br>
hand.<br>
<br>
That gives me an idea, though. I wonder if this would help:<br>
<br>
<a href="https://review.openstack.org/627914" rel="noreferrer" target="_blank">https://review.openstack.org/627914</a><br>
<br>
Ignazio, could you possibly test with that ^ patch in multi-master mode <br>
to see if it resolves the issue?<br>
<br>
cheers,<br>
Zane.<br>
<br>
</blockquote></div></div>