<div dir="auto">Hello, seems deadlock happens when heat craates network objects.<div dir="auto">I am not sure, but if I remember some db deadlock also in cinder.</div><div dir="auto">Any case with the workaround heat never fails for db deadlock and Now stack deleting do not stop.</div><div dir="auto">Ignazio</div></div><br><div class="gmail_quote"><div dir="ltr">Il giorno Mar 18 Dic 2018 17:06 Mike Bayer <<a href="mailto:mike_mp@zzzcomputing.com">mike_mp@zzzcomputing.com</a>> ha scritto:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Tue, Dec 18, 2018 at 12:36 AM Ignazio Cassano<br>
<<a href="mailto:ignaziocassano@gmail.com" target="_blank" rel="noreferrer">ignaziocassano@gmail.com</a>> wrote:<br>
>
> Yes, I tried it yesterday and the workaround solved the problem.
> Thanks
> Ignazio

OK, so that means this "deadlock" is not really a deadlock but a
write conflict between two Galera masters. I have a long-term
goal of relaxing the common requirement that OpenStack apps
only refer to one Galera master at a time. If this is a particular
hotspot for Heat (no pun intended), can we pursue adding a transaction
retry decorator for this operation? This is the standard approach for
other applications that are subject to Galera multi-master writeset
conflicts, such as Neutron.
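For illustration, the kind of retry decorator described here is usually built on
oslo.db's wrap_db_retry (the same mechanism Neutron relies on). A minimal sketch,
assuming the session comes from oslo.db's enginefacade so that the pymysql error
is translated into DBDeadlock; the helper function and table below are
hypothetical, not Heat's actual DB API:

    from oslo_db import api as oslo_db_api
    from sqlalchemy import text

    # wrap_db_retry re-runs the decorated function when the driver raises a
    # deadlock error (MySQL error 1213, which is how Galera surfaces a
    # multi-master writeset conflict), backing off between attempts.
    @oslo_db_api.wrap_db_retry(max_retries=5, retry_on_deadlock=True,
                               retry_interval=0.5, inc_retry_interval=True)
    def set_resource_status(session, resource_id, status):
        # Hypothetical helper for illustration only.
        session.execute(
            text("UPDATE resource SET status = :status WHERE id = :id"),
            {"status": status, "id": resource_id})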
>
> On Mon, Dec 17, 2018 at 8:38 PM Joe Topjian <joe@topjian.net> wrote:
>>
>> Hi Ignazio,
>>
>> Do you currently have HAProxy configured to route requests to multiple MariaDB nodes? If so, as a workaround, try an active/backup configuration where all but one node is configured as an HAProxy "backup".
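For illustration, an active/backup MariaDB backend in HAProxy might look roughly
like the sketch below; the listener address, node names and IPs are placeholders.
Only db1 receives traffic, and db2/db3 are used only if db1 fails its health check:

    listen mariadb
        bind 10.0.0.10:3306
        mode tcp
        option tcpka
        timeout client 90m
        timeout server 90m
        server db1 10.0.0.11:3306 check
        server db2 10.0.0.12:3306 check backup
        server db3 10.0.0.13:3306 check backup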
>>
>> Thanks,
>> Joe
>>
>>
>>
>> On Sun, Dec 16, 2018 at 10:46 PM Ignazio Cassano <ignaziocassano@gmail.com> wrote:
>>>
>>> Hello Zane, it also happens when I create a Magnum cluster.
>>> My MariaDB cluster is behind HAProxy.
>>> Do you need any other info?
>>> Thanks
>>> Ignazio
>>>
>>>
>>> On Mon, Dec 17, 2018 at 12:28 AM Zane Bitter <zbitter@redhat.com> wrote:
>>>>
>>>> On 14/12/18 4:06 AM, Ignazio Cassano wrote:
>>>> > Hi All,
>>>> > I installed Queens on CentOS 7.
>>>> > Heat seems to work fine with the templates I wrote, but when I create a
>>>> > Magnum cluster I often run into db deadlocks in the heat-engine log:
>>>>
>>>> The stack trace below is from a stack delete, so do you mean that the
>>>> problem occurs when deleting a Magnum cluster, or can it occur at any
>>>> time with a Magnum cluster?
>>>>
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource Traceback (most recent call last):
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 918, in _action_recorder
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     yield
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 2035, in delete
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     *action_args)
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 346, in wrapper
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     step = next(subtask)
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 977, in action_handler_task
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     done = check(handler_data)
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 587, in check_delete_complete
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     return self._check_status_complete(self.DELETE)
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 454, in _check_status_complete
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     action=action)
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource ResourceFailure: resources[0]: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') (Background on this error at: http://sqlalche.me/e/2j85)
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource
>>>> > 2018-12-13 11:48:46.030 89597 INFO heat.engine.stack [req-9a43184f-77d8-4fad-ab9e-2b9826c10b70 - admin - default default] Stack DELETE FAILED (swarm-clustergp27-ebhsalhb4bop-swarm_primary_master-tzgkh3ncmymw): Resource DELETE failed: resources[0]: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') (Background on this error at: http://sqlalche.me/e/2j85)
>>>> > 2018-12-13 11:48:46.844 89595 INFO heat.engine.resource [req-9a43184f-77d8-4fad-ab9e-2b9826c10b70 - admin - default default] DELETE: ResourceGroup "swarm_primary_master" [6984db3c-3ac1-4afc-901f-21b3e7f230a7] Stack "swarm-clustergp27-ebhsalhb4bop" [b43256fa-52e2-4613-ac15-63fd9340a8be]
>>>> >
>>>> >
>>>> > I read that this issue was solved in Heat version 10, but it seems it was not.
>>>>
>>>> The patch for bug 1732969, which is in Queens, is probably the fix
>>>> you're referring to: https://review.openstack.org/#/c/521170/1
>>>>
>>>> The traceback in that bug included the SQL statement that was failing.
>>>> I'm not sure whether it's missing from your traceback because SQLAlchemy
>>>> didn't report it, or because your traceback is actually from the parent
>>>> stack. If you have a traceback from the original failure in the child
>>>> stack, that would be useful. If there's a way to turn on more detailed
>>>> reporting of errors in SQLAlchemy, that would also be useful.
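For what it's worth, oslo.db does expose options for more verbose SQL reporting,
which services built on oslo.db (Heat included) read from the [database] section
of their config file. A minimal sketch, with example values only (this logs a lot):

    [database]
    # Verbosity of SQL debugging information: 0 = none, 100 = everything.
    connection_debug = 100
    # Add a Python stack trace to each SQL statement as a comment string.
    connection_trace = true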
>>>>
>>>> Since this is a delete, it's possible that we need a retry-on-deadlock
>>>> on the resource_delete() function also (though resource_update() is
>>>> actually used more during a delete to update the status).
>>>>
>>>> - ZB
>>>>