<div dir="auto">Hello, it seems the deadlock happens when Heat creates network objects.<div dir="auto">I am not sure, but I think I remember seeing some db deadlocks in Cinder as well.</div><div dir="auto">In any case, with the workaround Heat no longer fails with db deadlocks, and stack deletion no longer stops.</div><div dir="auto">Ignazio</div></div><br><div class="gmail_quote"><div dir="ltr">On Tue, Dec 18, 2018 at 17:06 Mike Bayer <<a href="mailto:mike_mp@zzzcomputing.com">mike_mp@zzzcomputing.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Tue, Dec 18, 2018 at 12:36 AM Ignazio Cassano<br>

<<a href="mailto:ignaziocassano@gmail.com" target="_blank" rel="noreferrer">ignaziocassano@gmail.com</a>> wrote:<br>

><br>

> Yes, I tried it yesterday and this workaround solved the problem.<br>

> Thanks<br>

> Ignazio<br>

<br>

OK, so that means this "deadlock" is not really a deadlock but it is a<br>

write-conflict between two Galera masters. I have a long-term<br>

goal to begin relaxing this common requirement that OpenStack apps<br>

only refer to one Galera master at a time. If this is a particular<br>

hotspot for Heat (no pun intended) can we pursue adding a transaction<br>

retry decorator for this operation?  This is the standard approach for<br>

other applications that are subject to Galera multi-master writeset<br>

conflicts such as Neutron.<br>
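<br>
A minimal sketch of what such a transaction retry decorator could look like is shown here, in plain Python with no database dependency. The names `DeadlockError`, `retry_on_deadlock` and `delete_resource` are hypothetical and only illustrate the pattern; they are not Heat's code. In OpenStack services this job is normally done by oslo.db's `wrap_db_retry` decorator with `retry_on_deadlock=True`.<br>

```python
import functools
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a driver/ORM deadlock error (MySQL 1213)."""


def retry_on_deadlock(max_retries=5, base_delay=0.05, sleep=time.sleep):
    """Re-run the wrapped function when it raises DeadlockError.

    The wrapped function must contain the whole transaction, so that each
    attempt starts again from a clean state.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except DeadlockError:
                    if attempt == max_retries:
                        raise  # out of retries: surface the error
                    # Exponential backoff with jitter so that competing
                    # writers do not retry in lockstep.
                    delay = base_delay * (2 ** attempt)
                    sleep(random.uniform(0, delay))
        return wrapper
    return decorator


# Usage sketch: a fake delete that conflicts twice, then succeeds.
calls = {"n": 0}


@retry_on_deadlock(max_retries=3, sleep=lambda _: None)
def delete_resource(resource_id):
    calls["n"] += 1
    if calls["n"] < 3:
        raise DeadlockError("1213: Deadlock found when trying to get lock")
    return "deleted %s" % resource_id


result = delete_resource("res-1")
```

The important design point is that the retry wraps the entire transaction rather than a single statement: after a Galera writeset conflict the transaction has been rolled back, so every statement in it has to be replayed.<br>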

<br>

<br>

<br>

<br>

><br>

> On Mon, Dec 17, 2018 at 20:38 Joe Topjian <<a href="mailto:joe@topjian.net" target="_blank" rel="noreferrer">joe@topjian.net</a>> wrote:<br>

>><br>

>> Hi Ignazio,<br>

>><br>

>> Do you currently have HAProxy configured to route requests to multiple MariaDB nodes? If so, as a workaround, try doing an active/backup configuration where all but 1 node is configured as an HAProxy "backup".<br>

>><br>

>> Thanks,<br>

>> Joe<br>

>><br>

>><br>

>><br>

>> On Sun, Dec 16, 2018 at 10:46 PM Ignazio Cassano <<a href="mailto:ignaziocassano@gmail.com" target="_blank" rel="noreferrer">ignaziocassano@gmail.com</a>> wrote:<br>

>>><br>

>>> Hello Zane, it also happens when I create a Magnum cluster.<br>

>>> My mariadb cluster is behind haproxy.<br>

>>> Do you need any other info?<br>

>>> Thanks<br>

>>> Ignazio<br>

>>><br>

>>><br>

>>> On Mon, Dec 17, 2018 at 00:28 Zane Bitter <<a href="mailto:zbitter@redhat.com" target="_blank" rel="noreferrer">zbitter@redhat.com</a>> wrote:<br>

>>>><br>

>>>> On 14/12/18 4:06 AM, Ignazio Cassano wrote:<br>

>>>> > Hi All,<br>

>>>> > I installed Queens on CentOS 7.<br>

>>>> > Heat seems to work fine with templates I wrote, but when I create a Magnum<br>

>>>> > cluster<br>

>>>> > I often see a db deadlock in the heat-engine log:<br>

>>>><br>

>>>> The stacktrace below is in stack delete, so do you mean that the problem<br>

>>>> occurs when deleting a Magnum cluster, or can it occur any time with a<br>

>>>> magnum cluster?<br>

>>>><br>

>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource Traceback (most<br>

>>>> > recent call last):<br>

>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File<br>

>>>> > "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 918, in<br>

>>>> > _action_recorder<br>

>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     yield<br>

>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File<br>

>>>> > "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 2035,<br>

>>>> > in delete<br>

>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     *action_args)<br>

>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File<br>

>>>> > "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 346,<br>

>>>> > in wrapper<br>

>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     step =<br>

>>>> > next(subtask)<br>

>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File<br>

>>>> > "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 977, in<br>

>>>> > action_handler_task<br>

>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     done =<br>

>>>> > check(handler_data)<br>

>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File<br>

>>>> > "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py",<br>

>>>> > line 587, in check_delete_complete<br>

>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     return<br>

>>>> > self._check_status_complete(self.DELETE)<br>

>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File<br>

>>>> > "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py",<br>

>>>> > line 454, in _check_status_complete<br>

>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     action=action)<br>

>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource<br>

>>>> > ResourceFailure: resources[0]: (pymysql.err.InternalError) (1213,<br>

>>>> > u'Deadlock found when trying to get lock; try restarting transaction')<br>

>>>> > (Background on this error at: <a href="http://sqlalche.me/e/2j85" rel="noreferrer noreferrer" target="_blank">http://sqlalche.me/e/2j85</a>)<br>

>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource<br>

>>>> > 2018-12-13 11:48:46.030 89597 INFO heat.engine.stack<br>

>>>> > [req-9a43184f-77d8-4fad-ab9e-2b9826c10b70 - admin - default default]<br>

>>>> > Stack DELETE FAILED<br>

>>>> > (swarm-clustergp27-ebhsalhb4bop-swarm_primary_master-tzgkh3ncmymw):<br>

>>>> > Resource DELETE failed: resources[0]: (pymysql.err.InternalError) (1213,<br>

>>>> > u'Deadlock found when trying to get lock; try restarting transaction')<br>

>>>> > (Background on this error at: <a href="http://sqlalche.me/e/2j85" rel="noreferrer noreferrer" target="_blank">http://sqlalche.me/e/2j85</a>)<br>

>>>> > 2018-12-13 11:48:46.844 89595 INFO heat.engine.resource<br>

>>>> > [req-9a43184f-77d8-4fad-ab9e-2b9826c10b70 - admin - default default]<br>

>>>> > DELETE: ResourceGroup "swarm_primary_master"<br>

>>>> > [6984db3c-3ac1-4afc-901f-21b3e7f230a7] Stack<br>

>>>> > "swarm-clustergp27-ebhsalhb4bop" [b43256fa-52e2-4613-ac15-63fd9340a8be]<br>

>>>> ><br>

>>>> ><br>

>>>> > I read that this issue was fixed in Heat version 10, but it seems it was not.<br>

>>>><br>

>>>> The patch for bug 1732969, which is in Queens, is probably the fix<br>

>>>> you're referring to: <a href="https://review.openstack.org/#/c/521170/1" rel="noreferrer noreferrer" target="_blank">https://review.openstack.org/#/c/521170/1</a><br>

>>>><br>

>>>> The traceback in that bug included the SQL statement that was failing.<br>

>>>> I'm not sure if that isn't included in your traceback because SQLAlchemy<br>

>>>> didn't report it, or if it's because that traceback is actually from the<br>

>>>> parent stack. If you have a traceback from the original failure in the<br>

>>>> child stack that would be useful. If there's a way to turn on more<br>

>>>> detailed reporting of errors in SQLAlchemy that would also be useful.<br>
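<br>
On the question of more detailed SQLAlchemy reporting: one general option, sketched here in plain Python, is to raise the level of SQLAlchemy's `sqlalchemy.engine` logger. At INFO it logs each SQL statement and its parameters, and at DEBUG it also logs result rows. How this would be wired into heat-engine is an assumption; for a deployed service the oslo.db option `connection_debug` in the `[database]` section of the service configuration is probably the more practical switch.<br>

```python
import logging

# Keep the root logger quiet, but let SQLAlchemy's engine logger through.
logging.basicConfig(level=logging.WARNING)

engine_logger = logging.getLogger("sqlalchemy.engine")
# INFO: log SQL statements and their parameters.
# DEBUG would additionally log result rows.
engine_logger.setLevel(logging.INFO)
```

With statement logging enabled, the statement that hit error 1213 should appear in the log just before the traceback, which helps tell whether the failure comes from the child stack or the parent.<br>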

>>>><br>

>>>> Since this is a delete, it's possible that we need a retry-on-deadlock<br>

>>>> on the resource_delete() function also (though resource_update() is<br>

>>>> actually used more during a delete to update the status).<br>

>>>><br>

>>>> - ZB<br>

>>>><br>

</blockquote></div>