queens heat db deadlock

Mike Bayer mike_mp at zzzcomputing.com
Tue Dec 18 16:06:19 UTC 2018


On Tue, Dec 18, 2018 at 12:36 AM Ignazio Cassano
<ignaziocassano at gmail.com> wrote:
>
> Yes, I tried it yesterday and this workaround solved the problem.
> Thanks
> Ignazio

OK, so that means this "deadlock" is not really a deadlock but a
write conflict between two Galera masters. I have a long-term goal
of relaxing the common requirement that OpenStack apps only refer to
one Galera master at a time. If this is a particular hotspot for
Heat (no pun intended), can we pursue adding a transaction retry
decorator for this operation? This is the standard approach for
other applications that are subject to Galera multi-master writeset
conflicts, such as Neutron.
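
For readers unfamiliar with the pattern, here is a minimal sketch of
such a retry decorator. It is a sketch only: in practice an OpenStack
project would use oslo.db's wrap_db_retry rather than hand-rolling
this, and the attempt count and delay values are purely illustrative.

    import functools
    import time

    from oslo_db import exception as db_exc

    def retry_on_deadlock(func, attempts=3, delay=0.5):
        """Re-run a transactional DB function on deadlock errors."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except db_exc.DBDeadlock:
                    # A Galera writeset conflict is reported as MySQL
                    # error 1213 ("deadlock"), which oslo.db maps to
                    # DBDeadlock; the transaction is safe to re-run.
                    if attempt == attempts - 1:
                        raise
                    time.sleep(delay * (attempt + 1))
        return wrapper

Applied as @retry_on_deadlock to a function that performs the whole
transaction, this simply re-runs the transaction when the conflict
occurs.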




>
> On Mon, Dec 17, 2018 at 8:38 PM Joe Topjian <joe at topjian.net> wrote:
>>
>> Hi Ignazio,
>>
>> Do you currently have HAProxy configured to route requests to multiple MariaDB nodes? If so, as a workaround, try an active/backup configuration where all but one node is configured as an HAProxy "backup".
>>
>> Thanks,
>> Joe
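
For reference, the active/backup arrangement Joe describes would look
something like the following in haproxy.cfg (the server names and
addresses here are examples, not taken from this thread):

    listen mariadb
        bind 0.0.0.0:3306
        mode tcp
        option tcpka
        server galera1 10.0.0.11:3306 check
        server galera2 10.0.0.12:3306 check backup
        server galera3 10.0.0.13:3306 check backup

With only galera1 receiving traffic in normal operation, all writes
go through a single Galera node, so the multi-master writeset
conflicts described above cannot arise.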
>>
>>
>>
>> On Sun, Dec 16, 2018 at 10:46 PM Ignazio Cassano <ignaziocassano at gmail.com> wrote:
>>>
>>> Hello Zane, it also happens when I create a Magnum cluster.
>>> My MariaDB cluster is behind HAProxy.
>>> Do you need any other info?
>>> Thanks
>>> Ignazio
>>>
>>>
>>> On Mon, Dec 17, 2018 at 12:28 AM Zane Bitter <zbitter at redhat.com> wrote:
>>>>
>>>> On 14/12/18 4:06 AM, Ignazio Cassano wrote:
>>>> > Hi All,
>>>> > I installed Queens on CentOS 7.
>>>> > Heat seems to work fine with templates I wrote, but when I create a
>>>> > Magnum cluster I often face a DB deadlock in the heat-engine log:
>>>>
>>>> The stack trace below is from a stack delete, so do you mean that the
>>>> problem occurs when deleting a Magnum cluster, or can it occur at any
>>>> time with a Magnum cluster?
>>>>
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource Traceback (most recent call last):
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 918, in _action_recorder
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     yield
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 2035, in delete
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     *action_args)
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 346, in wrapper
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     step = next(subtask)
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 977, in action_handler_task
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     done = check(handler_data)
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 587, in check_delete_complete
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     return self._check_status_complete(self.DELETE)
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 454, in _check_status_complete
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     action=action)
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource ResourceFailure: resources[0]: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') (Background on this error at: http://sqlalche.me/e/2j85)
>>>> > 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource
>>>> > 2018-12-13 11:48:46.030 89597 INFO heat.engine.stack [req-9a43184f-77d8-4fad-ab9e-2b9826c10b70 - admin - default default] Stack DELETE FAILED (swarm-clustergp27-ebhsalhb4bop-swarm_primary_master-tzgkh3ncmymw): Resource DELETE failed: resources[0]: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') (Background on this error at: http://sqlalche.me/e/2j85)
>>>> > 2018-12-13 11:48:46.844 89595 INFO heat.engine.resource [req-9a43184f-77d8-4fad-ab9e-2b9826c10b70 - admin - default default] DELETE: ResourceGroup "swarm_primary_master" [6984db3c-3ac1-4afc-901f-21b3e7f230a7] Stack "swarm-clustergp27-ebhsalhb4bop" [b43256fa-52e2-4613-ac15-63fd9340a8be]
>>>> >
>>>> >
>>>> > I read that this issue was solved in Heat version 10, but it seems not.
>>>>
>>>> The patch for bug 1732969, which is in Queens, is probably the fix
>>>> you're referring to: https://review.openstack.org/#/c/521170/1
>>>>
>>>> The traceback in that bug included the SQL statement that was failing.
>>>> I'm not sure if that isn't included in your traceback because SQLAlchemy
>>>> didn't report it, or if it's because that traceback is actually from the
>>>> parent stack. If you have a traceback from the original failure in the
>>>> child stack, that would be useful. If there's a way to turn on more
>>>> detailed reporting of errors in SQLAlchemy, that would also be useful.
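
On the SQLAlchemy question: one simple way to get statements logged is
to raise the sqlalchemy.engine logger to INFO, which makes the engine
log every statement it executes:

    import logging

    logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)

In a deployment it is probably easier to set oslo.db's
connection_debug option (an integer from 0 to 100) in the [database]
section of heat.conf, which drives the same logging.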
>>>>
>>>> Since this is a delete, it's possible that we also need retry-on-deadlock
>>>> on the resource_delete() function (though resource_update() is actually
>>>> used more during a delete, to update the status).
>>>>
>>>> - ZB
>>>>
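
A sketch of the retry-on-deadlock Zane suggests for resource_delete(),
assuming oslo.db's wrap_db_retry decorator is the mechanism used (as
it is elsewhere in OpenStack); this is not the actual Heat code, and
the retry parameters are illustrative:

    from oslo_db import api as oslo_db_api

    @oslo_db_api.wrap_db_retry(max_retries=3,
                               retry_on_deadlock=True,
                               retry_interval=0.5,
                               inc_retry_interval=True)
    def resource_delete(context, resource_id):
        # Perform the delete in its own transaction; on a deadlock
        # (i.e. a writeset conflict) the decorator re-runs the call.
        ...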


