queens heat db deadlock

Jay Pipes jaypipes at gmail.com
Tue Dec 18 17:49:42 UTC 2018


On 12/18/2018 11:06 AM, Mike Bayer wrote:
> On Tue, Dec 18, 2018 at 12:36 AM Ignazio Cassano
> <ignaziocassano at gmail.com> wrote:
>>
>> Yes, I tried it yesterday and this workaround solved the problem.
>> Thanks
>> Ignazio
> 
> OK, so that means this "deadlock" is not really a deadlock but a
> write conflict between two Galera masters. I have a long-term goal
> to begin relaxing the common requirement that OpenStack apps only
> refer to one Galera master at a time. If this is a particular
> hotspot for Heat (no pun intended), can we pursue adding a
> transaction retry decorator for this operation? This is the standard
> approach for other applications that are subject to Galera
> multi-master writeset conflicts, such as Neutron.
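> 
> A minimal sketch of what I have in mind, using the oslo.db helper
> that Neutron already relies on (the function name and arguments here
> are illustrative, not actual Heat code):
> 
>     from oslo_db import api as oslo_db_api
> 
>     @oslo_db_api.wrap_db_retry(max_retries=3, retry_on_deadlock=True,
>                                retry_interval=0.5,
>                                inc_retry_interval=True)
>     def resource_update(context, resource_id, values):
>         # On a Galera writeset conflict (surfaced to the client as
>         # MySQL error 1213) the decorator re-runs the whole function
>         # with increasing back-off instead of failing the stack action.
>         ...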

Correct.

Heat doesn't use SELECT .. FOR UPDATE, does it? That's also a big cause 
of the aforementioned "deadlocks".
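
For anyone following along, the difference looks like this (illustrative
SQLAlchemy, not actual Heat code; assume session is an active Session
and models.Resource is a mapped class). Galera takes no cluster-wide row
locks, so the lock acquired by the first form on one node is invisible
to the other nodes, and the conflict only surfaces at certification time
as error 1213:

    # locking read: a "deadlock" magnet under Galera multi-master
    res = session.query(models.Resource).filter_by(
        id=resource_id).with_for_update().one()
    res.status = 'DELETE_IN_PROGRESS'

    # compare-and-swap: write only if the row is still in the expected
    # state, then retry (or bail out) when zero rows matched
    count = session.query(models.Resource).filter_by(
        id=resource_id, status='AVAILABLE').update(
        {'status': 'DELETE_IN_PROGRESS'})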

Best,
-jay

>>
>> On Mon, Dec 17, 2018 at 20:38 Joe Topjian <joe at topjian.net> wrote:
>>>
>>> Hi Ignazio,
>>>
>>> Do you currently have HAProxy configured to route requests to multiple MariaDB nodes? If so, as a workaround, try an active/backup configuration where all but one node is configured as an HAProxy "backup".
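>>>
>>> Something like this, where only db1 takes writes until it fails over
>>> (the backend name and addresses are just examples):
>>>
>>>     backend galera
>>>         option mysql-check user haproxy
>>>         server db1 10.0.0.11:3306 check
>>>         server db2 10.0.0.12:3306 check backup
>>>         server db3 10.0.0.13:3306 check backup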
>>>
>>> Thanks,
>>> Joe
>>>
>>>
>>>
>>> On Sun, Dec 16, 2018 at 10:46 PM Ignazio Cassano <ignaziocassano at gmail.com> wrote:
>>>>
>>>> Hello Zane, it also happens when I create a Magnum cluster.
>>>> My MariaDB cluster is behind HAProxy.
>>>> Do you need any other info?
>>>> Thanks
>>>> Ignazio
>>>>
>>>>
>>>> On Mon, Dec 17, 2018 at 00:28 Zane Bitter <zbitter at redhat.com> wrote:
>>>>>
>>>>> On 14/12/18 4:06 AM, Ignazio Cassano wrote:
>>>>>> Hi All,
>>>>>> I installed Queens on CentOS 7.
>>>>>> Heat seems to work fine with the templates I wrote, but when I
>>>>>> create a Magnum cluster I often hit a DB deadlock in the
>>>>>> heat-engine log:
>>>>>
>>>>> The stacktrace below is from a stack delete, so do you mean that the
>>>>> problem occurs when deleting a Magnum cluster, or can it occur at any
>>>>> time with a Magnum cluster?
>>>>>
>>>>>> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource Traceback (most recent call last):
>>>>>> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 918, in _action_recorder
>>>>>> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     yield
>>>>>> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 2035, in delete
>>>>>> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     *action_args)
>>>>>> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 346, in wrapper
>>>>>> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     step = next(subtask)
>>>>>> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 977, in action_handler_task
>>>>>> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     done = check(handler_data)
>>>>>> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 587, in check_delete_complete
>>>>>> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     return self._check_status_complete(self.DELETE)
>>>>>> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 454, in _check_status_complete
>>>>>> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     action=action)
>>>>>> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource ResourceFailure: resources[0]: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') (Background on this error at: http://sqlalche.me/e/2j85)
>>>>>> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource
>>>>>> 2018-12-13 11:48:46.030 89597 INFO heat.engine.stack [req-9a43184f-77d8-4fad-ab9e-2b9826c10b70 - admin - default default] Stack DELETE FAILED (swarm-clustergp27-ebhsalhb4bop-swarm_primary_master-tzgkh3ncmymw): Resource DELETE failed: resources[0]: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') (Background on this error at: http://sqlalche.me/e/2j85)
>>>>>> 2018-12-13 11:48:46.844 89595 INFO heat.engine.resource [req-9a43184f-77d8-4fad-ab9e-2b9826c10b70 - admin - default default] DELETE: ResourceGroup "swarm_primary_master" [6984db3c-3ac1-4afc-901f-21b3e7f230a7] Stack "swarm-clustergp27-ebhsalhb4bop" [b43256fa-52e2-4613-ac15-63fd9340a8be]
>>>>>>
>>>>>>
>>>>>> I read that this issue was solved in Heat version 10, but it seems not.
>>>>>
>>>>> The patch for bug 1732969, which is in Queens, is probably the fix
>>>>> you're referring to: https://review.openstack.org/#/c/521170/1
>>>>>
>>>>> The traceback in that bug included the SQL statement that was failing.
>>>>> I'm not sure whether it's missing from your traceback because
>>>>> SQLAlchemy didn't report it, or because that traceback is actually
>>>>> from the parent stack. If you have a traceback from the original
>>>>> failure in the child stack, that would be useful. If there's a way to
>>>>> turn on more detailed reporting of errors in SQLAlchemy, that would
>>>>> also be useful.
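>>>>>
>>>>> (oslo.db does have a knob for the latter, if I remember right - in
>>>>> heat.conf:
>>>>>
>>>>>     [database]
>>>>>     # 0 = no SQL logging, 50 = statements, 100 = statements + results
>>>>>     connection_debug = 50
>>>>>
>>>>> should make heat-engine log every statement it issues.)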
>>>>>
>>>>> Since this is a delete, it's possible that we need a retry-on-deadlock
>>>>> on the resource_delete() function also (though resource_update() is
>>>>> actually used more during a delete to update the status).
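>>>>>
>>>>> i.e. extending the same decorator treatment that patch applied, along
>>>>> these lines (a sketch - the real signature in
>>>>> heat/db/sqlalchemy/api.py may differ):
>>>>>
>>>>>     @oslo_db_api.wrap_db_retry(max_retries=3, retry_on_deadlock=True)
>>>>>     def resource_delete(context, resource_id):
>>>>>         ...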
>>>>>
>>>>> - ZB
>>>>>
> 


