queens heat db deadlock
Hi All,
I installed Queens on CentOS 7. Heat seems to work fine with templates I wrote, but when I create a Magnum cluster I often hit a db deadlock in the heat-engine log:

2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource Traceback (most recent call last):
2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 918, in _action_recorder
2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     yield
2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 2035, in delete
2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     *action_args)
2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 346, in wrapper
2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     step = next(subtask)
2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 977, in action_handler_task
2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     done = check(handler_data)
2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 587, in check_delete_complete
2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     return self._check_status_complete(self.DELETE)
2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 454, in _check_status_complete
2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     action=action)
2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource ResourceFailure: resources[0]: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') (Background on this error at: http://sqlalche.me/e/2j85)
2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource
2018-12-13 11:48:46.030 89597 INFO heat.engine.stack [req-9a43184f-77d8-4fad-ab9e-2b9826c10b70 - admin - default default] Stack DELETE FAILED (swarm-clustergp27-ebhsalhb4bop-swarm_primary_master-tzgkh3ncmymw): Resource DELETE failed: resources[0]: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') (Background on this error at: http://sqlalche.me/e/2j85)
2018-12-13 11:48:46.844 89595 INFO heat.engine.resource [req-9a43184f-77d8-4fad-ab9e-2b9826c10b70 - admin - default default] DELETE: ResourceGroup "swarm_primary_master" [6984db3c-3ac1-4afc-901f-21b3e7f230a7] Stack "swarm-clustergp27-ebhsalhb4bop" [b43256fa-52e2-4613-ac15-63fd9340a8be]

I read that this issue was solved in Heat version 10, but it seems it was not.

Regards
Ignazio
On 14/12/18 4:06 AM, Ignazio Cassano wrote:
Hi All, I installed queens on centos 7. Heat seems to work fine with templates I wrote but when I create magnum cluster I often face with db deadlock in heat-engine log:
The stacktrace below is in stack delete, so do you mean that the problem occurs when deleting a Magnum cluster, or can it occur any time with a magnum cluster?
[traceback trimmed -- same as in the original message above]
I read this issue was solved in heat version 10 but seems not.
The patch for bug 1732969, which is in Queens, is probably the fix you're referring to: https://review.openstack.org/#/c/521170/1

The traceback in that bug included the SQL statement that was failing. I'm not sure if that isn't included in your traceback because SQLAlchemy didn't report it, or if it's because that traceback is actually from the parent stack. If you have a traceback from the original failure in the child stack that would be useful. If there's a way to turn on more detailed reporting of errors in SQLAlchemy that would also be useful.

Since this is a delete, it's possible that we need a retry-on-deadlock on the resource_delete() function also (though resource_update() is actually used more during a delete to update the status).

- ZB
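As a rough illustration of that idea, here is a minimal sketch of a retry-on-deadlock using oslo.db's wrap_db_retry decorator, the same mechanism the Queens-era patch applied elsewhere in heat/db/sqlalchemy/api.py. The function body, session handling and models import below are simplified assumptions, not Heat's actual implementation:

    from oslo_db import api as oslo_db_api
    from heat.db.sqlalchemy import models  # module path assumed from the Queens-era layout

    # Hypothetical sketch of a DB-API delete, not Heat's real code.
    @oslo_db_api.wrap_db_retry(max_retries=3, retry_on_deadlock=True)
    def resource_delete(context, resource_id):
        # MySQL error 1213 is surfaced by oslo.db as DBDeadlock, which
        # wrap_db_retry catches and then re-runs the whole function.
        session = context.session
        with session.begin():
            session.query(models.Resource).filter_by(
                id=resource_id).delete(synchronize_session=False)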
Hello Zane,
It also happens when I create a Magnum cluster. My MariaDB cluster is behind HAProxy. Do you need any other info?
Thanks
Ignazio

On Mon, 17 Dec 2018 at 00:28, Zane Bitter <zbitter@redhat.com> wrote:
[quoted text trimmed]
Hi Ignazio,

Do you currently have HAProxy configured to route requests to multiple MariaDB nodes? If so, as a workaround, try doing an active/backup configuration where all but one node is configured as an HAProxy "backup".

Thanks,
Joe
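For reference, a minimal haproxy.cfg sketch of the active/backup arrangement Joe describes; the listener name, bind address and node IPs are placeholders, not taken from Ignazio's deployment:

    # All MariaDB traffic goes to galera1; the other nodes only take over
    # if galera1 fails its health check, so only one Galera node is ever
    # written to at a time.
    listen mariadb-galera
        bind 10.0.0.100:3306
        mode tcp
        option tcpka
        timeout client 90m
        timeout server 90m
        server galera1 10.0.0.11:3306 check
        server galera2 10.0.0.12:3306 check backup
        server galera3 10.0.0.13:3306 check backup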
On Sun, Dec 16, 2018 at 10:46 PM Ignazio Cassano <ignaziocassano@gmail.com> wrote:

[quoted text trimmed]
Yes, I tried it yesterday and this workaround solved the problem.
Thanks
Ignazio

On Mon, 17 Dec 2018 at 20:38, Joe Topjian <joe@topjian.net> wrote:
[quoted text trimmed]
On Tue, Dec 18, 2018 at 12:36 AM Ignazio Cassano <ignaziocassano@gmail.com> wrote:
Yes, I tried on yesterday and this workaround solved. Thanks Ignazio
OK, so that means this "deadlock" is not really a deadlock but a write conflict between two Galera masters. I have a long-term goal to begin relaxing the common requirement that OpenStack apps only refer to one Galera master at a time.

If this is a particular hotspot for Heat (no pun intended), can we pursue adding a transaction retry decorator for this operation? This is the standard approach for other applications that are subject to Galera multi-master writeset conflicts, such as Neutron.
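For what it's worth, one way to check that distinction, assuming you can query the Galera nodes directly: Galera counts multi-master certification conflicts separately from ordinary InnoDB lock waits, so the wsrep status counters show whether these 1213 errors are really cross-node write conflicts rather than true deadlocks (the status variable names are Galera's; the interpretation in the comments is the usual one):

    -- Run on each Galera node. Certification failures are multi-master
    -- write-set conflicts; bf_aborts are local transactions killed by
    -- replication. Both are returned to the client as error 1213.
    SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';
    SHOW GLOBAL STATUS LIKE 'wsrep_local_bf_aborts';

    -- A genuine single-node InnoDB deadlock would instead show up under
    -- "LATEST DETECTED DEADLOCK" in:
    SHOW ENGINE INNODB STATUS;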
[quoted text trimmed]
On 12/18/2018 11:06 AM, Mike Bayer wrote:
OK, so that means this "deadlock" is not really a deadlock but it is a write-conflict between two Galera masters. I have a long term goal to being relaxing this common requirement that Openstack apps only refer to one Galera master at a time. If this is a particular hotspot for Heat (no pun intended) can we pursue adding a transaction retry decorator for this operation? This is the standard approach for other applications that are subject to galera multi-master writeset conflicts such as Neutron.
Correct.

Heat doesn't use SELECT .. FOR UPDATE, does it? That's also a big cause of the aforementioned "deadlocks".

Best,
-jay
[quoted text trimmed]
On 19/12/18 6:49 AM, Jay Pipes wrote:
On 12/18/2018 11:06 AM, Mike Bayer wrote:
OK, so that means this "deadlock" is not really a deadlock but it is a write-conflict between two Galera masters. I have a long term goal to being relaxing this common requirement that Openstack apps only refer to one Galera master at a time. If this is a particular hotspot for Heat (no pun intended) can we pursue adding a transaction retry decorator for this operation? This is the standard approach for other applications that are subject to galera multi-master writeset conflicts such as Neutron.
The weird thing about this issue is that we actually have a retry decorator on the operation that I assume is the problem. It was added in Queens and largely fixed this issue in the gate: https://review.openstack.org/#/c/521170/1/heat/db/sqlalchemy/api.py
Correct.
Heat doesn't use SELECT .. FOR UPDATE does it? That's also a big cause of the aforementioned "deadlocks".
AFAIK, no. In fact we were quite careful to design stuff that is expected to be subject to write contention to use UPDATE ... WHERE (by doing query().filter_by().update() in sqlalchemy), but it turned out to be those very statements that were most prone to causing deadlocks in the gate (i.e. we added retry decorators in those two places and the failures went away), according to me in the commit message for that patch: https://review.openstack.org/521170

Are we Doing It Wrong(TM)?

thanks,
Zane.
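For anyone unfamiliar with the pattern Zane describes, a minimal sketch of that compare-and-swap style UPDATE ... WHERE issued through query().filter_by().update(); the model and columns here are illustrative, not Heat's actual schema:

    from sqlalchemy import Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Resource(Base):  # illustrative model, not heat.db's
        __tablename__ = 'resource'
        id = Column(Integer, primary_key=True)
        engine_id = Column(String(36), nullable=True)

    def try_claim(session, resource_id, engine_id):
        # Emits a single "UPDATE resource SET engine_id = :eid
        # WHERE id = :rid AND engine_id IS NULL" -- no SELECT ... FOR UPDATE.
        # The rowcount tells the caller whether it won the race.
        rows = (session.query(Resource)
                .filter_by(id=resource_id, engine_id=None)
                .update({'engine_id': engine_id},
                        synchronize_session=False))
        session.commit()
        return rows == 1

Under Galera multi-master, two writers doing this against different nodes can both see rowcount 1 locally, and the later one is then rolled back at certification time with error 1213, which is why a retry decorator is still needed on top of this pattern.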
[quoted text trimmed]
On 12/20/2018 02:01 AM, Zane Bitter wrote:
[quoted text trimmed]
Are we Doing It Wrong(TM)?
No, it looks to me like you're doing things correctly.

The OP mentioned that this only happens when deleting a Magnum cluster -- and that it doesn't occur in normal Heat template usage. I wonder (as I really don't know anything about Magnum, unfortunately), is there something different about the Magnum cluster resource handling in Heat that might be causing the wonkiness?

Best,
-jay
On 21/12/18 2:07 AM, Jay Pipes wrote:
[quoted text trimmed]
No, it looks to me like you're doing things correctly. The OP mentioned that this only happens when deleting a Magnum cluster -- and that it doesn't occur in normal Heat template usage.
I wonder (as I really don't know anything about Magnum, unfortunately), is there something different about the Magnum cluster resource handling in Heat that might be causing the wonkiness?
There's no special-casing for Magnum within Heat. It's likely just that there are a lot of resources in a Magnum cluster - or more specifically, a lot of edges in the resource graph, which leads to more write contention (and, in a multi-master setup, more write conflicts). I'd assume that any similarly complex template would have the same issues, and that Ignazio just didn't have anything else that complex to hand.

That gives me an idea, though. I wonder if this would help: https://review.openstack.org/627914

Ignazio, could you possibly test with that ^ patch in multi-master mode to see if it resolves the issue?

cheers,
Zane.
Hello,
I'll try as soon as possible and I will send you a response.
Ignazio

On Wed, 2 Jan 2019 at 07:28, Zane Bitter <zbitter@redhat.com> wrote:
[quoted text trimmed]
Hello Zane,
We applied the patch and modified our HAProxy: unfortunately it does not solve the db deadlock issue.
Ignazio & Gianpiero

On Wed, 2 Jan 2019 at 07:28, Zane Bitter <zbitter@redhat.com> wrote:
[quoted text trimmed]
Hello,
It seems the deadlock happens when Heat creates network objects. I am not sure, but if I remember correctly there were also some db deadlocks in Cinder. In any case, with the workaround Heat never fails with a db deadlock and stack deletes no longer get stuck.
Ignazio

On Tue, 18 Dec 2018 at 17:06, Mike Bayer <mike_mp@zzzcomputing.com> wrote:
[quoted text trimmed]
PS: I will send logs for stack create ASAP.
Ignazio

On Mon, 17 Dec 2018 at 00:28, Zane Bitter <zbitter@redhat.com> wrote:
[quoted text trimmed]
Hello,
Attached is part of the heat-engine log from creating a stack.
Regards
Ignazio

On Mon, 17 Dec 2018 at 00:28, Zane Bitter <zbitter@redhat.com> wrote:
[quoted text trimmed]
participants (5)
- Ignazio Cassano
- Jay Pipes
- Joe Topjian
- Mike Bayer
- Zane Bitter