On 19/12/18 6:49 AM, Jay Pipes wrote:
On 12/18/2018 11:06 AM, Mike Bayer wrote:
On Tue, Dec 18, 2018 at 12:36 AM Ignazio Cassano <ignaziocassano@gmail.com> wrote:
Yes, I tried it yesterday and this workaround solved the problem. Thanks, Ignazio
OK, so that means this "deadlock" is not really a deadlock, but a write conflict between two Galera masters. I have a long-term goal of starting to relax the common requirement that OpenStack apps refer to only one Galera master at a time. If this is a particular hotspot for Heat (no pun intended), can we pursue adding a transaction retry decorator for this operation? This is the standard approach for other applications that are subject to Galera multi-master writeset conflicts, such as Neutron.
The weird thing about this issue is that we actually have a retry decorator on the operation that I assume is the problem. It was added in Queens and largely fixed this issue in the gate: https://review.openstack.org/#/c/521170/1/heat/db/sqlalchemy/api.py
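For anyone not familiar with the pattern, the oslo.db version of such a decorator looks roughly like the sketch below. The argument values and the function body here are illustrative only; see the linked review for what Heat actually does:

    from oslo_db import api as oslo_db_api

    # On a DBDeadlock (how oslo.db surfaces MySQL error 1213, which is
    # also what a Galera writeset conflict looks like to the client),
    # the whole function is retried with backoff instead of the error
    # propagating up and failing the resource action.
    @oslo_db_api.wrap_db_retry(max_retries=3, retry_on_deadlock=True)
    def resource_update(context, resource_id, values, atomic_key):
        ...  # perform the UPDATE in a fresh transaction on each attempt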
Correct.
Heat doesn't use SELECT ... FOR UPDATE, does it? That's also a big cause of the aforementioned "deadlocks".
AFAIK, no. In fact, we were quite careful to design the things that are expected to be subject to write contention to use UPDATE ... WHERE (by doing query().filter_by().update() in SQLAlchemy). But it turned out to be those very statements that were most prone to causing deadlocks in the gate (i.e. we added retry decorators in those two places and the failures went away), at least according to what I wrote in the commit message for that patch: https://review.openstack.org/521170

Are we Doing It Wrong(TM)?

thanks,
Zane
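For context, the UPDATE ... WHERE pattern being described looks roughly like the following self-contained sketch. The model and column names are made up for illustration and are not Heat's actual schema:

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class Resource(Base):
        # Hypothetical model for illustration only.
        __tablename__ = 'resource'
        id = Column(Integer, primary_key=True)
        status = Column(String(255))
        atomic_key = Column(Integer)

    engine = create_engine('sqlite://')
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()

    # Compare-and-swap in a single statement: this emits
    # "UPDATE resource SET ... WHERE id = ? AND atomic_key = ?"
    # rather than a SELECT ... FOR UPDATE followed by an UPDATE,
    # so no row lock is held between a read and a write.
    rows = session.query(Resource).filter_by(id=1, atomic_key=0).update(
        {'status': 'COMPLETE', 'atomic_key': 1},
        synchronize_session=False)
    session.commit()
    # rows == 0 means a concurrent writer changed the row first.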
Best,
-jay
On Mon, 17 Dec 2018 at 20:38, Joe Topjian <joe@topjian.net> wrote:
Hi Ignazio,
Do you currently have HAProxy configured to route requests to multiple MariaDB nodes? If so, as a workaround, try an active/backup configuration where all but one node is configured as an HAProxy "backup".
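In haproxy.cfg terms that means something like the following sketch (the addresses are placeholders and the health-check options will vary by deployment):

    listen mariadb
        bind 0.0.0.0:3306
        mode tcp
        # Only db1 receives traffic; db2/db3 take over solely if db1
        # fails its health check, so writes only ever reach one
        # Galera node at a time.
        server db1 192.168.0.11:3306 check
        server db2 192.168.0.12:3306 check backup
        server db3 192.168.0.13:3306 check backup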
Thanks, Joe
On Sun, Dec 16, 2018 at 10:46 PM Ignazio Cassano <ignaziocassano@gmail.com> wrote:
Hello Zane, it also happens when I create a Magnum cluster. My MariaDB cluster is behind HAProxy. Do you need any other info? Thanks, Ignazio
On Mon, 17 Dec 2018 at 00:28, Zane Bitter <zbitter@redhat.com> wrote:
On 14/12/18 4:06 AM, Ignazio Cassano wrote:
> Hi All,
> I installed Queens on CentOS 7. Heat seems to work fine with the
> templates I wrote, but when I create a Magnum cluster I often run
> into a DB deadlock in the heat-engine log:
The stack trace below is from a stack delete, so do you mean that the problem occurs when deleting a Magnum cluster, or can it occur at any time with a Magnum cluster?
> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource Traceback (most recent call last):
> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 918, in _action_recorder
> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     yield
> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 2035, in delete
> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     *action_args)
> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 346, in wrapper
> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     step = next(subtask)
> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 977, in action_handler_task
> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     done = check(handler_data)
> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 587, in check_delete_complete
> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     return self._check_status_complete(self.DELETE)
> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 454, in _check_status_complete
> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource     action=action)
> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource ResourceFailure: resources[0]: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') (Background on this error at: http://sqlalche.me/e/2j85)
> 2018-12-13 11:48:46.016 89597 ERROR heat.engine.resource
> 2018-12-13 11:48:46.030 89597 INFO heat.engine.stack [req-9a43184f-77d8-4fad-ab9e-2b9826c10b70 - admin - default default] Stack DELETE FAILED (swarm-clustergp27-ebhsalhb4bop-swarm_primary_master-tzgkh3ncmymw): Resource DELETE failed: resources[0]: (pymysql.err.InternalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') (Background on this error at: http://sqlalche.me/e/2j85)
> 2018-12-13 11:48:46.844 89595 INFO heat.engine.resource [req-9a43184f-77d8-4fad-ab9e-2b9826c10b70 - admin - default default] DELETE: ResourceGroup "swarm_primary_master" [6984db3c-3ac1-4afc-901f-21b3e7f230a7] Stack "swarm-clustergp27-ebhsalhb4bop" [b43256fa-52e2-4613-ac15-63fd9340a8be]
>
> I read this issue was solved in Heat version 10, but it seems it was not.
The patch for bug 1732969, which is in Queens, is probably the fix you're referring to: https://review.openstack.org/#/c/521170/1
The traceback in that bug included the SQL statement that was failing. I'm not sure whether it isn't included in your traceback because SQLAlchemy didn't report it, or because that traceback is actually from the parent stack. If you have a traceback from the original failure in the child stack, that would be useful. If there's a way to turn on more detailed reporting of errors in SQLAlchemy, that would also be useful.
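(One generic way to see every statement SQLAlchemy emits, assuming you can adjust the service's logging setup, is to raise the level of SQLAlchemy's engine logger. This is plain SQLAlchemy logging, not a Heat-specific switch:)

    import logging

    # At INFO, the 'sqlalchemy.engine' logger records each SQL
    # statement and its parameters; DEBUG would also log result rows.
    logging.basicConfig()
    logging.getLogger('sqlalchemy.engine').setLevel(logging.INFO)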
Since this is a delete, it's possible that we also need retry-on-deadlock on the resource_delete() function (though resource_update() is actually used more during a delete, to update the status).
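That would presumably mean the same decorator treatment as in the earlier sketch, applied to the delete path, e.g. (untested, arguments illustrative):

    from oslo_db import api as oslo_db_api

    @oslo_db_api.wrap_db_retry(max_retries=3, retry_on_deadlock=True)
    def resource_delete(context, resource_id):
        ...  # delete the resource row in its own transaction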
- ZB