[cinder] Deadlock found when trying to get lock; try restarting transaction
Hi,

in a customer cluster (Victoria, Galera cluster on 3 control nodes) we're seeing failing pipeline deployments from time to time when cinder is instructed to create multiple volumes at once. This is the error message:

---snip---
2024-08-12 15:01:34.762 33307 WARNING oslo_db.sqlalchemy.exc_filters [req-aa5505d3-167a-4096-9311-36b10deebcc1 049f5ea05bd14c019aeab37d3cff4ffc ed22c592548e4903b9af541bb158c6fe - - -] DB exception wrapped.: sqlalchemy.exc.ResourceClosedError: This Connection is closed
...
2024-08-12 15:01:34.762 33307 ERROR oslo_db.sqlalchemy.exc_filters pymysql.err.InternalError: (1213, 'Deadlock found when trying to get lock; try restarting transaction')
...
2024-08-12 15:01:34.766 33307 ERROR oslo_messaging.rpc.server   File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 476, in _revalidate_connection
2024-08-12 15:01:34.766 33307 ERROR oslo_messaging.rpc.server     raise exc.ResourceClosedError("This Connection is closed")
2024-08-12 15:01:34.766 33307 ERROR oslo_messaging.rpc.server sqlalchemy.exc.DBAPIError: (sqlalchemy.exc.ResourceClosedError) This Connection is closed
2024-08-12 15:01:34.766 33307 ERROR oslo_messaging.rpc.server (Background on this error at: http://sqlalche.me/e/13/dbapi)
---snip---

I found this bug [1] with a fix for Pike, so Victoria already has that fix, but the error still blocks some deployments, leaving volumes in "creating" state which have to be cleaned up manually. I can't find much else on this, am I missing something? Any pointers would be highly appreciated!

Thanks!
Eugen

[1] https://bugs.launchpad.net/cinder/+bug/1789106
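The usual remedy for MySQL error 1213 is to re-run the failed transaction, which is what oslo.db's wrap_db_retry(retry_on_deadlock=True) decorator is meant to automate. Below is a minimal, self-contained sketch of that pattern using plain pymysql (the driver in the traceback above); the connection parameters and the table/column names are placeholders, not the actual cinder code:

# Sketch only: retry a transaction that loses a deadlock (MySQL error 1213).
import time

import pymysql

ER_LOCK_DEADLOCK = 1213  # "Deadlock found when trying to get lock; try restarting transaction"

def update_volume_status(volume_id, status, max_retries=5):
    conn = pymysql.connect(host="controller-vip", user="cinder",
                           password="secret", database="cinder")
    try:
        for attempt in range(1, max_retries + 1):
            try:
                with conn.cursor() as cur:
                    cur.execute("UPDATE volumes SET status = %s WHERE id = %s",
                                (status, volume_id))
                conn.commit()
                return
            except (pymysql.err.InternalError,
                    pymysql.err.OperationalError) as exc:
                # The server rolls back the deadlock victim; re-running the
                # transaction is the documented remedy, so back off and retry.
                if exc.args[0] != ER_LOCK_DEADLOCK or attempt == max_retries:
                    raise
                conn.rollback()
                time.sleep(0.2 * attempt)
    finally:
        conn.close()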
Just one more note: I see the deadlock messages for all cinder services (cinder-api, cinder-scheduler, cinder-backup, which isn't even in use, and cinder-volume). The nova-api logs contain those deadlock messages as well, so this might be a MariaDB/Galera issue rather than something cinder-specific? I'm not sure yet, I'll try to find out more.
Hi,

Are you sending all the write requests to the same node? If not, you should; otherwise you will inevitably hit the case where two write requests land on two different Galera nodes at the same time, which is what causes these deadlock errors.

--
PS
-- Pierre-Samuel Le Stang
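"All writes to one node" is usually enforced at the load balancer rather than inside the OpenStack services themselves. A sketch of what that can look like with HAProxy in front of the Galera cluster; the addresses, server names and check user below are placeholders, and many deployment tools already manage an equivalent section for you:

# haproxy.cfg excerpt (illustrative only): controller1 takes all MySQL traffic,
# the other two cluster members only act as failover targets.
listen galera
    bind 192.0.2.10:3306
    option mysql-check user haproxy_check
    server controller1 192.0.2.11:3306 check
    server controller2 192.0.2.12:3306 check backup
    server controller3 192.0.2.13:3306 check backup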
Oh, I think you're right. I did some tests when they had severe SQL issues, but I might have forgotten to turn the load balancing off again once performance was restored. Thanks for your quick response, I'll check it later.
On Mon, 2024-08-12 at 16:58 +0000, Eugen Block wrote:
> Oh I think you're right, I did some tests when they had severe SQL issues but I might have forgotten to turn the load balancing off when the performance was restored. Thanks for your quick response, I'll check it later.

OpenStack in general cannot run against Galera in active-active mode: the replication between Galera nodes happens asynchronously, so the OpenStack services can receive stale reads, which can result in duplicate allocations or DB deadlocks. As such, we don't officially support that topology.

You might be able to mask this by setting mysql_wsrep_sync_wait = 1 (the Galera wsrep_sync_wait setting) to force a sync on every read, but that has performance implications. We considered doing this when writing our new downstream installer tool and decided we could not support the active-active topology safely in production:

https://github.com/openstack-k8s-operators/nova-operator/commit/ab95f150cbec...

There is some context there from when we were evaluating that as a workaround, but we reverted it and went with an active-passive mode instead.

If you are using Galera, please ensure that writes only go to one Galera instance, or you will fundamentally violate our implicit DB requirement that ACID is not broken. Galera breaks the C(onsistency) element when used in active-active mode, as we cannot rely on atomic transactions being consistent across reads on other cluster members. That makes it unsupported from an oslo.db/OpenStack service perspective.
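For completeness, the Galera server variable behind that suggestion is wsrep_sync_wait. A sketch of where it would be set if you did want to experiment with it; the file path is illustrative and varies by distribution, and per the above it is a read-latency trade-off, not a supported substitute for sending writes to a single node:

# /etc/my.cnf.d/galera.cnf (path is illustrative)
[mysqld]
# Bitmask; bit 1 makes reads wait until the node has applied all write-sets
# it has already received ("read your writes"), at the cost of added read latency.
wsrep_sync_wait = 1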
Yes, of course, I totally understand. I just turned it back (it was never supposed to be a persistent setting). Thanks for the details, Sean!

Thanks!
Eugen
participants (3):
- Eugen Block
- Pierre-Samuel LE STANG
- smooney@redhat.com