Oh, I think you’re right. I ran some tests when they had severe SQL issues, but I might have forgotten to turn the load balancing off once performance was restored. Thanks for your quick response, I’ll check it later.
Quoting Pierre-Samuel LE STANG <pierre-samuel.le-stang@ovhcloud.com>:
Hi,
Are you sending all write requests to the same node? If not, you should; otherwise you will inevitably hit the case where two write requests arrive on two different nodes at the same time, which causes deadlock issues.
-- PS
Eugen Block <eblock@nde.ag> wrote on Mon. [2024-Aug-12 14:15:47 +0000]:
Just one more note: I see the deadlock messages for all cinder services: cinder-api, cinder-scheduler, cinder-backup (which isn't even in use) and cinder-volume. nova-api contains those deadlock messages as well, so this might be a MariaDB/Galera issue? I'm not sure yet, I'll try to find out more.
Quoting Eugen Block <eblock@nde.ag>:
Hi,
in a customer cluster (Victoria, Galera cluster on 3 control nodes) we're seeing failing pipeline deployments from time to time when cinder is instructed to create multiple volumes at once. This is the error message:
---snip---
2024-08-12 15:01:34.762 33307 WARNING oslo_db.sqlalchemy.exc_filters [req-aa5505d3-167a-4096-9311-36b10deebcc1 049f5ea05bd14c019aeab37d3cff4ffc ed22c592548e4903b9af541bb158c6fe - - -] DB exception wrapped.: sqlalchemy.exc.ResourceClosedError: This Connection is closed
...
2024-08-12 15:01:34.762 33307 ERROR oslo_db.sqlalchemy.exc_filters pymysql.err.InternalError: (1213, 'Deadlock found when trying to get lock; try restarting transaction')
...
2024-08-12 15:01:34.766 33307 ERROR oslo_messaging.rpc.server File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 476, in _revalidate_connection
2024-08-12 15:01:34.766 33307 ERROR oslo_messaging.rpc.server raise exc.ResourceClosedError("This Connection is closed")
2024-08-12 15:01:34.766 33307 ERROR oslo_messaging.rpc.server sqlalchemy.exc.DBAPIError: (sqlalchemy.exc.ResourceClosedError) This Connection is closed
2024-08-12 15:01:34.766 33307 ERROR oslo_messaging.rpc.server (Background on this error at: http://sqlalche.me/e/13/dbapi)
---snip---
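For reference, the 1213 message in the log above says to try restarting the transaction. Here is a rough sketch of what a client-side retry could look like, assuming the driver exception carries the MySQL error code as its first argument (as the pymysql error in the log appears to). The names `retry_on_deadlock` and `is_deadlock` are made up for illustration. This is not how cinder or oslo.db implement their retries, and it only papers over occasional conflicts rather than removing their cause.

```python
import functools
import time

MYSQL_DEADLOCK = 1213  # error code taken from the log excerpt above


def is_deadlock(exc):
    """Return True if exc looks like MySQL/MariaDB error 1213.

    Assumption: the error code is the first element of exc.args.
    """
    args = getattr(exc, "args", ())
    return bool(args) and args[0] == MYSQL_DEADLOCK


def retry_on_deadlock(max_retries=5, delay=0.1):
    """Hypothetical decorator: re-run a transaction that hit error 1213."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Re-raise anything that is not a deadlock, and give up
                    # once the retry budget is exhausted.
                    if not is_deadlock(exc) or attempt == max_retries:
                        raise
                    # Back off exponentially before restarting.
                    time.sleep(delay * (2 ** attempt))
        return wrapper
    return decorator
```

The decorated function has to contain the whole transaction (begin, statements, commit), since a restart only makes sense from the beginning.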
I found this bug [1] with a fix for Pike, so Victoria already has that fix, but the error still blocks some deployments, leaving volumes in "creating" state which have to be cleaned up manually. I can't find much else on this, am I missing something? Any pointers would be highly appreciated!
Thanks! Eugen
-- Pierre-Samuel Le Stang