[mariadb][keystone][blazar] Deadlock error on Galera cluster
Hi community o/

Suddenly, I noticed that login on the dashboard wasn't working, so I checked the Keystone logs. I saw some entries like this:

    2025-07-10 00:53:45.120 23 ERROR sqlalchemy.pool.impl.QueuePool oslo_db.exception.DBDeadlock: (pymysql.err.OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction')

Nova-conductor was also showing messages like these.

At the same time, blazar-manager was logging the following after every _process_event call:

    ERROR sqlalchemy.exc.TimeoutError: QueuePool limit of size 1 overflow 50 reached, connection timed out, timeout 30.00

I don't know whether this is related to the _process_event method, which processes events concurrently in batches [1], while the DB was locked. I increased max_pool_size from 1 to 2 on blazar-manager and that message stopped showing. Does anyone know what can cause this kind of deadlock in general?

I'd like to ask another question as well: I have three controllers in my environment. Why do controllers that do not hold the active VIP address have established connections on the MySQL port? For example:

    tcp ESTAB 0 0 10.0.0.105:3306 10.0.0.107:50642 users:(("mariadbd",pid=977010,fd=1089))

[1] https://github.com/openstack/blazar/blob/stable/2024.1/blazar/manager/servic...

Best regards.
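For reference, the pool setting mentioned above is an oslo.db option, so it would normally be adjusted in the [database] section of blazar.conf. A minimal sketch with illustrative values (the exact file location depends on the deployment):

    [database]
    # Persistent SQLAlchemy connection pool size; the error above
    # ("QueuePool limit of size 1 overflow 50 reached") points at
    # max_pool_size=1 with max_overflow=50.
    max_pool_size = 2
    # Extra connections allowed on top of the pool before the
    # "QueuePool limit ... reached" error is raised.
    max_overflow = 50
    # Seconds to wait for a free connection before TimeoutError.
    pool_timeout = 30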
Hello,

Not sure if this is the root cause of your problem, but it could be:

- https://mariadb.com/resources/blog/isolation-level-violation-testing-and-deb...
- https://jira.mariadb.org/browse/MDEV-35124

At the end of last week we noticed strange behaviours due to this new transactional model, see https://bugs.launchpad.net/nova/+bug/2116186/.
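In case it helps, checking and (if needed) disabling that behaviour looks roughly like this, assuming a MariaDB build that ships the innodb_snapshot_isolation variable:

    -- Check whether the variable exists and is enabled on this node
    SHOW GLOBAL VARIABLES LIKE 'innodb_snapshot_isolation';

    -- It is dynamic, so it can be switched off at runtime; make it
    -- persistent under [mysqld] in galera.cnf / my.cnf if required.
    SET GLOBAL innodb_snapshot_isolation = OFF;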
--
Hervé Beraud
Principal Software Engineer at Red Hat
irc: hberaud
https://github.com/4383/
Hi Herve, thanks for your response.

Since I'm using the Antelope release, the MariaDB version is slightly older, and checking this variable shows it is turned off:

    $ docker exec -it mariadb mariadb -V
    mariadb Ver 15.1 Distrib 10.11.11-MariaDB, for debian-linux-gnu (x86_64) using EditLine wrapper

    MariaDB [(none)]> show session variables like '%snapshot%';
    +---------------------------+-------+
    | Variable_name             | Value |
    +---------------------------+-------+
    | innodb_snapshot_isolation | OFF   |
    +---------------------------+-------+

I don't think this is the root cause of the problem, but thank you for the reply! I'm still digging into the problem to find a possible root cause.

Regards.
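One way to keep digging the next time the lock wait timeouts show up is to look at which transaction is holding the blocking lock while a query is stuck. A sketch, assuming the InnoDB information_schema tables that MariaDB 10.11 still ships (PROCESS privilege required):

    -- Show who is waiting on whom while a 'Lock wait' is in progress
    SELECT
        r.trx_mysql_thread_id AS waiting_thread,
        r.trx_query           AS waiting_query,
        b.trx_mysql_thread_id AS blocking_thread,
        b.trx_query           AS blocking_query,
        b.trx_started         AS blocking_trx_started
    FROM information_schema.INNODB_LOCK_WAITS w
    JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
    JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id;

    -- The 'LATEST DETECTED DEADLOCK' section here is also worth reading
    SHOW ENGINE INNODB STATUS;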
And it happened again... Does anyone have a clue how to troubleshoot this?

Apparently, the MariaDB service got locked; nova-conductor shows this in the logs:

    2025-07-17 18:28:07.567 30 ERROR nova.servicegroup.drivers.db pymysql.err.OperationalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')

This time only the Nova service was affected; logging in to the dashboard with Keystone was all good.

Regards.
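A quick first look while it is locked up can come from the database itself, for example (assuming the kolla-style "mariadb" container name used earlier in the thread):

    # Snapshot of running queries, to spot long 'Lock wait' entries and
    # the long-running transaction that is blocking them
    docker exec -it mariadb mariadb -e "SHOW FULL PROCESSLIST;"

    # Galera health on each controller: cluster size, flow control, etc.
    docker exec -it mariadb mariadb -e "SHOW GLOBAL STATUS LIKE 'wsrep_%';"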
We had a similar issue in our environment, and eventually it turned out to be caused by an aggressive nova archive-and-purge job and the number of rows being moved to the shadow tables. We reduced the number of rows to be archived and purged per run, and we also increased innodb_lock_wait_timeout to 90 seconds.

To troubleshoot and find which transaction is causing the DB lock, set "innodb_print_all_deadlocks = ON" in galera.cnf.

After that was tackled, we got rid of the DB lock wait timeouts.

Regards,
Venkat
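For reference, a sketch of the two knobs mentioned above plus a batched archive run; the option names are standard MariaDB / nova-manage ones, but the values and paths should be checked against your own deployment:

    # galera.cnf (or my.cnf), [mysqld] section
    [mysqld]
    # Allow transactions to wait longer before error 1205 is raised
    # (the default is 50 seconds)
    innodb_lock_wait_timeout = 90
    # Write every detected deadlock to the error log for later analysis
    innodb_print_all_deadlocks = ON

    # Archive deleted rows in small batches so each transaction holds
    # row locks only briefly
    nova-manage db archive_deleted_rows --max_rows 1000 --until-complete --all-cells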
participants (3)
- Herve Beraud
- venkatakrishnan ar
- Winicius Allan