I'm experimenting with Galera in my Rocky openstack-ansible dev cluster, and I'm finding that the default haproxy config values don't seem to work. Finding the correct values is a lot of work. For example, I spent this morning experimenting with different values for "timeout client" in /etc/haproxy/haproxy.cfg. The default is 1m, and with the default set I see this error in /var/log/nova/nova-scheduler.log on the controllers:

2020-01-17 13:54:26.059 443358 ERROR oslo_db.sqlalchemy.engines DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)

There are several timeout values in /etc/haproxy/haproxy.cfg. These are the values we started with:

stats timeout 30s
timeout http-request 10s
timeout queue 1m
timeout connect 10s
timeout client 1m
timeout server 1m
timeout check 10s

At first I changed them all to 30m. This stopped the "Lost connection" error in nova-scheduler.log. Then, one at a time, I changed them back to the default. When I got to "timeout client" I found that setting it back to 1m caused the errors to start again. I changed it back and forth and found that 4 minutes causes errors and 6m stops them, so I left it at 6m.

These are my active variables:

root@us01odc-dev2-ctrl1:/etc/mysql# mysql -e 'show variables;'|grep timeout
connect_timeout 20
deadlock_timeout_long 50000000
deadlock_timeout_short 10000
delayed_insert_timeout 300
idle_readonly_transaction_timeout 0
idle_transaction_timeout 0
idle_write_transaction_timeout 0
innodb_flush_log_at_timeout 1
innodb_lock_wait_timeout 50
innodb_rollback_on_timeout OFF
interactive_timeout 28800
lock_wait_timeout 86400
net_read_timeout 30
net_write_timeout 60
rpl_semi_sync_master_timeout 10000
rpl_semi_sync_slave_kill_conn_timeout 5
slave_net_timeout 60
thread_pool_idle_timeout 60
wait_timeout 3600

So it looks like the value of "timeout client" in haproxy.cfg needs to match or exceed the value of "wait_timeout" in mysql. Also in nova.conf I see "#connection_recycle_time = 3600" - I need to experiment to see how that value interacts with the timeouts in the other config files.

Is this the best way to find the correct config values? It seems like there should be a document that talks about these timeouts and how to set them (or maybe more generally how the different timeout settings in the various config files interact). Does that document exist? If not, maybe I could write one, since I have to figure out the correct values anyway.
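For what it's worth, the ordering rule being felt out here (oslo.db recycles pooled connections, which must happen before MySQL's wait_timeout or haproxy's idle timeouts can cut them) can be written down as a quick sanity check. This is a hypothetical helper, not part of any OpenStack tooling:

```python
# Hypothetical sanity check, not part of OpenStack. All values in seconds.
# Encodes: connection_recycle_time <= wait_timeout, and haproxy's idle
# timeouts must be >= connection_recycle_time so haproxy never cuts a
# pooled connection that oslo.db still considers live.

def check_timeout_ordering(connection_recycle_time, wait_timeout,
                           haproxy_client_timeout, haproxy_server_timeout):
    """Return a list of problems; an empty list means the values are consistent."""
    problems = []
    if connection_recycle_time > wait_timeout:
        problems.append("connection_recycle_time exceeds mysql wait_timeout")
    if haproxy_client_timeout < connection_recycle_time:
        problems.append("haproxy 'timeout client' is shorter than connection_recycle_time")
    if haproxy_server_timeout < connection_recycle_time:
        problems.append("haproxy 'timeout server' is shorter than connection_recycle_time")
    return problems

# The values from this thread: recycle and wait_timeout at 3600s, but
# haproxy client/server at the 1m default -- two problems reported.
print(check_timeout_ordering(3600, 3600, 60, 60))
```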
Hi, I'm pretty sure you'll have to figure it out yourself. I always found the deployment guides quite good; I got my cloud running without major issues. But when it comes to HA configuration the guide is missing a lot of information. I had to figure out many details on my own, though haproxy is currently not in use here.
So it looks like the value of "timeout client" in haproxy.cfg needs to match or exceed the value of "wait_timeout" in mysql.
Although I'm not entirely sure, I tend to agree with you. Dealing with a Ceph RGW deployment I encountered a similar issue and had to increase some timeout values to get it working. I'm convinced that many people would appreciate it if you created a doc for haproxy.

Regards,
Eugen

Quoting Albert Braden <Albert.Braden@synopsys.com>:
On Fri, Jan 17, 2020 at 5:20 PM Albert Braden <Albert.Braden@synopsys.com> wrote:
Is your cluster pretty idle? I've never seen that happen in any environments before...

--
Mohammed Naser — vexxhost
D. 514-316-8872
D. 800-910-1726 ext. 200
E. mnaser@vexxhost.com
W. https://vexxhost.com
I can share our haproxy settings on Monday, but you need to make sure that the haproxy timeouts at least match the Oslo config, which I believe is 3600s. In theory, though, something like keepalived is better for Galera.

Btw, pretty sure both client and server need 3600s. Basically OpenStack recycles the connection every hour by default, so you need to make sure that haproxy does not close it before that if it's idle.

Sent from my iPhone
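A sketch of what those haproxy.cfg lines might look like, for reference; the bind address and server names below are illustrative, not taken from this thread, and only the two timeout lines are the point:

```cfg
# Illustrative sketch only. The timeout client/server values are set to
# match oslo.db's default connection_recycle_time of 3600s, so haproxy
# does not drop an idle connection before OpenStack recycles it.
listen galera
    bind 192.168.0.10:3306
    balance leastconn
    option tcpka
    timeout client 3600s
    timeout server 3600s
    server galera1 192.168.0.11:3306 check
    server galera2 192.168.0.12:3306 check backup
    server galera3 192.168.0.13:3306 check backup
```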
On Jan 17, 2020, at 7:24 PM, Mohammed Naser <mnaser@vexxhost.com> wrote:
On Fri, Jan 17, 2020 at 10:37 PM Erik Olof Gunnar Andersson <eandersson@blizzard.com> wrote:
Indeed, this adds up to what we do in OSA https://opendev.org/openstack/openstack-ansible/src/branch/master/inventory/...
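On the oslo.db side, the matching knob lives in the [database] section of each service's config (nova.conf and the rest). A minimal illustration, using the documented oslo.db default that the commented-out line in nova.conf refers to:

```ini
[database]
# oslo.db recycles pooled SQLAlchemy connections after this many seconds.
# Keep haproxy's "timeout client"/"timeout server" at or above this value
# so idle connections survive until they are recycled.
connection_recycle_time = 3600
```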
--
Mohammed Naser — vexxhost
D. 514-316-8872
D. 800-910-1726 ext. 200
E. mnaser@vexxhost.com
W. https://vexxhost.com
That would be fantastic; thanks!

-----Original Message-----
From: Erik Olof Gunnar Andersson <eandersson@blizzard.com>
Sent: Friday, January 17, 2020 7:37 PM
To: Mohammed Naser <mnaser@vexxhost.com>
Cc: Albert Braden <albertb@synopsys.com>; openstack-discuss@lists.openstack.org
Subject: Re: Galera config values
participants (4)
- Albert Braden
- Erik Olof Gunnar Andersson
- Eugen Block
- Mohammed Naser