After rebuilding Queens clusters on Train, race condition causes Designate record creation to fail
Hello everyone. It's great to be back working on OpenStack again. I'm at Verisign now. I can hardly describe how happy I am to have an employer that does not attach nonsense to the bottom of my emails!

We are rebuilding our clusters from Queens to Train. On the new Train clusters, customers are complaining that deleting a VM and then immediately creating a new one with the same name (via Terraform, for example) intermittently results in a missing DNS record. We can duplicate the issue by building a VM with Terraform, tainting it, and applying.

Before applying the change, we see the DNS record in the recordset:

$ openstack recordset list dva3.vrsn.com. --all |grep openstack-terra
| f9aa73c1-84ba-4854-be71-cbb616de672c | 8d1c84082a044a53abe0d519ed9e8c60 | openstack-terra-test-host.dev-ostck.dva3.vrsn.com. | A | 10.220.4.89 | ACTIVE | NONE |
$

and we can pull it from the DNS server on the controllers:

$ for i in {1..3}; do dig @dva3-ctrl${i}.cloud.vrsn.com -t axfr dva3.vrsn.com. |grep openstack-terra; done
openstack-terra-test-host.dev-ostck.dva3.vrsn.com. 1 IN A 10.220.4.89
openstack-terra-test-host.dev-ostck.dva3.vrsn.com. 1 IN A 10.220.4.89
openstack-terra-test-host.dev-ostck.dva3.vrsn.com. 1 IN A 10.220.4.89

After applying the change, we don't see it:

$ openstack recordset list dva3.vrsn.com. --all |grep openstack-terra
| f9aa73c1-84ba-4854-be71-cbb616de672c | 8d1c84082a044a53abe0d519ed9e8c60 | openstack-terra-test-host.dev-ostck.dva3.vrsn.com. | A | 10.220.4.89 | ACTIVE | NONE |
$
$ for i in {1..3}; do dig @dva3-ctrl${i}.cloud.vrsn.com -t axfr dva3.vrsn.com. |grep openstack-terra; done
$ openstack recordset list dva3.vrsn.com. --all |grep openstack-terra
$

We see this in the logs:

2021-10-09 01:53:44.307 27 ERROR oslo_messaging.notify.dispatcher oslo_db.exception.DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, "Duplicate entry 'c70e693b4c47402db088c43a5a177134-openstack-terra-test-host.de...' for key 'unique_recordset'")
2021-10-09 01:53:44.307 27 ERROR oslo_messaging.notify.dispatcher [SQL: INSERT INTO recordsets (id, version, created_at, zone_shard, tenant_id, zone_id, name, type, ttl, reverse_name) VALUES (%(id)s, %(version)s, %(created_at)s, %(zone_shard)s, %(tenant_id)s, %(zone_id)s, %(name)s, %(type)s, %(ttl)s, %(reverse_name)s)]
2021-10-09 01:53:44.307 27 ERROR oslo_messaging.notify.dispatcher [parameters: {'id': 'dbbb904c347241a791aa01ca33a87b23', 'version': 1, 'created_at': datetime.datetime(2021, 10, 9, 1, 53, 44, 182652), 'zone_shard': 3184, 'tenant_id': '8d1c84082a044a53abe0d519ed9e8c60', 'zone_id': 'c70e693b4c47402db088c43a5a177134', 'name': 'openstack-terra-test-host.dev-ostck.dva3.vrsn.com.', 'type': 'A', 'ttl': None, 'reverse_name': '.moc.nsrv.3avd.kctso-ved.tsoh-tset-arret-kcatsnepo'}]

It appears that Designate is trying to create the new record before the deletion of the old one finishes.

Is anyone else seeing this on Train? The same set of actions doesn't cause this error in Queens. Do we need to change something in our Designate config to make it wait until the old records are finished deleting before attempting to create the new ones?
Hi Albert,

Have you configured your distributed lock manager for Designate?

[coordination]
backend_url = <DLM URL>

Michael
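For context, a filled-in [coordination] section along the lines Michael describes might look like the following in designate.conf; the address, port, and query options here are placeholders rather than values from this deployment, and a Sentinel-style or other Tooz backend URL would work equally well:

[coordination]
backend_url = redis://192.0.2.10:6379?db=0&socket_timeout=60

Without a backend_url, the Designate services can only fall back to process-local locking, which would explain why a delete and an immediate re-create handled by different controllers can race.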
I think so. I see this:

ansible/roles/designate/templates/designate.conf.j2:backend_url = {{ redis_connection_string }}

ansible/group_vars/all.yml:redis_connection_string: "redis://{% for host in groups['redis'] %}{% if host == groups['redis'][0] %}admin:{{ redis_master_password }}@{{ 'api' | kolla_address(host) | put_address_in_context('url') }}:{{ redis_sentinel_port }}?sentinel=kolla{% else %}&sentinel_fallback={{ 'api' | kolla_address(host) | put_address_in_context('url') }}:{{ redis_sentinel_port }}{% endif %}{% endfor %}&db=0&socket_timeout=60&retry_on_timeout=yes"

Did anything change with the distributed lock manager between Queens and Train?
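On a kolla-ansible controller, one quick way to see what that template actually rendered is to check the generated per-service configs; the paths and service names below assume a stock kolla layout and may differ in other deployments:

$ for svc in designate-central designate-producer designate-worker; do echo ${svc}; grep -A1 '^\[coordination\]' /etc/kolla/${svc}/designate.conf; done

If the redis group in the inventory is empty, the Jinja loop in redis_connection_string has no hosts to emit, so the rendered URL would have no usable endpoint and the services would effectively run without a coordination backend.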
After investigating further, I realized that we're not running redis, and I think that means that redis_connection_string doesn't get set. Does this mean that we must run redis, or is there a workaround?
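If kolla-ansible is doing the deployment, the usual way to get a Tooz backend without standing anything up by hand is to let kolla deploy Redis itself; a minimal sketch, assuming the standard enable_redis flag and a [redis] inventory group (verify the exact variable names against your kolla-ansible release):

# /etc/kolla/globals.yml
enable_redis: "yes"

with the controllers listed under [redis] in the inventory, followed by a reconfigure of the Designate services so the rendered redis_connection_string ends up in backend_url. Any other Tooz backend reachable from all controllers would work just as well, as Michael notes in his reply below.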
You will need one of the Tooz-supported distributed lock managers: Consul, Memcached, Redis, or ZooKeeper.

Michael
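For reference, the Tooz backend_url formats for those options look roughly like the following; the hosts and ports are placeholders, and the exact query parameters each driver accepts are documented with Tooz:

backend_url = redis://192.0.2.10:6379
backend_url = memcached://192.0.2.10:11211
backend_url = zookeeper://192.0.2.10:2181

Whichever backend is chosen, every node running Designate services needs to point at the same one so that they actually share locks.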
Thank you Michael, this is very helpful. Do you have any insight into why we don't experience this in Queens clusters? We aren't running a lock manager there either, and I haven't been able to duplicate the problem there.
I don't have a good answer for you on that, as it pre-dates my history with Designate a bit. I suspect it has to do with the removal of the pool-manager and the restructuring of the controller code. Maybe someone else on the discuss list has more insight.

Michael