After rebuilding Queens clusters on Train, race condition causes Designate record creation to fail

Michael Johnson johnsomor at gmail.com
Tue Oct 12 15:32:43 UTC 2021


I don't have a good answer for you on that as it pre-dates my history
with Designate a bit. I suspect it has to do with the removal of the
pool-manager and the restructuring of the controller code.

Maybe someone else on the discuss list has more insight.

Michael

On Tue, Oct 12, 2021 at 5:47 AM Braden, Albert <abraden at verisign.com> wrote:
>
> Thank you Michael, this is very helpful. Do you have any insight into why we don't experience this in Queens clusters? We aren't running a lock manager there either, and I haven't been able to duplicate the problem there.
>
> -----Original Message-----
> From: Michael Johnson <johnsomor at gmail.com>
> Sent: Monday, October 11, 2021 4:24 PM
> To: Braden, Albert <abraden at verisign.com>
> Cc: openstack-discuss at lists.openstack.org
> Subject: [EXTERNAL] Re: Re: After rebuilding Queens clusters on Train, race condition causes Designate record creation to fail
>
>
> You will need one of the Tooz supported distributed lock managers:
> Consul, Memcached, Redis, or Zookeeper.
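>
> For example (host names, ports and the password below are just
> placeholders, not taken from your environment), the [coordination]
> section in designate.conf points at whichever backend you deploy,
> something like:
>
> [coordination]
> # exactly one backend_url, matching the DLM you actually run:
> # backend_url = zookeeper://zk1.example.com:2181
> # backend_url = memcached://memcached1.example.com:11211
> # backend_url = consul://consul1.example.com:8500
> backend_url = redis://:<redis_password>@redis1.example.com:6379?db=0&socket_timeout=60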
>
> Michael
>
> On Mon, Oct 11, 2021 at 11:57 AM Braden, Albert <abraden at verisign.com> wrote:
> >
> > After investigating further, I realized that we're not running Redis, and I think that means that redis_connection_string doesn't get set. Does this mean that we must run Redis, or is there a workaround?
> >
> > -----Original Message-----
> > From: Braden, Albert
> > Sent: Monday, October 11, 2021 2:48 PM
> > To: 'johnsomor at gmail.com' <johnsomor at gmail.com>
> > Cc: 'openstack-discuss at lists.openstack.org' <openstack-discuss at lists.openstack.org>
> > Subject: RE: [EXTERNAL] Re: After rebuilding Queens clusters on Train, race condition causes Designate record creation to fail
> >
> > I think so. I see this:
> >
> > ansible/roles/designate/templates/designate.conf.j2:backend_url = {{ redis_connection_string }}
> >
> > ansible/group_vars/all.yml:redis_connection_string: "redis://{% for host in groups['redis'] %}{% if host == groups['redis'][0] %}admin:{{ redis_master_password }}@{{ 'api' | kolla_address(host) | put_address_in_context('url') }}:{{ redis_sentinel_port }}?sentinel=kolla{% else %}&sentinel_fallback={{ 'api' | kolla_address(host) | put_address_in_context('url') }}:{{ redis_sentinel_port }}{% endif %}{% endfor %}&db=0&socket_timeout=60&retry_on_timeout=yes"
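> >
> > If that template is being rendered, the resulting line in designate.conf should end up looking roughly like this (the IPs and sentinel port here are made up for illustration, not taken from our deployment):
> >
> > backend_url = redis://admin:<redis_master_password>@10.0.0.11:26379?sentinel=kolla&sentinel_fallback=10.0.0.12:26379&sentinel_fallback=10.0.0.13:26379&db=0&socket_timeout=60&retry_on_timeout=yes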
> >
> > Did anything change with the distributed lock manager between Queens and Train?
> >
> > -----Original Message-----
> > From: Michael Johnson <johnsomor at gmail.com>
> > Sent: Monday, October 11, 2021 1:15 PM
> > To: Braden, Albert <abraden at verisign.com>
> > Cc: openstack-discuss at lists.openstack.org
> > Subject: [EXTERNAL] Re: After rebuilding Queens clusters on Train, race condition causes Designate record creation to fail
> >
> >
> > Hi Albert,
> >
> > Have you configured your distributed lock manager for Designate?
> >
> > [coordination]
> > backend_url = <DLM URL>
> >
> > Michael
> >
> > On Fri, Oct 8, 2021 at 7:38 PM Braden, Albert <abraden at verisign.com> wrote:
> > >
> > > Hello everyone. It’s great to be back working on OpenStack again. I’m at Verisign now. I can hardly describe how happy I am to have an employer that does not attach nonsense to the bottom of my emails!
> > >
> > >
> > >
> > > We are rebuilding our clusters from Queens to Train. On the new Train clusters, customers are complaining that deleting a VM and then immediately creating a new one with the same name (via Terraform, for example) intermittently results in a missing DNS record. We can reproduce the issue by building a VM with Terraform, tainting it, and applying.
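> > >
> > > Roughly, the reproduction looks like this (the Terraform resource address is just an illustration, not our real config):
> > >
> > > $ terraform apply                                     # create the VM and its DNS record
> > > $ terraform taint openstack_compute_instance_v2.test  # mark the VM for replacement
> > > $ terraform apply                                     # delete and immediately re-create it with the same name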
> > >
> > >
> > >
> > > Before applying the change, we see the DNS record in the recordset:
> > >
> > >
> > >
> > > $ openstack recordset list dva3.vrsn.com.  --all |grep openstack-terra
> > >
> > > | f9aa73c1-84ba-4854-be71-cbb616de672c | 8d1c84082a044a53abe0d519ed9e8c60 | openstack-terra-test-host.dev-ostck.dva3.vrsn.com.        | A     | 10.220.4.89                                                           | ACTIVE | NONE   |
> > >
> > > $
> > >
> > >
> > >
> > > and we can pull it from the DNS server on the controllers:
> > >
> > >
> > >
> > > $ for i in {1..3}; do dig @dva3-ctrl${i}.cloud.vrsn.com -t axfr dva3.vrsn.com. |grep openstack-terra; done
> > >
> > > openstack-terra-test-host.dev-ostck.dva3.vrsn.com. 1 IN A 10.220.4.89
> > >
> > > openstack-terra-test-host.dev-ostck.dva3.vrsn.com. 1 IN A 10.220.4.89
> > >
> > > openstack-terra-test-host.dev-ostck.dva3.vrsn.com. 1 IN A 10.220.4.89
> > >
> > >
> > >
> > > After applying the change, the recordset list still shows the record at first, but the DNS servers no longer return it, and then it disappears from the recordset list as well:
> > >
> > >
> > >
> > > $ openstack recordset list dva3.vrsn.com.  --all |grep openstack-terra
> > >
> > > | f9aa73c1-84ba-4854-be71-cbb616de672c | 8d1c84082a044a53abe0d519ed9e8c60 | openstack-terra-test-host.dev-ostck.dva3.vrsn.com.        | A     | 10.220.4.89                                                           | ACTIVE | NONE   |
> > >
> > > $
> > >
> > > $ for i in {1..3}; do dig @dva3-ctrl${i}.cloud.vrsn.com -t axfr dva3.vrsn.com. |grep openstack-terra; done
> > >
> > > $ openstack recordset list dva3.vrsn.com.  --all |grep openstack-terra
> > >
> > > $
> > >
> > >
> > >
> > > We see this in the logs:
> > >
> > >
> > >
> > > 2021-10-09 01:53:44.307 27 ERROR oslo_messaging.notify.dispatcher oslo_db.exception.DBDuplicateEntry: (pymysql.err.IntegrityError) (1062, "Duplicate entry 'c70e693b4c47402db088c43a5a177134-openstack-terra-test-host.de...' for key 'unique_recordset'")
> > >
> > > 2021-10-09 01:53:44.307 27 ERROR oslo_messaging.notify.dispatcher [SQL: INSERT INTO recordsets (id, version, created_at, zone_shard, tenant_id, zone_id, name, type, ttl, reverse_name) VALUES (%(id)s, %(version)s, %(created_at)s, %(zone_shard)s, %(tenant_id)s, %(zone_id)s, %(name)s, %(type)s, %(ttl)s, %(reverse_name)s)]
> > >
> > > 2021-10-09 01:53:44.307 27 ERROR oslo_messaging.notify.dispatcher [parameters: {'id': 'dbbb904c347241a791aa01ca33a87b23', 'version': 1, 'created_at': datetime.datetime(2021, 10, 9, 1, 53, 44, 182652), 'zone_shard': 3184, 'tenant_id': '8d1c84082a044a53abe0d519ed9e8c60', 'zone_id': 'c70e693b4c47402db088c43a5a177134', 'name': 'openstack-terra-test-host.dev-ostck.dva3.vrsn.com.', 'type': 'A', 'ttl': None, 'reverse_name': '.moc.nsrv.3avd.kctso-ved.tsoh-tset-arret-kcatsnepo'}]
> > >
> > >
> > >
> > > It appears that Designate is trying to create the new record before the deletion of the old one finishes.
> > >
> > >
> > >
> > > Is anyone else seeing this on Train? The same set of actions doesn't cause this error in Queens. Do we need to change something in our Designate config to make it wait until the old records have finished deleting before attempting to create the new ones?


