I created https://bugs.launchpad.net/keystone/+bug/2007982

Quote from Eugen Block <eblock@nde.ag>:
I agree; I had also hoped to get some more insights here on the list but haven't got a response yet. Maybe I should create a bug report for this role cache issue; that could draw some attention to it.
Quote from Albert Braden <ozzzo@yahoo.com>:
Yes, we have 3 controllers per region. Theoretically we could write some TF code that would wait for the deletions to finish before rebuilding; the hard part would be getting our customers to deploy it. For them TF is just a thing that builds servers so that they can work, and asking them to change it would be a heavy burden. I'm hoping to find a way to fix it in OpenStack.

On Thursday, February 16, 2023, 03:14:30 PM EST, Eugen Block <eblock@nde.ag> wrote:
I wonder if it’s the same (or a similar) issue I asked about in November [1]. Do you have an HA cloud with multiple control nodes? One of our customers also uses Terraform to deploy clusters, and they have to add a sleep between the destroy and create commands, otherwise a stale (already deleted) project ID gets applied. We figured out it was the keystone role cache, but we still haven’t found a way to achieve both reasonable performance (we tried different cache settings) and quicker Terraform redeployments.
[1] https://lists.openstack.org/pipermail/openstack-discuss/2022-November/031122...
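For what it's worth, a minimal sketch of the kind of wait described above, assuming a plain shell wrapper around the two Terraform calls; the "k8s-" name prefix and the zone ID are placeholders, not values from the actual environment:

  #!/bin/sh
  # Hedged sketch only: tear everything down, then poll until the old
  # resources are really gone before rebuilding.
  terraform destroy -auto-approve

  # Wait until Nova no longer lists any of the old cluster's servers
  # ("k8s-" is an assumed naming convention).
  while openstack server list --name 'k8s-' -f value -c ID | grep -q .; do
      sleep 10
  done

  # Wait until the A recordsets created for those servers are gone as
  # well (ZONE is a placeholder for the tenant's zone ID).
  ZONE="<zone-uuid>"
  while openstack recordset list "$ZONE" --type A -f value -c name | grep -q 'k8s-'; do
      sleep 10
  done

  terraform apply -auto-approve

That only trades the race for extra wall-clock time, of course, and getting customers to deploy even a wrapper like this is the hard part Albert mentions.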
Quote from Mohammed Naser <mnaser@vexxhost.com>:
On Thu, Feb 16, 2023 at 12:57 PM Albert Braden <ozzzo@yahoo.com> wrote:
We have customers who use Terraform to build their clusters. They do a thing they call “repave,” where they run an Ansible playbook that calls “terraform destroy” and then immediately calls “terraform apply” to rebuild the cluster. It looks like Designate is not able to keep up, and it fails to delete one or more of the DNS records. We have 3 records: IPv4 forward (A) and reverse (PTR), and IPv6 forward (AAAA).
When Designate fails to delete a record, it becomes orphaned. On the next “repave” the record is not deleted, because it’s not associated with the new VM, and we see errors in designate-sink.log:
2023-02-13 02:49:40.824 27 ERROR oslo_messaging.notify.dispatcher [parameters: {'id': '1282a6780f2f493c81ed20bc62ef370f', 'version': 1, 'created_at': datetime.datetime(2023, 2, 13, 2, 49, 40, 814726), 'zone_shard': 97, 'tenant_id': '130b797392d24b408e73c2be545d0a20', 'zone_id': '0616b8e0852540e59fd383cfb678af32', 'recordset_id': '1fc5a9eaea824d0f8b53eb91ea9ff6e2', 'data': '10.22.0.210', 'hash': 'e3270256501fceb97a14d4133d394880', 'managed': 1, 'managed_plugin_type': 'handler', 'managed_plugin_name': 'our_nova_fixed', 'managed_resource_type': 'instance', 'managed_resource_id': '842833cb9410404bbd5009eb6e0bf90a', 'status': 'PENDING', 'action': 'UPDATE', 'serial': 1676256582}]
…
2023-02-13 02:49:40.824 27 ERROR oslo_messaging.notify.dispatcher designate.exceptions.DuplicateRecord: Duplicate Record
The orphaned record causes a MariaDB collision because a record with that name and IP already exists. When this happens with an IPv6 record, it looks like Designate tries to create the IPv6 record, fails, and then does not try to create the IPv4 record at all, which causes trouble because Terraform waits for name resolution to work.
The obvious solution is to tell the TF users to introduce a delay between “destroy” and “apply,” but that would be non-trivial for them, and we would prefer to fix it on our end. What can I do to make Designate gracefully handle cases where a cluster is deleted and then immediately rebuilt with the same names and IPs? Also, how can I clean up these orphaned records? So far I've been asking the customer to destroy, deleting the record myself, and then asking them to rebuild, but that is a manual process for them. Is it possible to link the orphaned record to the new VM so that it will be deleted on the next “repave”?
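For reference, removing one of these orphaned, sink-managed recordsets by hand looks roughly like this; a hedged sketch only, using the zone and recordset IDs from the log above and assuming an admin credential and a designate client that supports the --all-projects/--edit-managed options:

  # Hedged sketch: delete the orphaned A record left behind by the failed
  # "repave". The IDs are the zone and recordset from the log above.
  # --all-projects : act on the customer's zone from an admin account
  # --edit-managed : needed because sink-created records are "managed"
  openstack recordset delete \
      --all-projects --edit-managed \
      0616b8e0852540e59fd383cfb678af32 \
      1fc5a9ea-ea82-4d0f-8b53-eb91ea9ff6e2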
Or perhaps the Terraform module should wait until the resource is fully gone, in case the delete is actually asynchronous? The same way that a VM delete is asynchronous.
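As a hedged illustration of that asynchrony (the UUID is the VM from the example below): the client can be told to block until Nova has actually removed the instance, but the designate-sink handler still consumes the delete notification after that, so the DNS records can lag behind even a "completed" delete.

  # --wait blocks until Nova reports the server gone; the records the sink
  # created for it are cleaned up asynchronously afterwards, so they may
  # still exist at this point.
  openstack server delete --wait f5e75688-5fa9-41b6-876f-289e0ebc04b9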
Example:
This VM was built today:
$ os server show f5e75688-5fa9-41b6-876f-289e0ebc04b9 | grep launched_at
| OS-SRV-USG:launched_at | 2023-02-16T02:48:49.000000 |
The A record was created in January:
$ os recordset show 0616b8e0852540e59fd383cfb678af32 1fc5a9ea-ea82-4d0f-8b53-eb91ea9ff6e2 | grep created_at
| created_at | 2023-01-25T02:48:52.000000 |
--
Mohammed Naser
VEXXHOST, Inc.