[kolla] [train] [designate] Terraform "repave" causes DNS records to become orphaned

Albert Braden ozzzo at yahoo.com
Thu Feb 16 20:46:18 UTC 2023


 Yes, we have 3 controllers per region. Theoretically we could write some TF wrapper code that waits for the deletions to finish before rebuilding (roughly like the sketch below); the hard part would be getting our customers to deploy it. For them, TF is just a tool that builds the servers they work on, and asking them to change it would be a heavy burden. I'm hoping to find a way to fix it in OpenStack.
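
Something like this is what I have in mind -- an untested sketch rather than
real code: the zone ID is a placeholder, it assumes openstacksdk is configured
via clouds.yaml, and it assumes the project contains nothing but the cluster
being repaved:

import subprocess
import time

import openstack

ZONE_ID = "<forward-zone-uuid>"   # placeholder
TIMEOUT = 600                     # seconds to wait for cleanup
POLL = 10                         # seconds between checks

conn = openstack.connect()        # reads OS_CLOUD / clouds.yaml

# Tear the cluster down first.
subprocess.run(["terraform", "destroy", "-auto-approve"], check=True)

# Wait until the instances and the sink-managed A/AAAA recordsets are
# really gone before rebuilding, so Designate never sees the duplicate.
deadline = time.time() + TIMEOUT
while time.time() < deadline:
    servers = list(conn.compute.servers())
    leftovers = [r for r in conn.dns.recordsets(ZONE_ID)
                 if r.type in ("A", "AAAA")]
    if not servers and not leftovers:
        break
    time.sleep(POLL)
else:
    raise RuntimeError("old servers/records still present; not rebuilding")

subprocess.run(["terraform", "apply", "-auto-approve"], check=True)
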
     On Thursday, February 16, 2023, 03:14:30 PM EST, Eugen Block <eblock at nde.ag> wrote:  
 
 I wonder if it's the same (or a similar) issue I asked about in November
[1]. Do you have an HA cloud with multiple control nodes? One of our
customers also uses Terraform to deploy clusters, and they have to add a
sleep between the destroy and create commands, otherwise a stale (deleted)
project ID gets applied. We figured out it was the keystone role cache, but
we still haven't found a way to get both reasonable performance (we tried
different cache settings) and quicker Terraform redeployments.

[1] https://lists.openstack.org/pipermail/openstack-discuss/2022-November/031122.html


Quoting Mohammed Naser <mnaser at vexxhost.com>:

> On Thu, Feb 16, 2023 at 12:57 PM Albert Braden <ozzzo at yahoo.com> wrote:
>
>> We have customers who use Terraform to build their clusters. They do a
>> thing they call “repave”, where they run an Ansible playbook that calls
>> “terraform destroy” and then immediately calls “terraform apply” to rebuild
>> the cluster. It looks like Designate is not able to keep up, and it fails
>> to delete one or more of the DNS records. We have 3 records: IPv4 forward
>> (A), IPv4 reverse (PTR), and IPv6 forward (AAAA).
>>
>> When Designate fails to delete a record, it becomes orphaned. On the next
>> “repave” the record is not deleted, because it’s not associated with the
>> new VM, and we see errors in designate-sink.log:
>>
>> 2023-02-13 02:49:40.824 27 ERROR oslo_messaging.notify.dispatcher
>> [parameters: {'id': '1282a6780f2f493c81ed20bc62ef370f', 'version': 1,
>> 'created_at': datetime.datetime(2023, 2, 13, 2, 49, 40, 814726),
>> 'zone_shard': 97, 'tenant_id': '130b797392d24b408e73c2be545d0a20',
>> 'zone_id': '0616b8e0852540e59fd383cfb678af32', 'recordset_id':
>> '1fc5a9eaea824d0f8b53eb91ea9ff6e2', 'data': '10.22.0.210', 'hash':
>> 'e3270256501fceb97a14d4133d394880', 'managed': 1, 'managed_plugin_type':
>> 'handler', 'managed_plugin_name': 'our_nova_fixed',
>> 'managed_resource_type': 'instance', 'managed_resource_id':
>> '842833cb9410404bbd5009eb6e0bf90a', 'status': 'PENDING', 'action':
>> 'UPDATE', 'serial': 1676256582}]
>> 2023-02-13 02:49:40.824 27 ERROR oslo_messaging.notify.dispatcher
>> designate.exceptions.DuplicateRecord: Duplicate Record
>>
>> The orphaned record causes a MariaDB collision because a record with
>> that name and IP already exists. When this happens with an IPv6 record, it
>> looks like Designate tries to create the AAAA record, fails, and then does
>> not try to create the A record, which causes trouble because Terraform
>> waits for name resolution to work.
>>
>> The obvious solution is to tell TF users to introduce a delay between
>> “destroy” and “apply”, but that would be non-trivial for them, and we would
>> prefer to fix it on our end. What can I do to make Designate gracefully
>> handle cases where a cluster is deleted and then immediately rebuilt with
>> the same names and IPs? Also, how can I clean up these orphaned records?
>> I’ve been asking the customer to destroy, then deleting the record, and
>> then asking them to rebuild, but that is a manual process for them. Is it
>> possible to link the orphaned record to the new VM so that it will be
>> deleted on the next “repave”?
>>
>
> Or perhaps the Terraform module should wait until the resource is fully
> gone, in case the delete is actually asynchronous? The same way that a VM
> delete is asynchronous.
>
>
>> Example:
>>
>> This VM was built today:
>> $ os server show f5e75688-5fa9-41b6-876f-289e0ebc04b9|grep launched_at
>> | OS-SRV-USG:launched_at              | 2023-02-16T02:48:49.000000
>>
>> The A record was created in January:
>> $ os recordset show 0616b8e0852540e59fd383cfb678af32 1fc5a9ea-ea82-4d0f-8b53-eb91ea9ff6e2|grep created_at
>> | created_at  | 2023-01-25T02:48:52.000000          |
>>
>>
>
> --
> Mohammed Naser
> VEXXHOST, Inc.
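
Regarding my cleanup question above: the best heuristic I have so far is the
one in the example at the end, i.e. a recordset whose created_at predates the
matching VM's launched_at. A rough, untested sketch of scripting that check
with openstacksdk (the zone ID is a placeholder, and the hostname matching is
an assumption about our naming scheme):

import openstack

ZONE_ID = "<forward-zone-uuid>"   # placeholder
conn = openstack.connect()

# Map instance name -> launch time for the project's current servers.
servers = {s.name: s.launched_at for s in conn.compute.servers()}

for rs in conn.dns.recordsets(ZONE_ID):
    if rs.type not in ("A", "AAAA"):
        continue
    # Assumes recordsets are named <hostname>.<zone>.
    hostname = rs.name.rstrip(".").split(".")[0]
    launched = servers.get(hostname)
    # Orphan: no matching VM, or the recordset predates the current VM
    # (ISO timestamps compare correctly as strings).
    if launched is None or rs.created_at < launched:
        print("orphan:", rs.name, rs.type, rs.created_at)
        # conn.dns.delete_recordset(rs, zone=ZONE_ID)  # uncomment to clean up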




  