[kolla] [train] [designate] Terraform "repave" causes DNS records to become orphaned

Eugen Block eblock at nde.ag
Tue Feb 21 09:33:28 UTC 2023


I agree. I had also hoped to get some more insight here on the list
but got no response yet. Maybe I should create a bug report for this
role cache issue; that could draw some attention to it.
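
For reference, the cache settings we experimented with boil down to
something like this (a sketch for a kolla-ansible override; the TTL
value is just an example to experiment with, not a recommendation):

    # /etc/kolla/config/keystone.conf (merged by kolla-ansible)
    [role]
    caching = True
    # role cache TTL in seconds; example value, tune to taste
    cache_time = 10

    # roll it out (inventory path is a placeholder):
    # kolla-ansible -i <inventory> reconfigure --tags keystone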

Quoting Albert Braden <ozzzo at yahoo.com>:

> Yes, we have 3 controllers per region. Theoretically we could write
> some TF code that would wait for the deletions to finish before
> rebuilding; the hard part would be getting our customers to deploy
> it. For them, TF is just a thing that builds servers so that they can
> work, and asking them to change it would be a heavy burden. I'm
> hoping to find a way to fix it in OpenStack.
>      On Thursday, February 16, 2023, 03:14:30 PM EST, Eugen Block  
> <eblock at nde.ag> wrote:
>
> I wonder if it’s the same (or a similar) issue I asked about in
> November [1]. Do you have an HA cloud with multiple control nodes?
> One of our customers also uses Terraform to deploy clusters, and they
> have to insert a sleep between the destroy and create commands,
> otherwise a stale (already deleted) project ID gets applied. We
> figured out it was the keystone role cache, but we still haven't
> found a way to get both reasonable performance (we tried different
> cache settings) and quick Terraform redeployments.
>
> [1] 
> https://lists.openstack.org/pipermail/openstack-discuss/2022-November/031122.html
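>
> Their workaround boils down to something like this (a minimal sketch;
> the sleep length is an assumption and has to exceed the effective
> cache TTL):
>
>     terraform destroy -auto-approve
>     sleep 300
>     terraform apply -auto-approve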
>
>
> Quoting Mohammed Naser <mnaser at vexxhost.com>:
>
>> On Thu, Feb 16, 2023 at 12:57 PM Albert Braden <ozzzo at yahoo.com> wrote:
>>
>>> We have customers who use Terraform to build their clusters. They do a
>>> thing that they call “repave”: they run an Ansible playbook that calls
>>> “terraform destroy” and then immediately calls “terraform apply” to rebuild
>>> the cluster. It looks like Designate is not able to keep up, and it fails
>>> to delete one or more of the DNS records. We have 3 records per VM: IPv4
>>> forward (A), IPv4 reverse (PTR), and IPv6 forward (AAAA).
>>>
>>> When Designate fails to delete a record, it becomes orphaned. On the next
>>> “repave” the record is not deleted, because it’s not associated with the
>>> new VM, and we see errors in designate-sink.log:
>>>
>>> 2023-02-13 02:49:40.824 27 ERROR oslo_messaging.notify.dispatcher
>>> [parameters: {'id': '1282a6780f2f493c81ed20bc62ef370f', 'version': 1,
>>> 'created_at': datetime.datetime(2023, 2, 13, 2, 49, 40, 814726),
>>> 'zone_shard': 97, 'tenant_id': '130b797392d24b408e73c2be545d0a20',
>>> 'zone_id': '0616b8e0852540e59fd383cfb678af32', 'recordset_id':
>>> '1fc5a9eaea824d0f8b53eb91ea9ff6e2', 'data': '10.22.0.210', 'hash':
>>> 'e3270256501fceb97a14d4133d394880', 'managed': 1, 'managed_plugin_type':
>>> 'handler', 'managed_plugin_name': 'our_nova_fixed',
>>> 'managed_resource_type': 'instance', 'managed_resource_id':
>>> '842833cb9410404bbd5009eb6e0bf90a', 'status': 'PENDING', 'action':
>>> 'UPDATE', 'serial': 1676256582}]
>>> 2023-02-13 02:49:40.824 27 ERROR oslo_messaging.notify.dispatcher
>>> designate.exceptions.DuplicateRecord: Duplicate Record
>>>
>>> The orphaned record causes a MariaDB collision because a record with
>>> that name and IP already exists. When this happens with an IPv6 record,
>>> it looks like Designate tries to create the IPv6 record, fails, and then
>>> does not try to create the IPv4 record, which causes trouble because
>>> Terraform waits for name resolution to work.
>>>
>>> The obvious solution is to tell TF users to introduce a delay between
>>> “destroy” and “apply”, but that would be non-trivial for them, and we
>>> would prefer to fix it on our end. What can I do to make Designate
>>> gracefully handle cases where a cluster is deleted and then immediately
>>> rebuilt with the same names and IPs? Also, how can I clean up these
>>> orphaned records? So far I’ve been asking the customer to destroy, then
>>> deleting the record myself, and then asking them to rebuild, but that is
>>> a manual process. Is it possible to link the orphaned record to the new
>>> VM so that it will be deleted on the next “repave”?
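>>>
>>> What I have in mind for scripting the cleanup is roughly this (an
>>> untested sketch; it assumes the first label of each recordset name
>>> matches a server name, that credentials for the affected project are
>>> loaded, and that the --edit-managed flag is needed for records
>>> created by designate-sink):
>>>
>>>     ZONE="example.com."    # placeholder zone
>>>     openstack recordset list "$ZONE" -f value -c id -c name -c type |
>>>     while read -r id name type; do
>>>         # only forward records; PTRs live in the reverse zones
>>>         case "$type" in A|AAAA) ;; *) continue ;; esac
>>>         server="${name%%.*}"   # assumed mapping: recordset -> server
>>>         if ! openstack server show "$server" >/dev/null 2>&1; then
>>>             echo "orphan: $name ($type)"
>>>             openstack recordset delete "$ZONE" "$id" --edit-managed
>>>         fi
>>>     done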
>>>
>>
>> Or perhaps the Terraform module should wait until the resource is fully
>> gone, in case the delete is actually asynchronous? The same way that a
>> VM delete is asynchronous.
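>>
>> Something along these lines might do it (an untested sketch; the zone
>> and record name are placeholders, and it assumes the project’s
>> credentials are loaded):
>>
>>     terraform destroy -auto-approve
>>     # wait until designate no longer lists a recordset for the VM name
>>     while openstack recordset list example.com. -f value -c name \
>>           | grep -q '^myvm\.example\.com\.$'; do
>>         sleep 5
>>     done
>>     terraform apply -auto-approve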
>>
>>
>>> Example:
>>>
>>> This VM was built today:
>>> $ os server show f5e75688-5fa9-41b6-876f-289e0ebc04b9|grep launched_at
>>> | OS-SRV-USG:launched_at              | 2023-02-16T02:48:49.000000 |
>>>
>>> The A record was created in January:
>>> $ os recordset show 0616b8e0852540e59fd383cfb678af32 1fc5a9ea-ea82-4d0f-8b53-eb91ea9ff6e2 | grep created_at
>>> | created_at  | 2023-01-25T02:48:52.000000          |
>>>
>>>
>>
>> --
>> Mohammed Naser
>> VEXXHOST, Inc.