[kolla] [train] [designate] Terraform "repave" causes DNS records to become orphaned

Eugen Block eblock at nde.ag
Tue Feb 21 14:10:44 UTC 2023


I created https://bugs.launchpad.net/keystone/+bug/2007982

Quoting Eugen Block <eblock at nde.ag>:

> I agree, I had also hoped to get some more insights here on this  
> list but got no response yet. Maybe I should create a bug report for  
> this role cache issue; that could draw some attention to it.
>
> Quoting Albert Braden <ozzzo at yahoo.com>:
>
>> Yes, we have 3 controllers per region. Theoretically we could write  
>> some TF code that would wait for the deletions to finish before  
>> rebuilding; the hard part would be getting our customers to deploy  
>> it. For them TF is just a thing that builds servers so that they  
>> can work, and asking them to change it would be a heavy burden. I'm  
>> hoping to find a way to fix it in Openstack.
>>     On Thursday, February 16, 2023, 03:14:30 PM EST, Eugen Block  
>> <eblock at nde.ag> wrote:
>>
>> I wonder if it’s the same (or similar) issue I asked about in November 
>> [1]. Do you have an HA cloud with multiple control nodes? One of our 
>> customers also uses Terraform to deploy clusters, and they have to 
>> add a sleep between the destroy and create commands, otherwise a wrong 
>> (deleted) project ID will be applied. We figured out it was the 
>> keystone role cache, but we still haven’t found a way to achieve both 
>> reasonable performance (we tried different cache settings) and quicker 
>> Terraform redeployments.
>>
>> [1] 
>> https://lists.openstack.org/pipermail/openstack-discuss/2022-November/031122.html
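A rough sketch of that kind of wrapper, assuming a fixed pause sized to match keystone's [cache]/expiration_time; the 300-second value and the -auto-approve flags are assumptions, not the customer's actual playbook:

#!/usr/bin/env python3
# Hypothetical wrapper: pause between "terraform destroy" and
# "terraform apply" so that stale keystone role-cache entries (and with
# them the deleted project ID) can expire before the rebuild starts.
import subprocess
import time

CACHE_TTL = 300  # assumed to roughly match keystone's [cache]/expiration_time

subprocess.run(["terraform", "destroy", "-auto-approve"], check=True)
time.sleep(CACHE_TTL)
subprocess.run(["terraform", "apply", "-auto-approve"], check=True)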
>>
>>
>> Quoting Mohammed Naser <mnaser at vexxhost.com>:
>>
>>> On Thu, Feb 16, 2023 at 12:57 PM Albert Braden <ozzzo at yahoo.com> wrote:
>>>
>>>> We have customers who use Terraform to build their clusters. They do a
>>>> thing that they call “repave” where they run an Ansible playbook  
>>>> that calls
>>>> “terraform destroy” and then immediately calls “terraform apply”  
>>>> to rebuild
>>>> the cluster. It looks like Designate is not able to keep up, and it fails
>>>> to delete one or more of the DNS records. We have 3 records: an IPv4
>>>> forward (A), an IPv4 reverse (PTR), and an IPv6 forward (AAAA).
>>>>
>>>> When Designate fails to delete a record, it becomes orphaned. On the next
>>>> “repave” the record is not deleted, because it’s not associated with the
>>>> new VM, and we see errors in designate-sink.log:
>>>>
>>>> 2023-02-13 02:49:40.824 27 ERROR oslo_messaging.notify.dispatcher
>>>> [parameters: {'id': '1282a6780f2f493c81ed20bc62ef370f', 'version': 1,
>>>> 'created_at': datetime.datetime(2023, 2, 13, 2, 49, 40, 814726),
>>>> 'zone_shard': 97, 'tenant_id': '130b797392d24b408e73c2be545d0a20',
>>>> 'zone_id': '0616b8e0852540e59fd383cfb678af32', 'recordset_id':
>>>> '1fc5a9eaea824d0f8b53eb91ea9ff6e2', 'data': '10.22.0.210', 'hash':
>>>> 'e3270256501fceb97a14d4133d394880', 'managed': 1, 'managed_plugin_type':
>>>> 'handler', 'managed_plugin_name': 'our_nova_fixed',
>>>> 'managed_resource_type': 'instance', 'managed_resource_id':
>>>> '842833cb9410404bbd5009eb6e0bf90a', 'status': 'PENDING', 'action':
>>>> 'UPDATE', 'serial': 1676256582}]
>>>> 2023-02-13 02:49:40.824 27 ERROR oslo_messaging.notify.dispatcher
>>>> designate.exceptions.DuplicateRecord: Duplicate Record
>>>>
>>>> The orphaned record causes a MariaDB collision because a record with
>>>> that name and IP already exists. When this happens with an IPv6 record, it
>>>> looks like Designate tries to create the IPv6 record, fails, and then
>>>> does not try to create the IPv4 record at all, which causes trouble because
>>>> Terraform waits for name resolution to work.
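A small check along these lines (only a sketch, with placeholder hostnames) could verify after a repave that every node resolves over both IPv4 and IPv6 before anything starts waiting on DNS:

#!/usr/bin/env python3
# Verify that each cluster hostname has both an A and an AAAA record,
# i.e. catch the case where a failed AAAA insert prevented the A record
# from being created. Hostnames below are placeholders.
import socket

HOSTNAMES = ["node1.example.com", "node2.example.com"]

for host in HOSTNAMES:
    for family, rrtype in ((socket.AF_INET, "A"), (socket.AF_INET6, "AAAA")):
        try:
            socket.getaddrinfo(host, None, family)
        except socket.gaierror:
            print(f"missing {rrtype} record for {host}")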
>>>>
>>>> The obvious solution is to tell TF users to introduce a delay between
>>>> “destroy” and “apply”, but that would be non-trivial for them, and we would
>>>> prefer to fix it on our end. What can I do to make Designate gracefully
>>>> manage cases where a cluster is deleted and then immediately rebuilt with
>>>> the same names and IPs? Also, how can I clean up these orphaned records?
>>>> I’ve been asking the customer to destroy, and then deleting the  
>>>> record, and
>>>> then asking them to rebuild, but that is a manual process for them. Is it
>>>> possible to link the orphaned record to the new VM so that it will be
>>>> deleted on the next “repave”?
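One possible way to spot such orphans, sketched with openstacksdk and placeholder cloud/zone names; it only prints candidates and deletes nothing until they have been verified by hand:

#!/usr/bin/env python3
# Rough orphan finder: list the A/AAAA recordsets in one zone and flag
# those whose IPs are no longer attached to any live server in the
# project. Cloud name and zone name are placeholders.
import openstack

conn = openstack.connect(cloud="mycloud")       # assumed clouds.yaml entry
zone = conn.dns.find_zone("example.com.")       # placeholder zone

live_ips = {
    addr["addr"]
    for server in conn.compute.servers()
    for addrs in (server.addresses or {}).values()
    for addr in addrs
}

for rs in conn.dns.recordsets(zone):
    if rs.type not in ("A", "AAAA"):
        continue
    stale = [ip for ip in rs.records if ip not in live_ips]
    if stale:
        print(f"possibly orphaned: {rs.name} {rs.type} {stale} (id={rs.id})")
        # after checking by hand: conn.dns.delete_recordset(rs, zone=zone)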
>>>>
>>>
>>> Or perhaps the Terraform module should wait until the resource is fully
>>> gone, in case the delete is actually asynchronous? The same way that a VM
>>> delete is asynchronous.
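A minimal sketch of that wait, using openstacksdk with a placeholder zone, placeholder hostnames, and an arbitrary timeout, run between the destroy and the apply:

#!/usr/bin/env python3
# Poll Designate after "terraform destroy" until the cluster's recordsets
# are really gone, then let "terraform apply" run. Zone, hostnames and
# the 10-minute timeout are assumptions.
import time
import openstack

conn = openstack.connect(cloud="mycloud")           # assumed clouds.yaml entry
zone = conn.dns.find_zone("example.com.")           # placeholder zone
hostnames = {"node1.example.com.", "node2.example.com."}

deadline = time.time() + 600
while time.time() < deadline:
    remaining = {rs.name for rs in conn.dns.recordsets(zone)} & hostnames
    if not remaining:
        break
    time.sleep(10)
else:
    raise RuntimeError(f"recordsets still present after timeout: {remaining}")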
>>>
>>>
>>>> Example:
>>>>
>>>> This VM was built today:
>>>> $ os server show f5e75688-5fa9-41b6-876f-289e0ebc04b9|grep launched_at
>>>> | OS-SRV-USG:launched_at              | 2023-02-16T02:48:49.000000
>>>>
>>>> The A record was created in January:
>>>> $ os recordset show 0616b8e0852540e59fd383cfb678af32
>>>> 1fc5a9ea-ea82-4d0f-8b53-eb91ea9ff6e2|grep created_at
>>>> | created_at  | 2023-01-25T02:48:52.000000          |
>>>>
>>>>
>>>
>>> --
>>> Mohammed Naser
>>> VEXXHOST, Inc.





