[kolla] [train] [designate] Terraform "repave" causes DNS records to become orphaned
We have customers who use Terraform to build their clusters. They do a thing that they call “repave”, where they run an Ansible playbook that calls “terraform destroy” and then immediately calls “terraform apply” to rebuild the cluster. It looks like Designate is not able to keep up, and it fails to delete one or more of the DNS records. We have 3 records: IPv4 forward (A) and reverse (PTR), and IPv6 forward (AAAA).

When Designate fails to delete a record, it becomes orphaned. On the next “repave” the record is not deleted, because it’s not associated with the new VM, and we see errors in designate-sink.log:

2023-02-13 02:49:40.824 27 ERROR oslo_messaging.notify.dispatcher [parameters: {'id': '1282a6780f2f493c81ed20bc62ef370f', 'version': 1, 'created_at': datetime.datetime(2023, 2, 13, 2, 49, 40, 814726), 'zone_shard': 97, 'tenant_id': '130b797392d24b408e73c2be545d0a20', 'zone_id': '0616b8e0852540e59fd383cfb678af32', 'recordset_id': '1fc5a9eaea824d0f8b53eb91ea9ff6e2', 'data': '10.22.0.210', 'hash': 'e3270256501fceb97a14d4133d394880', 'managed': 1, 'managed_plugin_type': 'handler', 'managed_plugin_name': 'our_nova_fixed', 'managed_resource_type': 'instance', 'managed_resource_id': '842833cb9410404bbd5009eb6e0bf90a', 'status': 'PENDING', 'action': 'UPDATE', 'serial': 1676256582}]
…
2023-02-13 02:49:40.824 27 ERROR oslo_messaging.notify.dispatcher designate.exceptions.DuplicateRecord: Duplicate Record

The orphaned record is causing a MariaDB collision, because a record with that name and IP already exists. When this happens with an IPv6 record, it looks like Designate tries to create the IPv6 record, fails, and then does not try to create the IPv4 record at all, which causes trouble because Terraform waits for name resolution to work.

The obvious solution is to tell the TF users to introduce a delay between “destroy” and “apply”, but that would be non-trivial for them, and we would prefer to fix it on our end. What can I do to make Designate gracefully handle cases where a cluster is deleted and then immediately rebuilt with the same names and IPs? Also, how can I clean up these orphaned records? I’ve been asking the customer to destroy, then deleting the record, and then asking them to rebuild, but that is a manual process for them. Is it possible to link the orphaned record to the new VM so that it will be deleted on the next “repave”?

Example:

This VM was built today:

$ os server show f5e75688-5fa9-41b6-876f-289e0ebc04b9 | grep launched_at
| OS-SRV-USG:launched_at | 2023-02-16T02:48:49.000000 |

The A record was created in January:

$ os recordset show 0616b8e0852540e59fd383cfb678af32 1fc5a9ea-ea82-4d0f-8b53-eb91ea9ff6e2 | grep created_at
| created_at | 2023-01-25T02:48:52.000000 |
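For the manual cleanup, something along these lines should work from an admin shell, using the zone and recordset IDs from the error and the example above. This is only a sketch: the --all-projects and --edit-managed options (the latter sets the X-Designate-Edit-Managed-Records header so that Designate-managed records can be modified) are assumed to be available in your python-designateclient version, so verify them before relying on this:

# list the recordsets in the affected zone and compare created_at against the VMs
# (--all-projects and --edit-managed are assumptions; check "openstack recordset delete --help")
$ openstack recordset list 0616b8e0852540e59fd383cfb678af32 --all-projects -c id -c name -c type -c records -c created_at

# delete the orphaned, managed recordset as admin
$ openstack recordset delete 0616b8e0852540e59fd383cfb678af32 1fc5a9ea-ea82-4d0f-8b53-eb91ea9ff6e2 --all-projects --edit-managed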
On Thu, Feb 16, 2023 at 12:57 PM Albert Braden <ozzzo@yahoo.com> wrote:
The obvious solution is to tell the TF users to introduce a delay between “destroy” and “apply”, but that would be non-trivial for them, and we would prefer to fix it on our end. What can I do to make Designate gracefully handle cases where a cluster is deleted and then immediately rebuilt with the same names and IPs? Also, how can I clean up these orphaned records? I’ve been asking the customer to destroy, then deleting the record, and then asking them to rebuild, but that is a manual process for them. Is it possible to link the orphaned record to the new VM so that it will be deleted on the next “repave”?
Or perhaps the Terraform module should wait until the resource is fully gone, in case the delete is actually asynchronous, the same way that a VM delete is?
--
Mohammed Naser
VEXXHOST, Inc.
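As a rough sketch of that kind of wait, the “repave” playbook (or a wrapper script around it) could poll Designate between the two Terraform steps until the cluster’s recordsets are really gone, instead of relying on a fixed sleep. The zone ID and hostname below are placeholders, and the loop is only an illustration of the idea, not a tested recipe:

terraform destroy -auto-approve

# wait until the managed records for the old cluster have actually disappeared
# (<zone-id> and node01.example.com. are placeholders)
while openstack recordset list <zone-id> -c name -f value | grep -q 'node01.example.com.'; do
    sleep 5
done

terraform apply -auto-approve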
I wonder if it’s the same (or a similar) issue I asked about in November [1]. Do you have an HA cloud with multiple control nodes? One of our customers also uses Terraform to deploy clusters, and they have to add a sleep between the destroy and create commands, otherwise a wrong (already deleted) project ID gets applied. We figured out that it was the keystone role cache, but we still haven’t found a way to get both reasonable performance (we tried different cache settings) and quicker Terraform redeployments.

[1] https://lists.openstack.org/pipermail/openstack-discuss/2022-November/031122...

Quoting Mohammed Naser <mnaser@vexxhost.com>:
Or perhaps the Terraform module should wait until the resource is fully gone, in case the delete is actually asynchronous, the same way that a VM delete is?
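On the keystone side, the knobs that look relevant for the role cache are the per-section cache options in keystone.conf; with kolla-ansible the usual place for this would be a config override such as /etc/kolla/config/keystone.conf. The option names below are written from memory of the keystone configuration reference, so treat them as an assumption to verify for your release rather than a recommendation:

# assumed option names -- check the [role] section in the keystone config reference
[role]
# keep role caching enabled, but expire entries much sooner than the global
# [cache] expiration_time default, so deleted projects/roles fall out of the cache faster
caching = true
cache_time = 10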
Yes, we have 3 controllers per region. Theoretically we could write some TF code that would wait for the deletions to finish before rebuilding; the hard part would be getting our customers to deploy it. For them, TF is just a thing that builds servers so that they can work, and asking them to change it would be a heavy burden. I'm hoping to find a way to fix it in OpenStack.

On Thursday, February 16, 2023, 03:14:30 PM EST, Eugen Block <eblock@nde.ag> wrote:

Do you have an HA cloud with multiple control nodes?
I agree. I had also hoped to get some more insights here on the list, but no response yet. Maybe I should create a bug report for this role cache issue; that could draw some attention to it.

Quoting Albert Braden <ozzzo@yahoo.com>:
Yes, we have 3 controllers per region. Theoretically we could write some TF code that would wait for the deletions to finish before rebuilding; the hard part would be getting our customers to deploy it. For them, TF is just a thing that builds servers so that they can work, and asking them to change it would be a heavy burden. I'm hoping to find a way to fix it in OpenStack.
I created https://bugs.launchpad.net/keystone/+bug/2007982

Quoting Eugen Block <eblock@nde.ag>:
I agree. I had also hoped to get some more insights here on the list, but no response yet. Maybe I should create a bug report for this role cache issue; that could draw some attention to it.