It looks like you have a problem with the Octavia service, which is responsible for creating the LBs for the API server and etcd. Check the Octavia logs and verify whether creating an LB manually (openstack CLI/GUI) works correctly in your environment (see the sketch below).

Kind regards,
Paweł
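For example, a minimal sketch using openstacksdk to create a test LB and watch its provisioning_status; the cloud name and the subnet ID are placeholders to adapt to your environment:

-----
import time

import openstack

# Placeholders: a clouds.yaml entry named "mycloud" and an existing tenant
# subnet UUID; adapt both to your environment.
conn = openstack.connect(cloud="mycloud")

lb = conn.load_balancer.create_load_balancer(
    name="lb-manual-test", vip_subnet_id="<subnet-uuid>")

# Poll the provisioning status, as "openstack loadbalancer show" would.
for _ in range(60):
    lb = conn.load_balancer.get_load_balancer(lb.id)
    print(lb.provisioning_status, lb.operating_status)
    if lb.provisioning_status in ("ACTIVE", "ERROR"):
        break
    time.sleep(10)

# Clean up the test LB afterwards.
conn.load_balancer.delete_load_balancer(lb.id)
-----

On 21.08.2024 at 11:24, Michel Jouvin wrote: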
Additional information: a similar issue seems to happen when deleting a cluster, leading to a deletion failure (cluster status = DELETE_FAILED) on the first attempt and requiring a second attempt to delete the cluster successfully.
During the first attempt, the following exception occurs in the magnum-conductor logs:
-----
087 6a7221ba840247cdae245d7a61e0023b - - c0134d6e05ce4401ab78951ff1f2604c -] Exception during message handling: magnum.common.exception.PreDeletionFailed: Failed to pre-delete resources for cluster cd9d4497-8147-45e9-a9f6-473754d1e290, error: Timeout waiting for the load balancers {'a193b831-9c7f-469e-aa9e-c8bc95777b19', 'd7af14d7-64ee-444e-9988-079c53661dcc'} to be deleted..
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/magnum/common/octavia.py", line 134, in delete_loadbalancers
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     wait_for_lb_deleted(octavia_client, candidates)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/magnum/common/octavia.py", line 51, in wait_for_lb_deleted
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     raise Exception("Timeout waiting for the load balancers "
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server Exception: Timeout waiting for the load balancers {'a193b831-9c7f-469e-aa9e-c8bc95777b19', 'd7af14d7-64ee-444e-9988-079c53661dcc'} to be deleted.
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server During handling of the above exception, another exception occurred:
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/osprofiler/profiler.py", line 159, in wrapper
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     result = f(*args, **kwargs)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/magnum/conductor/handlers/cluster_conductor.py", line 192, in cluster_delete
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     cluster_driver.delete_cluster(context, cluster)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/magnum/drivers/heat/driver.py", line 152, in delete_cluster
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     self.pre_delete_cluster(context, cluster)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/magnum/drivers/heat/driver.py", line 297, in pre_delete_cluster
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     octavia.delete_loadbalancers(context, cluster)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/magnum/common/octavia.py", line 136, in delete_loadbalancers
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     raise exception.PreDeletionFailed(cluster_uuid=cluster.uuid,
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server magnum.common.exception.PreDeletionFailed: Failed to pre-delete resources for cluster cd9d4497-8147-45e9-a9f6-473754d1e290, error: Timeout waiting for the load balancers {'a193b831-9c7f-469e-aa9e-c8bc95777b19', 'd7af14d7-64ee-444e-9988-079c53661dcc'} to be deleted..
-----
This happens when the deletion of the Octavia LBs is still in progress and results in the cluster state becoming DELETE_FAILED. When the cluster deletion is retried after the LBs have actually been deleted, the cluster is deleted successfully.
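For reference, the pre-deletion step shown in the traceback is essentially a poll-until-gone loop with a fixed deadline; when the Octavia LBs take longer than that to disappear, the PreDeletionFailed above is raised and the cluster ends up in DELETE_FAILED. A simplified sketch of that pattern (not the actual Magnum code; the timeout and interval values are illustrative):

-----
import time


def wait_for_lb_deleted(list_existing_lb_ids, candidates, timeout=600, interval=10):
    """Poll until none of the candidate LB IDs exist any more, or give up.

    Simplified illustration of the pattern behind
    magnum/common/octavia.py:wait_for_lb_deleted; list_existing_lb_ids
    stands in for the Octavia client call, and the timeout/interval
    values are illustrative, not Magnum's defaults.
    """
    deadline = time.time() + timeout
    remaining = set(candidates)
    while remaining:
        if time.time() > deadline:
            raise Exception("Timeout waiting for the load balancers "
                            "%s to be deleted." % remaining)
        # Keep only the candidates Octavia still reports as existing.
        remaining &= set(list_existing_lb_ids())
        if remaining:
            time.sleep(interval)
-----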
Michel
On 21/08/2024 at 11:12, Michel Jouvin wrote:
Hi,
I am continuing my tests with Magnum Caracal and found another strange behaviour during cluster creation: the cluster goes through the CREATE_FAILED state before reaching CREATE_COMPLETE (HEALTHY), which is very confusing.
Looking at the Heat engine logs, this happens during load balancer creation, at the point where the LB amphora has been created and Octavia is waiting to be able to connect to it (typically about one minute). During this period, Heat logs the following error:
-----
2024-08-21 09:54:52.194 3200318 INFO heat.engine.resource [None req-610b24bd-7511-4d31-9f8c-b922162a7648 jouvin - - - - -] CREATE: LoadBalancer "loadbalancer" [670985ef-5808-4226-ad1b-b1244871de16] Stack "test-mj-1-30-m1n1-bis-ipufjomqmnnf-etcd_lb-d5fuqygvpp2j" [5c5905bf-f538-4a42-93ce-38d96b7c6603]
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource Traceback (most recent call last):
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource   File "/usr/lib/python3.9/site-packages/heat/engine/resource.py", line 922, in _action_recorder
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource     yield
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource   File "/usr/lib/python3.9/site-packages/heat/engine/resource.py", line 1034, in _do_action
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource     yield from self.action_handler_task(action, args=handler_args)
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource   File "/usr/lib/python3.9/site-packages/heat/engine/resource.py", line 984, in action_handler_task
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource     done = check(handler_data)
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource   File "/usr/lib/python3.9/site-packages/heat/engine/resources/openstack/octavia/loadbalancer.py", line 169, in check_create_complete
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource     return self._check_status()
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource   File "/usr/lib/python3.9/site-packages/heat/engine/resources/openstack/octavia/octavia_base.py", line 29, in _check_status
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource     raise exception.ResourceInError(resource_status=status)
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource heat.common.exception.ResourceInError: Went to status ERROR due to "Unknown"
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource
2024-08-21 09:54:58.804 3200319 INFO heat.engine.scheduler [None req-610b24bd-7511-4d31-9f8c-b922162a7648 jouvin - - - - -] Task pause timed out
2024-08-21 09:54:59.340 3200318 INFO heat.engine.scheduler [None req-610b24bd-7511-4d31-9f8c-b922162a7648 jouvin - - - - -] Task pause timed out
-----
This results in Magnum setting the state to CREATE_FAILED even though it is a transient state from Heat's point of view: the Heat stack creation continues successfully once the transient state clears, and Magnum later updates the status to CREATE_COMPLETE. Unfortunately, the magnum-conductor log has nothing related to these events (it is mainly filled with barbican_client related messages).
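To confirm that the ERROR reported by Heat is only transient on the Octavia side, one option is to watch the LB provisioning_status and the Heat stack status in parallel while the cluster is being created and correlate the timestamps with the Octavia worker logs. A rough sketch using openstacksdk (the cloud name and the stack name are placeholders):

-----
import time

import openstack

# Placeholders: adapt the cloud name and the Heat stack name/ID to the
# cluster being created.
conn = openstack.connect(cloud="mycloud")
stack_name = "<heat-stack-name-or-id>"

for _ in range(120):
    stack = conn.orchestration.find_stack(stack_name)
    lb_status = {lb.name: lb.provisioning_status
                 for lb in conn.load_balancer.load_balancers()}
    print(time.strftime("%H:%M:%S"),
          stack.status if stack else "stack not found",
          lb_status)
    time.sleep(10)
-----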
Any idea what could cause this behaviour?
Best regards,
Michel