It looks like you have a problem with the Octavia service, which is responsible for creating the LBs for the API server and etcd. Check the Octavia logs and verify whether creating an LB manually (openstack CLI/GUI) works correctly in your environment (see the sketch below).

Kind regards,
Paweł
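For example, a minimal sketch using openstacksdk to create a test LB and watch its provisioning_status; the cloud name and the subnet ID are placeholders to adapt to your environment:

-----
import time

import openstack

# Placeholders: a clouds.yaml entry named "mycloud" and an existing tenant
# subnet UUID; adapt both to your environment.
conn = openstack.connect(cloud="mycloud")

lb = conn.load_balancer.create_load_balancer(
    name="lb-manual-test", vip_subnet_id="<subnet-uuid>")

# Poll the provisioning status, as "openstack loadbalancer show" would.
for _ in range(60):
    lb = conn.load_balancer.get_load_balancer(lb.id)
    print(lb.provisioning_status, lb.operating_status)
    if lb.provisioning_status in ("ACTIVE", "ERROR"):
        break
    time.sleep(10)

# Clean up the test LB afterwards.
conn.load_balancer.delete_load_balancer(lb.id)
-----

On 21.08.2024 at 11:24, Michel Jouvin wrote: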
Additional information: a similar issue seems to happen when deleting a cluster, leading to a deletion failure (cluster status = DELETE_FAILED) on the first attempt and requiring a second attempt to delete the cluster successfully.
During the first attempt, the following exception occurs in the magnum-conductor logs:
-----
087 6a7221ba840247cdae245d7a61e0023b - - c0134d6e05ce4401ab78951ff1f2604c -] Exception during message handling: magnum.common.exception.PreDeletionFailed: Failed to pre-delete resources for cluster cd9d4497-8147-45e9-a9f6-473754d1e290, error: Timeout waiting for the load balancers {'a193b831-9c7f-469e-aa9e-c8bc95777b19', 'd7af14d7-64ee-444e-9988-079c53661dcc'} to be deleted..
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/magnum/common/octavia.py", line 134, in delete_loadbalancers
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     wait_for_lb_deleted(octavia_client, candidates)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/magnum/common/octavia.py", line 51, in wait_for_lb_deleted
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     raise Exception("Timeout waiting for the load balancers "
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server Exception: Timeout waiting for the load balancers {'a193b831-9c7f-469e-aa9e-c8bc95777b19', 'd7af14d7-64ee-444e-9988-079c53661dcc'} to be deleted.
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server During handling of the above exception, another exception occurred:
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/osprofiler/profiler.py", line 159, in wrapper
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     result = f(*args, **kwargs)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/magnum/conductor/handlers/cluster_conductor.py", line 192, in cluster_delete
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     cluster_driver.delete_cluster(context, cluster)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/magnum/drivers/heat/driver.py", line 152, in delete_cluster
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     self.pre_delete_cluster(context, cluster)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/magnum/drivers/heat/driver.py", line 297, in pre_delete_cluster
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     octavia.delete_loadbalancers(context, cluster)
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/magnum/common/octavia.py", line 136, in delete_loadbalancers
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server     raise exception.PreDeletionFailed(cluster_uuid=cluster.uuid,
2024-08-21 11:17:05.731 3085140 ERROR oslo_messaging.rpc.server magnum.common.exception.PreDeletionFailed: Failed to pre-delete resources for cluster cd9d4497-8147-45e9-a9f6-473754d1e290, error: Timeout waiting for the load balancers {'a193b831-9c7f-469e-aa9e-c8bc95777b19', 'd7af14d7-64ee-444e-9988-079c53661dcc'} to be deleted..
-----
This happens when the deletion of the Octavia LBs is still in progress and results in the cluster state becoming DELETE_FAILED. When the cluster deletion is retried after the LBs have actually been deleted, the cluster is deleted successfully.
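For reference, the pre-deletion step shown in the traceback is essentially a poll-until-gone loop with a fixed deadline; when the Octavia LBs take longer than that to disappear, the PreDeletionFailed above is raised and the cluster ends up in DELETE_FAILED. A simplified sketch of that pattern (not the actual Magnum code; the timeout and interval values are illustrative):

-----
import time


def wait_for_lb_deleted(list_existing_lb_ids, candidates, timeout=600, interval=10):
    """Poll until none of the candidate LB IDs exist any more, or give up.

    Simplified illustration of the pattern behind
    magnum/common/octavia.py:wait_for_lb_deleted; list_existing_lb_ids
    stands in for the Octavia client call, and the timeout/interval
    values are illustrative, not Magnum's defaults.
    """
    deadline = time.time() + timeout
    remaining = set(candidates)
    while remaining:
        if time.time() > deadline:
            raise Exception("Timeout waiting for the load balancers "
                            "%s to be deleted." % remaining)
        # Keep only the candidates Octavia still reports as existing.
        remaining &= set(list_existing_lb_ids())
        if remaining:
            time.sleep(interval)
-----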
Michel
On 21/08/2024 at 11:12, Michel Jouvin wrote:
Hi,
I am continuing my tests with Magnum Caracal and found another strange behaviour during cluster creation: the cluster goes through the CREATE_FAILED state before reaching CREATE_COMPLETE (HEALTHY), which is very confusing.
Looking at the Heat engine logs, this happens during load balancer creation, at the point where the LB amphora has been created and Octavia is waiting to be able to connect to it (typically about one minute). During this period, Heat logs the following error:
-----
2024-08-21 09:54:52.194 3200318 INFO heat.engine.resource [None req-610b24bd-7511-4d31-9f8c-b922162a7648 jouvin - - - - -] CREATE: LoadBalancer "loadbalancer" [670985ef-5808-4226-ad1b-b1244871de16] Stack "test-mj-1-30-m1n1-bis-ipufjomqmnnf-etcd_lb-d5fuqygvpp2j" [5c5905bf-f538-4a42-93ce-38d96b7c6603]
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource Traceback (most recent call last):
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource   File "/usr/lib/python3.9/site-packages/heat/engine/resource.py", line 922, in _action_recorder
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource     yield
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource   File "/usr/lib/python3.9/site-packages/heat/engine/resource.py", line 1034, in _do_action
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource     yield from self.action_handler_task(action, args=handler_args)
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource   File "/usr/lib/python3.9/site-packages/heat/engine/resource.py", line 984, in action_handler_task
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource     done = check(handler_data)
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource   File "/usr/lib/python3.9/site-packages/heat/engine/resources/openstack/octavia/loadbalancer.py", line 169, in check_create_complete
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource     return self._check_status()
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource   File "/usr/lib/python3.9/site-packages/heat/engine/resources/openstack/octavia/octavia_base.py", line 29, in _check_status
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource     raise exception.ResourceInError(resource_status=status)
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource heat.common.exception.ResourceInError: Went to status ERROR due to "Unknown"
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource
2024-08-21 09:54:58.804 3200319 INFO heat.engine.scheduler [None req-610b24bd-7511-4d31-9f8c-b922162a7648 jouvin - - - - -] Task pause timed out
2024-08-21 09:54:59.340 3200318 INFO heat.engine.scheduler [None req-610b24bd-7511-4d31-9f8c-b922162a7648 jouvin - - - - -] Task pause timed out
-----
This results in Magnum setting the state to CREATE_FAILED even though it is a transient state from Heat's point of view: the Heat stack creation continues successfully once the transient state clears, and Magnum later updates the status to CREATE_COMPLETE. Unfortunately, the magnum-conductor log has nothing related to these events (it is mainly filled with barbican_client related messages).
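To confirm that the ERROR reported by Heat is only transient on the Octavia side, one option is to watch the LB provisioning_status and the Heat stack status in parallel while the cluster is being created and correlate the timestamps with the Octavia worker logs. A rough sketch using openstacksdk (the cloud name and the stack name are placeholders):

-----
import time

import openstack

# Placeholders: adapt the cloud name and the Heat stack name/ID to the
# cluster being created.
conn = openstack.connect(cloud="mycloud")
stack_name = "<heat-stack-name-or-id>"

for _ in range(120):
    stack = conn.orchestration.find_stack(stack_name)
    lb_status = {lb.name: lb.provisioning_status
                 for lb in conn.load_balancer.load_balancers()}
    print(time.strftime("%H:%M:%S"),
          stack.status if stack else "stack not found",
          lb_status)
    time.sleep(10)
-----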
Any idea what could cause this behaviour?
Best regards,
Michel