Magnum Caracal: cluster creation goes through state CREATION_FAILED before CREATION_COMPLETE

21 Aug 2024

Hi,

I am continuing my tests with Magnum Caracal and have found another
strange behaviour during cluster creation: the cluster goes through the
CREATE_FAILED state before reaching CREATE_COMPLETE (HEALTHY), which is
very confusing.
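To make the sequence visible with timestamps, something like the
following sketch could be used. It is only an illustration: the function
names record_transitions and cli_status are mine, the polling interval is
arbitrary, and it assumes the "openstack coe cluster show" command from
the Magnum client plugin is available.

```python
import subprocess
import time
from datetime import datetime, timezone


def cli_status(cluster):
    """Return the cluster status as reported by the OpenStack CLI.

    Assumes the Magnum OSC plugin is installed and credentials are loaded.
    """
    result = subprocess.run(
        ["openstack", "coe", "cluster", "show", cluster,
         "-f", "value", "-c", "status"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def record_transitions(get_status, interval=5.0, max_polls=360,
                       terminal=("CREATE_COMPLETE",), sleep=time.sleep):
    """Poll get_status() and return a list of (UTC timestamp, status) changes.

    CREATE_FAILED is deliberately not treated as terminal, since the point
    is to see whether the cluster leaves that state afterwards.
    """
    transitions = []
    last = None
    for _ in range(max_polls):
        status = get_status()
        if status != last:
            # Record only changes, so the output stays short.
            transitions.append((datetime.now(timezone.utc).isoformat(), status))
            last = status
        if status in terminal:
            break
        sleep(interval)
    return transitions


# Example (hypothetical cluster name):
#   for when, status in record_transitions(lambda: cli_status("my-cluster")):
#       print(when, status)
```

The timestamps can then be compared with the Heat engine log to see
whether the status change coincides with the load balancer error.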

According to the Heat engine logs, this happens during load balancer
creation, after the LB amphora has been created and while Octavia is
waiting to be able to connect to it (typically about one minute). During
this period, Heat logs the following error:

-----------

2024-08-21 09:54:52.194 3200318 INFO heat.engine.resource [None req-610b24bd-7511-4d31-9f8c-b922162a7648 jouvin - - - - -] CREATE: LoadBalancer "loadbalancer" [670985ef-5808-4226-ad1b-b1244871de16] Stack "test-mj-1-30-m1n1-bis-ipufjomqmnnf-etcd_lb-d5fuqygvpp2j" [5c5905bf-f538-4a42-93ce-38d96b7c6603]
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource Traceback (most recent call last):
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource   File "/usr/lib/python3.9/site-packages/heat/engine/resource.py", line 922, in _action_recorder
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource yield
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource   File "/usr/lib/python3.9/site-packages/heat/engine/resource.py", line 1034, in _do_action
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource yield from self.action_handler_task(action, args=handler_args)
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource   File "/usr/lib/python3.9/site-packages/heat/engine/resource.py", line 984, in action_handler_task
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource done = check(handler_data)
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource   File "/usr/lib/python3.9/site-packages/heat/engine/resources/openstack/octavia/loadbalancer.py", line 169, in check_create_complete
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource return self._check_status()
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource   File "/usr/lib/python3.9/site-packages/heat/engine/resources/openstack/octavia/octavia_base.py", line 29, in _check_status
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource raise exception.ResourceInError(resource_status=status)
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource heat.common.exception.ResourceInError: Went to status ERROR due to "Unknown"
2024-08-21 09:54:52.194 3200318 ERROR heat.engine.resource
2024-08-21 09:54:58.804 3200319 INFO heat.engine.scheduler [None req-610b24bd-7511-4d31-9f8c-b922162a7648 jouvin - - - - -] Task pause timed out
2024-08-21 09:54:59.340 3200318 INFO heat.engine.scheduler [None req-610b24bd-7511-4d31-9f8c-b922162a7648 jouvin - - - - -] Task pause timed out
-----

This results in Magnum setting the state to CREATE_FAILED, even though
it is a transient state from Heat's point of view: Heat stack creation
continues successfully once the transient state has cleared, and Magnum
later updates the status to CREATE_COMPLETE. Unfortunately, the
magnum-conductor log has nothing related to these events (it is mainly
filled with barbican_client-related messages).
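If I read the traceback correctly, the check that fails behaves roughly
like the sketch below. This is my own simplified model, not the Heat
source: ResourceInError here is a local stand-in for
heat.common.exception.ResourceInError, and check_status and
polls_until_active are hypothetical names. It only illustrates why a
single ERROR observation is enough to raise, even if the load balancer
would have become ACTIVE a moment later.

```python
class ResourceInError(Exception):
    """Local stand-in for heat.common.exception.ResourceInError."""


def check_status(provisioning_status, expected="ACTIVE"):
    """Return True once the expected status is seen; raise on ERROR.

    Assumption: the real check raises as soon as the status is ERROR,
    which is what the traceback suggests.
    """
    if provisioning_status == "ERROR":
        raise ResourceInError(
            'Went to status %s due to "Unknown"' % provisioning_status
        )
    return provisioning_status == expected


def polls_until_active(observed_statuses):
    """Return how many polls were needed to see ACTIVE, or None if never seen."""
    for count, status in enumerate(observed_statuses, start=1):
        if check_status(status):
            return count
    return None


# A normal creation: two pending observations, then ACTIVE.
#   polls_until_active(["PENDING_CREATE", "PENDING_CREATE", "ACTIVE"])
# A transient ERROR raises ResourceInError before ACTIVE is ever checked.
#   polls_until_active(["PENDING_CREATE", "ERROR", "ACTIVE"])
```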

Any idea what could cause this behaviour?

Best regards,

Michel

Michel Jouvin