Hey Zane,

Thank you so much for the details - super interesting. We've worked with the vendor to try to reproduce the issue with Heat logging set to DEBUG. Unfortunately, every creation they have attempted since then has succeeded. It initially failed 4 times out of 5 and has worked every time since...

It's one of those problems! We'll keep trying to reproduce. Just to be sure: the actual YAML is stored in the DB and then read back to create the actual Heat resources?

Thanks!

On Wed, Jul 22, 2020 at 3:46 PM Zane Bitter <zbitter@redhat.com> wrote:
On 21/07/20 8:03 pm, Laurent Dumont wrote:
> Hi!
>
> We are currently troubleshooting a Heat stack issue where one of the
> stacks (one of 25 or so) is failing to be created properly (seemingly
> at random).
>
> The actual error returned by Heat is quite strange and Google has been
> quite sparse in terms of references.
>
> The actual error looks like the following (I've sanitized some of the
> names):
>
> Resource CREATE failed: resources.potato: Resource CREATE failed:
> resources[0]: raw template with id 22273 not found

When creating a nested stack, rather than just calling the RPC method to
create a new stack, Heat stores the template in the database first and
passes the ID in the RPC message.[1] (It turns out that by doing it this
way we can save massive amounts of memory when processing a large tree
of nested stacks.) My best guess is that this message indicates that the
template row has been deleted by the time the other engine goes to look
at it.
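
As a rough illustration of the store-then-pass-the-ID pattern described above, here is a minimal Python sketch. It is not Heat's code: RawTemplateStore, build_create_message and handle_create_message are hypothetical names, and a dict stands in for the database table. It only shows why the receiving side reports a "not found" error when the row is gone before the lookup.

```python
import itertools


class TemplateNotFound(Exception):
    """Raised when a stored template row no longer exists."""


class RawTemplateStore:
    """Hypothetical in-memory stand-in for the template table."""

    def __init__(self):
        self._rows = {}
        self._ids = itertools.count(1)

    def store(self, template):
        # Persist the template and hand back its row ID.
        template_id = next(self._ids)
        self._rows[template_id] = template
        return template_id

    def get(self, template_id):
        try:
            return self._rows[template_id]
        except KeyError:
            raise TemplateNotFound(
                "raw template with id %d not found" % template_id) from None

    def delete(self, template_id):
        self._rows.pop(template_id, None)


def build_create_message(store, stack_name, template):
    """Parent side: store the template, put only its ID in the message."""
    return {"stack_name": stack_name, "template_id": store.store(template)}


def handle_create_message(store, message):
    """Receiving side: resolve the ID back into the template."""
    return store.get(message["template_id"])
```

The message stays small no matter how large the template is, at the cost that the receiver depends on the row still existing when it gets around to the lookup.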

I don't see how you could have got an ID like 22273 without the template
having been successfully stored at some point.

The template is only supposed to be deleted if the RPC call returns with
an error.[2] The only way I can think of for that to happen before an
attempt to create the child stack is if the RPC call times out, but the
original message is eventually picked up by an engine. I would check
your logs for RPC timeouts and consider increasing them.
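
To make the suspected ordering concrete, here is a small Python sketch of that timeline. It is a toy model under stated assumptions, not Heat's implementation: simulate() and its parameters are hypothetical, times are plain integers in seconds, and a dict stands in for the template table. The only thing it models is "the caller times out and deletes the row, then an engine picks up the original message anyway".

```python
def simulate(rpc_timeout, engine_pickup_delay, template_id=22273):
    """Return a log of events for one child-stack create request.

    rpc_timeout:          seconds the caller waits for a reply (assumed)
    engine_pickup_delay:  seconds until an engine consumes the message (assumed)
    """
    # The caller has already stored the template and sent the message.
    templates = {template_id: {"resources": {}}}
    log = []

    if engine_pickup_delay > rpc_timeout:
        # The caller gives up first, treats the timeout as an error and
        # deletes the row it stored. The message is still in the queue.
        del templates[template_id]
        log.append("t=%ds caller: RPC timeout, template %d deleted"
                   % (rpc_timeout, template_id))

    # The queued message is eventually delivered either way.
    if template_id in templates:
        log.append("t=%ds engine: creating stack from template %d"
                   % (engine_pickup_delay, template_id))
    else:
        log.append("t=%ds engine: raw template with id %d not found"
                   % (engine_pickup_delay, template_id))
    return log


if __name__ == "__main__":
    for line in simulate(rpc_timeout=60, engine_pickup_delay=90):
        print(line)
```

In this model the error disappears as soon as the timeout exceeds the pickup delay, which is why raising the RPC timeout is worth trying. Which setting controls that in a given deployment is an assumption here; for oslo.messaging-based services it is usually rpc_response_timeout.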

What does the status_reason look like at one level above in the tree?
That should indicate the first error that caused the template to be deleted.

>     heat resource-list STACK_NAME_HERE -n 50
>     +---------------------+----------------------+-------------------------+-----------------+----------------------+--------------------------------+
>     | resource_name       | physical_resource_id | resource_type           | resource_status | updated_time         | stack_name                     |
>     +---------------------+----------------------+-------------------------+-----------------+----------------------+--------------------------------+
>     | potato              | RESOURCE_ID_HERE     | OS::Heat::ResourceGroup | CREATE_FAILED   | 2020-07-18T19:52:10Z | nested_stack_1_STACK_NAME_HERE |
>     | potato_server_group | RESOURCE_ID_HERE     | OS::Nova::ServerGroup   | CREATE_COMPLETE | 2020-07-21T19:52:10Z | nested_stack_1_STACK_NAME_HERE |
>     | 0                   |                      | potato1.yaml            | CREATE_FAILED   | 2020-07-18T19:52:12Z | nested_stack_2_STACK_NAME_HERE |
>     | 1                   |                      | potato1.yaml            | INIT_COMPLETE   | 2020-07-18T19:52:12Z | nested_stack_2_STACK_NAME_HERE |
>     +---------------------+----------------------+-------------------------+-----------------+----------------------+--------------------------------+
>
>
> The template itself is pretty simple and attempts to create a
> ServerGroup and 2 VMs (as part of the ResourceGroup). My feeling is that
> the creation of one of those machines fails and Heat gets a little kooky
> and returns an error that might not be the actual root cause. I would
> have expected the VM to show up in the resource list but I just see the
> source "yaml".

It's clear from the above output that the scaled unit of the resource
group is in fact a template (not an OS::Nova::Server), and the error is
occurring while trying to create a stack from that template (potato1.yaml) -
before Heat even has a chance to start creating the server.

> Has anyone seen something similar in the past?

Nope.

cheers,
Zane.

[1]
https://opendev.org/openstack/heat/src/branch/master/heat/engine/resources/stack_resource.py#L367-L384
[2]
https://opendev.org/openstack/heat/src/branch/master/heat/engine/resources/stack_resource.py#L335-L342