[openstack-dev] [TripleO/heat] openstack debug command

Steve Baker sbaker at redhat.com
Mon Nov 30 19:47:20 UTC 2015


On 30/11/15 23:21, Steven Hardy wrote:
> On Mon, Nov 30, 2015 at 10:03:29AM +0100, Lennart Regebro wrote:
>> I'm tasked to implement a command that shows error messages when a
>> deployment has failed. I have a vague memory of having seen scripts
>> that do something like this, if that exists, can somebody point me in
>> the right direction?
> I wrote a super simple script and put it in a blog post a while back:
>
> http://hardysteven.blogspot.co.uk/2015/05/tripleo-heat-templates-part-3-cluster.html
>
> All it does is find the failed SoftwareDeployment resources, then do heat
> deployment-show on the resource, so you can see the stderr associated with
> the failure.
>
> Having tripleoclient do that by default would be useful.
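The traversal that script performs can be sketched roughly as follows. This is a hedged illustration, not the actual blog-post script: plain dicts stand in for the rows python-heatclient would return, with field names mirroring the columns of ``heat resource-list --nested-depth 5 <stack>``.

```python
# Hedged sketch: find FAILED software deployment resources, then note the
# physical_resource_id to pass to `heat deployment-show`. Plain dicts stand
# in for what python-heatclient would actually return.

DEPLOY_TYPES = ("OS::Heat::SoftwareDeployment",
                "OS::Heat::StructuredDeployment")

def failed_deployments(resources):
    """Return the FAILED software/structured deployment resources."""
    return [r for r in resources
            if r["resource_status"].endswith("FAILED")
            and r["resource_type"] in DEPLOY_TYPES]

resources = [
    {"resource_name": "ControllerDeployment",
     "resource_type": "OS::Heat::StructuredDeployment",
     "resource_status": "CREATE_COMPLETE",
     "physical_resource_id": "aaa-111"},
    {"resource_name": "ComputeDeployment",
     "resource_type": "OS::Heat::SoftwareDeployment",
     "resource_status": "CREATE_FAILED",
     "physical_resource_id": "bbb-222"},
]

# Each physical_resource_id here is what you would feed to
# `heat deployment-show` to read deploy_stderr for the failure.
for r in failed_deployments(resources):
    print(r["resource_name"], r["physical_resource_id"])
```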
>
>> Any opinions on what that should do, specifically? Traverse failed
>> resources to find error messages, I assume. Anything else?
> Yeah, but I think for this to be useful, we need to go a bit deeper than
> just showing the resource error - there are a number of typical failure
> modes, and I end up repeating the same steps to debug every time.
>
> 1. SoftwareDeployment failed (mentioned above).  Every time, you need to
> see the name of the SoftwareDeployment which failed, figure out if it
> failed on one or all of the servers, then look at the stderr for clues.
>
> 2. A server failed to build (an OS::Nova::Server resource is FAILED). Here we
> need to check both nova and ironic, looking first to see whether ironic has the
> node(s) in the wrong state for scheduling (e.g. nova gave us a "no valid
> host" error), and then, if they look OK in ironic, do nova show on the failed
> host to see the reason nova gives for it failing to go ACTIVE.
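That triage order can be sketched as a tiny helper. The dict shapes below are assumptions, loosely mirroring ``ironic node-list`` and ``nova show`` output rather than any real client API:

```python
# Illustrative triage for a FAILED OS::Nova::Server resource: check ironic
# for unschedulable nodes first, then fall back to nova's fault message.
# Field names are stand-ins, not a real ironic/nova client API.

def diagnose_failed_server(server, ironic_nodes):
    bad = [n for n in ironic_nodes
           if n["maintenance"] or n["provision_state"] != "available"]
    if bad and "No valid host" in server.get("fault", ""):
        # Scheduling failed because ironic had no usable nodes.
        return "ironic: unschedulable nodes: " + ", ".join(
            n["uuid"] for n in bad)
    # Nodes looked fine in ironic; report why nova says the server
    # never went ACTIVE.
    return "nova: " + server.get("fault", "unknown failure")

nodes = [{"uuid": "node-1", "maintenance": True,
          "provision_state": "available"}]
server = {"status": "ERROR", "fault": "No valid host was found."}
print(diagnose_failed_server(server, nodes))
```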
>
> 3. A stack timeout happened.  IIRC when this happens, we currently fail
> with an obscure keystone related backtrace due to the token expiring.  We
> should instead catch this error and show the heat stack status_reason,
> which should state clearly that the stack timed out.
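The fix for case 3 amounts to a try/except around the polling loop. A rough sketch, where ``TokenExpiredError`` is a stand-in for whatever keystone actually raises and ``get_stack`` stands in for a heatclient stack fetch:

```python
# Sketch only: catch the token-expiry blowup on stack timeout and show the
# stack's own status_reason instead of an obscure keystone backtrace.
# TokenExpiredError and the dict shape are illustrative assumptions.

class TokenExpiredError(Exception):
    pass

def wait_for_stack(poll, get_stack):
    try:
        return poll()
    except TokenExpiredError:
        # The token expired because the stack operation timed out;
        # report heat's status_reason, which says so plainly.
        stack = get_stack()
        return "Stack %s: %s" % (stack["stack_status"],
                                 stack["stack_status_reason"])

def _poll():
    raise TokenExpiredError("token expired")

result = wait_for_stack(
    _poll,
    lambda: {"stack_status": "CREATE_FAILED",
             "stack_status_reason": "Create timed out"})
print(result)
```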
>
> If we could just make these three cases really clear and easy to debug, I
> think things would be much better (IME the above are a high proportion of
> all failures), but I'm sure folks can come up with other ideas to add to
> the list.
>
I'm actually drafting a spec which includes a command that does this. I
hope to submit it soon, but here is the current state of that command's
description:

Diagnosing resources in a FAILED state
--------------------------------------

One command will be implemented:

- ``openstack overcloud failed list``

This will print a yaml tree showing the hierarchy of nested stacks until it
gets to the actual failed resource, then it will show information regarding
the failure. For most resource types this information will be the
status_reason, but for software-deployment resources the deploy_stdout,
deploy_stderr and deploy_status_code will be printed.
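To make that concrete, here is a purely hypothetical example of the kind of tree it might print (resource names invented; the actual format will be defined in the spec):

```yaml
overcloud:
  Compute:
    '0':
      NovaComputeDeployment:
        resource_status: CREATE_FAILED
        deploy_status_code: 1
        deploy_stdout: |
          ...
        deploy_stderr: |
          ...
```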

In addition to this stand-alone command, this output will also be printed
when an ``openstack overcloud deploy`` or ``openstack overcloud update``
command results in a stack in a FAILED state.




More information about the OpenStack-dev mailing list