[openstack-dev] [TripleO/heat] openstack debug command

Steven Hardy shardy at redhat.com
Mon Nov 30 21:28:49 UTC 2015


On Tue, Dec 01, 2015 at 08:47:20AM +1300, Steve Baker wrote:
> On 30/11/15 23:21, Steven Hardy wrote:
> >On Mon, Nov 30, 2015 at 10:03:29AM +0100, Lennart Regebro wrote:
> >>I'm tasked to implement a command that shows error messages when a
> >>deployment has failed. I have a vague memory of having seen scripts
> >>that do something like this, if that exists, can somebody point me in
> >>the right direction?
> >I wrote a super simple script and put it in a blog post a while back:
> >
> >http://hardysteven.blogspot.co.uk/2015/05/tripleo-heat-templates-part-3-cluster.html
> >
> >All it does is find the failed SoftwareDeployment resources, then do heat
> >deployment-show on the resource, so you can see the stderr associated with
> >the failure.
> >
> >Having tripleoclient do that by default would be useful.
> >
> >>Any opinions on what that should do, specifically? Traverse failed
> >>resources to find error messages, I assume. Anything else?
> >Yeah, but I think for this to be useful, we need to go a bit deeper than
> >just showing the resource error - there are a number of typical failure
> >modes, and I end up repeating the same steps to debug every time.
> >
> >1. SoftwareDeployment failed (mentioned above).  Every time, you need to
> >see the name of the SoftwareDeployment which failed, figure out if it
> >failed on one or all of the servers, then look at the stderr for clues.
> >
> >2. A server failed to build (OS::Nova::Server resource is FAILED), here we
> >need to check both nova and ironic, looking first to see if ironic has the
> >node(s) in the wrong state for scheduling (e.g. nova gave us a "no valid
> >host error), and then if they are OK in ironic, do nova show on the failed
> >host to see the reason nova gives us for it failing to go ACTIVE.
> >
> >3. A stack timeout happened.  IIRC when this happens, we currently fail
> >with an obscure keystone related backtrace due to the token expiring.  We
> >should instead catch this error and show the heat stack status_reason,
> >which should say clearly the stack timed out.
> >
> >If we could just make these three cases really clear and easy to debug, I
> >think things would be much better (IME the above are a high proportion of
> >all failures), but I'm sure folks can come up with other ideas to add to
> >the list.
> >
> I'm actually drafting a spec which includes a command which does this. I
> hope to submit it soon, but here is the current state of that command's
> description:
> 
> Diagnosing resources in a FAILED state
> --------------------------------------
> 
> One command will be implemented:
> - openstack overcloud failed list
> 
> This will print a yaml tree showing the hierarchy of nested stacks until it
> gets to the actual failed resource, then it will show information regarding
> the failure. For most resource types this information will be the
> status_reason, but for software-deployment resources the deploy_stdout,
> deploy_stderr and deploy_status_code will be printed.
> 
> In addition to this stand-alone command, this output will also be printed
> when an ``openstack overcloud deploy`` or ``openstack overcloud update``
> command results in a stack in a FAILED state.

This sounds great!
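For what it's worth, the nested-stack traversal that spec describes could be
sketched roughly like this. This is a toy sketch: the dicts and field names
(resource_status, deploy_stderr, etc.) mimic Heat's API shapes, but the data
and the collect_failures helper are made up, not actual tripleoclient code:

```python
# Toy sketch of recursing through nested stacks to find FAILED resources.
# Plain dicts stand in for real heatclient resource objects.

def collect_failures(stack, path=()):
    """Recursively walk nested stacks, yielding (path, resource) for
    every resource in a FAILED state."""
    for res in stack["resources"]:
        here = path + (res["resource_name"],)
        if "FAILED" not in res["resource_status"]:
            continue
        nested = res.get("nested_stack")
        if nested:
            # Recurse until we reach the actually-failed leaf resource.
            yield from collect_failures(nested, here)
        else:
            yield here, res

# Made-up example shaped like an overcloud deployment tree.
overcloud = {
    "resources": [
        {"resource_name": "Controller", "resource_status": "CREATE_FAILED",
         "nested_stack": {"resources": [
             {"resource_name": "NetworkDeployment",
              "resource_status": "CREATE_FAILED",
              "resource_status_reason": "Error: deploy_status_code 1",
              "deploy_stderr": "ifup: interface eth2 not found"},
         ]}},
        {"resource_name": "Compute", "resource_status": "CREATE_COMPLETE"},
    ],
}

for path, res in collect_failures(overcloud):
    # For deployments show the stderr; otherwise the status_reason.
    print("/".join(path), "->", res.get("deploy_stderr",
                                        res.get("resource_status_reason")))
```

For the toy tree above this prints the one failed leaf,
"Controller/NetworkDeployment -> ifup: interface eth2 not found", which is
the sort of output a yaml tree view could build on.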

Another piece of low-hanging-fruit in the meantime is we should actually
print the stack_status_reason on failure:

https://github.com/openstack/python-tripleoclient/blob/master/tripleoclient/v1/overcloud_deploy.py#L280

The DeploymentError raised could include the stack_status_reason instead of
the unqualified "Heat Stack create failed".
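Concretely the change could be as small as something like the following.
This is only a sketch with stubbed-out classes: the real DeploymentError and
stack object live in tripleoclient/heatclient, and the function name here is
hypothetical:

```python
# Sketch: qualify the DeploymentError with the stack_status_reason rather
# than raising a bare "Heat Stack create failed". Stubbed, not real code.

class DeploymentError(RuntimeError):
    """Stand-in for tripleoclient's DeploymentError."""

def check_stack(stack):
    # Append the status_reason so the user sees *why* the stack failed.
    if stack.stack_status.endswith("FAILED"):
        raise DeploymentError(
            "Heat Stack create failed: %s" % stack.stack_status_reason)

class FakeStack:
    """Stub with the same attribute names as a heatclient stack."""
    stack_status = "CREATE_FAILED"
    stack_status_reason = "Create timed out"

try:
    check_stack(FakeStack())
except DeploymentError as exc:
    print(exc)  # Heat Stack create failed: Create timed out
```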

I guess your event listing partially overlaps with this, as you can now
derive the stack_status_reason from the last event, but it'd still be good
to loudly output it so folks can see more quickly when things such as
timeouts happen that are clearly displayed in the top-level stack status.
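Deriving it from the event listing would amount to something like this sketch
(event fields mimic Heat's event API, the data and helper are made up):

```python
# Sketch: recover the top-level stack's status_reason from its most
# recent event. Made-up data shaped like Heat event records.

def last_status_reason(events):
    """Return the resource_status_reason of the newest event."""
    newest = max(events, key=lambda e: e["event_time"])
    return newest["resource_status_reason"]

events = [
    {"event_time": "2015-11-30T20:00:01",
     "resource_status_reason": "Stack CREATE started"},
    {"event_time": "2015-11-30T21:00:05",
     "resource_status_reason": "Create timed out"},
]

print(last_status_reason(events))  # Create timed out
```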

Steve


