[openstack-dev] [TripleO/heat] openstack debug command
sbaker at redhat.com
Tue Dec 1 02:39:05 UTC 2015
On 01/12/15 10:28, Steven Hardy wrote:
> On Tue, Dec 01, 2015 at 08:47:20AM +1300, Steve Baker wrote:
>> On 30/11/15 23:21, Steven Hardy wrote:
>>> On Mon, Nov 30, 2015 at 10:03:29AM +0100, Lennart Regebro wrote:
>>>> I'm tasked to implement a command that shows error messages when a
>>>> deployment has failed. I have a vague memory of having seen scripts
>>>> that do something like this. If that exists, can somebody point me in
>>>> the right direction?
>>> I wrote a super simple script and put it in a blog post a while back:
>>> All it does is find the failed SoftwareDeployment resources, then do heat
>>> deployment-show on the resource, so you can see the stderr associated with
>>> the failure.
>>> Having tripleoclient do that by default would be useful.
>>>> Any opinions on what that should do, specifically? Traverse failed
>>>> resources to find error messages, I assume. Anything else?
>>> Yeah, but I think for this to be useful, we need to go a bit deeper than
>>> just showing the resource error - there are a number of typical failure
>>> modes, and I end up repeating the same steps to debug every time.
>>> 1. SoftwareDeployment failed (mentioned above). Every time, you need to
>>> see the name of the SoftwareDeployment which failed, figure out if it
>>> failed on one or all of the servers, then look at the stderr for clues.
>>> 2. A server failed to build (OS::Nova::Server resource is FAILED), here we
>>> need to check both nova and ironic, looking first to see if ironic has the
>>> node(s) in the wrong state for scheduling (e.g. nova gave us a no valid
>>> host error), and then if they are OK in ironic, do nova show on the failed
>>> host to see the reason nova gives us for it failing to go ACTIVE.
>>> 3. A stack timeout happened. IIRC when this happens, we currently fail
>>> with an obscure keystone related backtrace due to the token expiring. We
>>> should instead catch this error and show the heat stack status_reason,
>>> which should say clearly the stack timed out.
>>> If we could just make these three cases really clear and easy to debug, I
>>> think things would be much better (IME the above are a high proportion of
>>> all failures), but I'm sure folks can come up with other ideas to add to
>>> the list.
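The three failure modes listed above could be told apart with a small
classifier. This is only a sketch under stated assumptions: the stack and
resources are plain dicts shaped loosely like Heat API output, the function
name classify_failures is hypothetical, and matching "timed out" in the
status reason is an assumption about the wording Heat uses.

```python
# Resource types treated as software deployments (assumed set for this sketch).
DEPLOYMENT_TYPES = {
    "OS::Heat::SoftwareDeployment",
    "OS::Heat::StructuredDeployment",
}


def classify_failures(stack, resources):
    """Return (category, name, hint) tuples for a failed stack.

    `stack` and `resources` are plain dicts, not real client objects.
    """
    findings = []

    # Case 3: stack timeout, reported via the stack status reason.
    reason = stack.get("stack_status_reason") or ""
    if "timed out" in reason.lower():
        findings.append(("timeout", stack["stack_name"], reason))

    for res in resources:
        if not res["resource_status"].endswith("FAILED"):
            continue
        if res["resource_type"] in DEPLOYMENT_TYPES:
            # Case 1: a deployment failed, stderr is the next place to look.
            hint = "inspect deploy_stderr for this deployment"
            findings.append(("deployment", res["resource_name"], hint))
        elif res["resource_type"] == "OS::Nova::Server":
            # Case 2: a server failed to build.
            hint = "check ironic node state, then nova show"
            findings.append(("server", res["resource_name"], hint))
        else:
            hint = res.get("resource_status_reason", "")
            findings.append(("other", res["resource_name"], hint))
    return findings
```

A real command would populate these dicts from heat, nova and ironic; the
sketch only shows how the results might be grouped before printing.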
>> I'm actually drafting a spec which includes a command which does this. I
>> hope to submit it soon, but here is the current state of that command's
>> Diagnosing resources in a FAILED state
>> One command will be implemented:
>> - openstack overcloud failed list
>> This will print a yaml tree showing the hierarchy of nested stacks until it
>> gets to the actual failed resource, then it will show information regarding
>> failure. For most resource types this information will be the status_reason,
>> but for software-deployment resources the deploy_stdout, deploy_stderr and
>> deploy_status_code will be printed.
>> In addition to this stand-alone command, this output will also be printed
>> when an ``openstack overcloud deploy`` or ``openstack overcloud update`` command
>> results in a stack in a FAILED state.
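The traversal described in the excerpt above might look roughly like the
following. It is a sketch, not the spec's implementation: each resource is a
hypothetical dict with "name", "type", "status" and an optional "nested" list,
and the output is YAML-like indented text rather than the result of a real
YAML library.

```python
DEPLOYMENT_TYPES = {
    "OS::Heat::SoftwareDeployment",
    "OS::Heat::StructuredDeployment",
}


def failed_tree(resource):
    """Return a dict holding only FAILED branches, or None if not failed."""
    if not resource["status"].endswith("FAILED"):
        return None

    # Descend into nested stacks, keeping only failed children.
    merged = {}
    for child in resource.get("nested", []):
        subtree = failed_tree(child)
        if subtree:
            merged.update(subtree)
    if merged:
        return {resource["name"]: merged}

    # Leaf: this is the resource that actually failed.
    if resource["type"] in DEPLOYMENT_TYPES:
        keys = ("deploy_stdout", "deploy_stderr", "deploy_status_code")
    else:
        keys = ("status_reason",)
    return {resource["name"]: {k: resource.get(k) for k in keys}}


def render(tree, indent=0):
    """Render the nested dict as a list of indented, YAML-like lines."""
    lines = []
    for key, value in tree.items():
        if isinstance(value, dict):
            lines.append(" " * indent + f"{key}:")
            lines.extend(render(value, indent + 2))
        else:
            lines.append(" " * indent + f"{key}: {value!r}")
    return lines
```

Calling print("\n".join(render(failed_tree(root)))) on a failed root resource
prints the path down to each failed leaf and nothing else.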
> This sounds great!
The spec is here.
> Another piece of low-hanging-fruit in the meantime is we should actually
> print the stack_status_reason on failure:
> The DeploymentError raised could include the stack_status_reason vs the
> unqualified "Heat Stack create failed".
> I guess your event listing partially overlaps with this, as you can now
> derive the stack_status_reason from the last event, but it would still be good
> to loudly output it so folks can see more quickly when things such as
> timeouts happen that are clearly displayed in the top-level stack status.
Yes, this would be a trivially implemented quick win.
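As a rough illustration of that quick win, the error could carry the reason
along with it. This is a sketch only: the DeploymentError class below is a
local stand-in, check_stack is a hypothetical helper, and the stack is a plain
dict rather than a real Heat client object.

```python
class DeploymentError(Exception):
    """Stand-in for the client's deployment exception (assumed name)."""


def check_stack(stack, action="create"):
    """Raise DeploymentError with the stack_status_reason if the stack failed."""
    status = stack.get("stack_status", "")
    if status.endswith("_FAILED"):
        # Fall back to a placeholder if Heat gave no reason.
        reason = stack.get("stack_status_reason") or "no reason given"
        raise DeploymentError(f"Heat Stack {action} failed: {reason}")
```

With this, a timeout would surface as something like "Heat Stack create
failed: Create timed out" instead of the unqualified message.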