[openstack-dev] [TripleO/heat] openstack debug command

Steve Baker sbaker at redhat.com
Tue Dec 1 02:39:44 UTC 2015


On 01/12/15 15:39, Steve Baker wrote:
> On 01/12/15 10:28, Steven Hardy wrote:
>> On Tue, Dec 01, 2015 at 08:47:20AM +1300, Steve Baker wrote:
>>> On 30/11/15 23:21, Steven Hardy wrote:
>>>> On Mon, Nov 30, 2015 at 10:03:29AM +0100, Lennart Regebro wrote:
>>>>> I'm tasked to implement a command that shows error messages when a
>>>>> deployment has failed. I have a vague memory of having seen scripts
>>>>> that do something like this; if that exists, can somebody point me in
>>>>> the right direction?
>>>> I wrote a super simple script and put it in a blog post a while back:
>>>>
>>>> http://hardysteven.blogspot.co.uk/2015/05/tripleo-heat-templates-part-3-cluster.html 
>>>>
>>>>
>>>> All it does is find the failed SoftwareDeployment resources, then do
>>>> heat deployment-show on the resource, so you can see the stderr
>>>> associated with the failure.
>>>>
>>>> Having tripleoclient do that by default would be useful.
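The script's logic amounts to a filter over the stack's resource list followed by a deployment-show on each hit. A minimal Python sketch, where the dicts below are invented stand-ins for parsed `heat resource-list -n 5` and `heat deployment-show` output (field names follow heat's CLI):

```python
# Sample data standing in for parsed heat CLI output; the IDs and
# stderr text are illustrative only.
SAMPLE_RESOURCES = [
    {"resource_name": "ControllerDeployment",
     "resource_type": "OS::Heat::SoftwareDeployment",
     "resource_status": "CREATE_FAILED",
     "physical_resource_id": "dep-uuid-1"},
    {"resource_name": "Controller",
     "resource_type": "OS::Nova::Server",
     "resource_status": "CREATE_COMPLETE",
     "physical_resource_id": "server-uuid-1"},
]

# Stand-in for what `heat deployment-show <id>` would return.
SAMPLE_DEPLOYMENTS = {
    "dep-uuid-1": {"deploy_status_code": 1,
                   "deploy_stderr": "Error: yum install failed"},
}

def failed_deployments(resources):
    """Return the SoftwareDeployment resources in a FAILED state."""
    return [r for r in resources
            if r["resource_type"].endswith("SoftwareDeployment")
            and r["resource_status"].endswith("FAILED")]

# For each failed deployment, print what deployment-show would reveal.
for res in failed_deployments(SAMPLE_RESOURCES):
    dep = SAMPLE_DEPLOYMENTS[res["physical_resource_id"]]
    print(res["resource_name"], "exit code:", dep["deploy_status_code"])
    print(dep["deploy_stderr"])
```

A real implementation would page through nested stacks via the heat API rather than hard-coded samples.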
>>>>
>>>>> Any opinions on what that should do, specifically? Traverse failed
>>>>> resources to find error messages, I assume. Anything else?
>>>> Yeah, but I think for this to be useful, we need to go a bit deeper
>>>> than just showing the resource error - there are a number of typical
>>>> failure modes, and I end up repeating the same steps to debug every
>>>> time.
>>>>
>>>> 1. SoftwareDeployment failed (mentioned above).  Every time, you need
>>>> to see the name of the SoftwareDeployment which failed, figure out if
>>>> it failed on one or all of the servers, then look at the stderr for
>>>> clues.
>>>>
>>>> 2. A server failed to build (OS::Nova::Server resource is FAILED).
>>>> Here we need to check both nova and ironic, looking first to see if
>>>> ironic has the node(s) in the wrong state for scheduling (e.g. nova
>>>> gave us a "no valid host" error), and then, if they are OK in ironic,
>>>> do nova show on the failed host to see the reason nova gives for it
>>>> failing to go ACTIVE.
>>>>
>>>> 3. A stack timeout happened.  IIRC when this happens, we currently
>>>> fail with an obscure keystone-related backtrace due to the token
>>>> expiring.  We should instead catch this error and show the heat stack
>>>> status_reason, which should say clearly that the stack timed out.
>>>>
>>>> If we could just make these three cases really clear and easy to
>>>> debug, I think things would be much better (IME the above are a high
>>>> proportion of all failures), but I'm sure folks can come up with
>>>> other ideas to add to the list.
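The three failure modes above can be sketched as a simple dispatcher that maps a FAILED resource to a suggested next diagnostic step. This is only an illustration of the triage logic, not tripleoclient's actual behaviour; the field names mirror heat's resource attributes, and the hint strings are invented:

```python
def triage_hint(resource):
    """Suggest a next debugging step for a resource in a FAILED state."""
    rtype = resource["resource_type"]
    reason = resource.get("resource_status_reason", "")
    if "timed out" in reason.lower():
        # Case 3: surface the stack's status_reason directly instead of
        # letting a keystone token-expiry backtrace obscure it.
        return "stack timeout: " + reason
    if rtype.endswith("SoftwareDeployment"):
        # Case 1: identify which deployment failed, then inspect its
        # stderr via deployment-show.
        return ("run `heat deployment-show %s` and check deploy_stderr"
                % resource["physical_resource_id"])
    if rtype == "OS::Nova::Server":
        # Case 2: check ironic node states first (scheduling), then ask
        # nova why the instance failed to go ACTIVE.
        return ("check `ironic node-list` for bad node states, then "
                "`nova show %s`" % resource["physical_resource_id"])
    # Fallback: the status_reason is the best generic signal we have.
    return "status_reason: " + reason
```

A real command would walk nested stacks to collect the FAILED resources before applying something like this per resource.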
>>>>
>>> I'm actually drafting a spec which includes a command which does this.
>>> I hope to submit it soon, but here is the current state of that
>>> command's description:
>>>
>>> Diagnosing resources in a FAILED state
>>> --------------------------------------
>>>
>>> One command will be implemented:
>>> - openstack overcloud failed list
>>>
>>> This will print a yaml tree showing the hierarchy of nested stacks
>>> until it gets to the actual failed resource, then it will show
>>> information regarding the failure. For most resource types this
>>> information will be the status_reason, but for software-deployment
>>> resources the deploy_stdout, deploy_stderr and deploy_status_code
>>> will be printed.
>>>
>>> In addition to this stand-alone command, this output will also be
>>> printed when an ``openstack overcloud deploy`` or ``openstack
>>> overcloud update`` command results in a stack in a FAILED state.
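The proposed yaml tree could be built by recursing through nested stacks and keeping only the branches that end in a FAILED resource. A hedged sketch, where `STACKS` is invented sample data standing in for heat API queries and the output format is a guess at what the spec describes:

```python
# Sample stack -> resource-list mapping; a real implementation would
# query heat for each nested stack's resources.
STACKS = {
    "overcloud": [
        {"resource_name": "Controller", "resource_type": "OS::Heat::Stack",
         "resource_status": "CREATE_FAILED", "nested": "ctrl-stack"},
        {"resource_name": "Compute", "resource_type": "OS::Heat::Stack",
         "resource_status": "CREATE_COMPLETE", "nested": None},
    ],
    "ctrl-stack": [
        {"resource_name": "NetworkDeployment",
         "resource_type": "OS::Heat::SoftwareDeployment",
         "resource_status": "CREATE_FAILED", "nested": None,
         "deploy_stderr": "ifup: device br-ex not found"},
    ],
}

def failed_tree(stack):
    """Return a nested dict of FAILED resources, rooted at `stack`."""
    tree = {}
    for res in STACKS.get(stack, []):
        if not res["resource_status"].endswith("FAILED"):
            continue  # only failed branches appear in the tree
        if res["nested"]:
            tree[res["resource_name"]] = failed_tree(res["nested"])
        else:
            # Leaf: show failure detail (status_reason, or for
            # deployments the deploy_stdout/stderr/status_code).
            tree[res["resource_name"]] = res.get(
                "deploy_stderr", res["resource_status"])
    return tree

def dump(tree, indent=0):
    """Print the tree in a yaml-like indented form."""
    for name, value in tree.items():
        if isinstance(value, dict):
            print("%s%s:" % ("  " * indent, name))
            dump(value, indent + 1)
        else:
            print("%s%s: %s" % ("  " * indent, name, value))

dump(failed_tree("overcloud"))
```

Pruning COMPLETE branches keeps the output focused on the failure path, which is the point of the command.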
>> This sounds great!
> The spec is here.
I mean _here_

https://review.openstack.org/#/c/251587/


