Open Stack

Tue Oct 11 10:38:34 UTC 2016

On 10/11/2016 02:39 AM, Joe Talerico wrote:
> Hey all,
> Over the past couple of days I have been making comments on IRC to discuss some
> of the issues I have bumped into when scaling Newton to > 30 compute
> nodes.
>
> - `bulk import`, the operation to go from enroll -> manage can take
> 20-30 minutes to complete. Can we have this be a non-blocking
> operation with a message to the user that they cannot continue until
> the nodes they want to deploy on go from enroll->manage?

The only thing that enroll->manage does is check the power credentials. It
should never take more than 30-60 seconds (and even that is too much, and might
be a sign of problems with the environment). I suspect that the workflow
processes nodes sequentially, though, so these 30-60 seconds get multiplied by
the number of nodes. If so, the workflow definitely needs fixing.
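
As a rough sanity check of the sequential hypothesis (arithmetic only, not a
measurement): with about 30 nodes at 30-60 seconds each, one-at-a-time
processing would take 15-30 minutes, which is in line with the reported 20-30
minutes. The helper name below is made up for illustration and is not part of
any TripleO or Ironic tooling.

```sh
#!/bin/sh
# Hypothetical helper, for illustration only.
# Prints the total minutes needed when nodes are processed one at a time.
#   $1 = number of nodes
#   $2 = seconds spent on each node
sequential_minutes() {
    echo $(( $1 * $2 / 60 ))
}

sequential_minutes 30 30   # lower bound: prints 15
sequential_minutes 30 60   # upper bound: prints 30
```

If the checks ran concurrently instead, the total would be closer to the
slowest single node than to the sum over all nodes.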

> - overcloud deploy - when pxe completes I have seen a handful of
> nodes not reboot, or just get jammed up in the pxe screen. When this
> occurs I run:
> $ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F= '{print $2}'`; do
>     if [[ $(ping -c 1 $i | grep "100%") ]]; then
>       ironic node-set-power-state $(ironic node-list | grep $(nova list |
>         grep $i | awk '{print $2}') | awk '{print $2}') off
>     fi
>   done
> # (192 is the first octet)
> - Then -
> $ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F= '{print $2}'`; do
>     if [[ $(ping -c 1 $i | grep "100%") ]]; then
>       ironic node-set-power-state $(ironic node-list | grep $(nova list |
>         grep $i | awk '{print $2}') | awk '{print $2}') on
>     fi
>   done
>
> This typically fixes the deployment so things can continue. However, it
> would be great to have this type of logic added to OOO: if a
> node goes from BUILD->ACTIVE and isn't reachable within 120 seconds,
> ironic simply reboots the host.

Unfortunately, it's hard to define "reachable". Also, 120 seconds is way too
short for some servers; it can easily take them 5 minutes to boot.
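
To illustrate why both the probe and the timeout would have to be
configurable, here is a minimal sketch of such a wait loop. It assumes that a
single ICMP ping is an acceptable definition of "reachable", which may not
hold in every environment. `probe` and `wait_reachable` are hypothetical
names, not existing Ironic or TripleO commands.

```sh
#!/bin/sh
# Assumption: one successful ping means the host is "reachable".
# Replace this function with an SSH or service check if ping is not enough.
probe() {
    ping -c 1 "$1" >/dev/null 2>&1
}

# Poll a host until it answers or the timeout expires.
#   $1 = host address
#   $2 = timeout in seconds (e.g. 300 for slow-booting servers)
#   $3 = seconds between attempts (optional, defaults to 10, must be > 0)
# Returns 0 if the host became reachable, 1 if the timeout expired.
wait_reachable() {
    host=$1
    timeout=$2
    interval=${3:-10}
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if probe "$host"; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1
}

# Example (not executed here):
#   wait_reachable 192.0.2.10 300 10 || echo "still unreachable after 5 minutes"
```

Only after such a wait fails would a power cycle make sense, and the right
timeout still depends on the hardware.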

I would rather figure out why PXE gets stuck in your environment. Maybe you need
a firmware update.

>
> Also, I suggest that if the second attempt fails, we reschedule the host --
> sometimes I have seen a RAID controller or something similar go bad,
> which is out of our control.

We do have reschedule in place, but I suspect the current Ironic timeout (1 
hour?) is too large for Nova.

>
> Thanks for listening!
> rook
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>


[openstack-dev] [tripleo] Suggestions for OOO
