[openstack-dev] [tripleo] Suggestions for OOO

Joe Talerico jtaleric at redhat.com
Tue Oct 11 00:39:13 UTC 2016


Hey all,
Over the past couple of days I have been making comments on IRC about
some of the issues I have bumped into when scaling Newton to more than
30 compute nodes.

- `bulk import`: the operation to go from enroll -> manage can take
20-30 minutes to complete. Can we make this a non-blocking operation,
with a message telling the user that they cannot continue until the
nodes they want to deploy on have gone from enroll -> manage?
- overcloud deploy: when PXE completes I have seen a handful of nodes
fail to reboot, or just get stuck at the PXE screen. When this occurs
I run:
$ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F= '{print $2}'`; do
    if [[ $(ping -c 1 $i | grep "100%") ]]; then
      ironic node-set-power-state $(ironic node-list | grep $(nova list | grep $i | awk '{print $2}') | awk '{print $2}') off
    fi
  done
# (192 is the first octet of the nodes' IP addresses)
- Then -
$ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F= '{print $2}'`; do
    if [[ $(ping -c 1 $i | grep "100%") ]]; then
      ironic node-set-power-state $(ironic node-list | grep $(nova list | grep $i | awk '{print $2}') | awk '{print $2}') on
    fi
  done

This typically fixes the deployment so things can continue. However, it
would be great to have this type of logic added to OOO: if a node goes
from BUILD->ACTIVE and isn't reachable within 120 seconds, ironic
simply reboots the host.
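
To make the idea concrete, here is a rough sketch of that check as a
script run from the undercloud. It is only an illustration under stated
assumptions: the function names, TIMEOUT and INTERVAL are made up for
this example and are not part of TripleO or ironic, and it reuses the
same `ironic node-set-power-state` off/on calls as the manual loops.

```shell
#!/usr/bin/env bash
# Sketch only: function names, TIMEOUT and INTERVAL are illustrative.

TIMEOUT=${TIMEOUT:-120}    # seconds to wait before power-cycling the node
INTERVAL=${INTERVAL:-5}    # seconds between ping attempts

# Succeeds if the host answers a single ping.
is_reachable() {
  ping -c 1 "$1" >/dev/null 2>&1
}

# Polls until the host answers or roughly TIMEOUT seconds of waiting have passed.
wait_reachable() {
  local ip=$1 waited=0
  while (( waited < TIMEOUT )); do
    is_reachable "$ip" && return 0
    sleep "$INTERVAL"
    waited=$(( waited + INTERVAL ))
  done
  return 1
}

# Power-cycles a node with the same ironic calls used in the manual loops.
power_cycle() {
  ironic node-set-power-state "$1" off
  sleep "$INTERVAL"   # assumption: a short pause lets the power-off settle
  ironic node-set-power-state "$1" on
}

# watch_node IP NODE_UUID: power-cycle the node if it never becomes reachable.
watch_node() {
  wait_reachable "$1" || power_cycle "$2"
}
```

Usage would be something like `watch_node 192.0.2.10 <node-uuid>` once
the instance goes ACTIVE, with the IP and node UUID looked up from
`nova list` and `ironic node-list` as in the loops.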

Also, I suggest that if the second attempt fails, we reschedule the
host. Sometimes I have seen a RAID controller or something similar go
bad, which is out of our control.
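
A rough sketch of that retry-then-reschedule flow is below. Everything
in it is a placeholder for illustration: MAX_REBOOTS, WAIT_SECS and the
function names are assumptions, and `reschedule_node` only reports the
node, since how a reschedule would actually be triggered is still an
open question.

```shell
#!/usr/bin/env bash
# Sketch only: every name here is a placeholder, not a TripleO or ironic API.

MAX_REBOOTS=${MAX_REBOOTS:-2}   # reboot attempts before giving up on the node
WAIT_SECS=${WAIT_SECS:-120}     # time allowed for the node to come up after a reboot

# Succeeds if the host answers a single ping.
node_is_up() { ping -c 1 "$1" >/dev/null 2>&1; }

# Power-cycles a node with the ironic calls used in the manual loops.
reboot_node() {
  ironic node-set-power-state "$1" off
  sleep 5   # assumption: brief pause so the power-off can settle
  ironic node-set-power-state "$1" on
}

# Placeholder hook: only reports the node that needs rescheduling.
reschedule_node() {
  echo "node $1 unreachable after $MAX_REBOOTS reboots; reschedule needed" >&2
}

# recover_node IP NODE_UUID: reboot up to MAX_REBOOTS times, then hand off.
recover_node() {
  local ip=$1 uuid=$2 attempt
  for (( attempt = 1; attempt <= MAX_REBOOTS; attempt++ )); do
    node_is_up "$ip" && return 0
    reboot_node "$uuid"
    sleep "$WAIT_SECS"
  done
  node_is_up "$ip" && return 0
  reschedule_node "$uuid"
  return 1
}
```

Capping the reboots keeps a node with failed hardware from being
power-cycled forever, and the non-zero return lets a caller decide what
to do with it.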

Thanks for listening!
rook


