[openstack-dev] [tripleo] Suggestions for OOO

Joe Talerico jtaleric at redhat.com
Tue Oct 11 11:30:04 UTC 2016


On Tue, Oct 11, 2016 at 6:38 AM, Dmitry Tantsur <dtantsur at redhat.com> wrote:
> On 10/11/2016 02:39 AM, Joe Talerico wrote:
>>
>> Hey all,
>> The past couple of days I have been making comments on IRC to discuss some
>> of the issues I have bumped into when scaling Newton to > 30 compute
>> nodes.
>>
>> - `bulk import`, the operation to go from enroll -> manage can take
>> 20-30 minutes to complete. Can we have this be a non-blocking
>> operation with a message to the user that they cannot continue until
>> the nodes they want to deploy on go from enroll->manage?
>
>
> The only thing that enroll->manage does is to check the power credentials.
> It should never take more than 30-60 seconds (and even this is too much, and
> might be a sign of problems with the environment). I suspect that the
> workflow processes nodes sequentially, though, hence these 30-60 seconds
> multiply by the number of nodes. If so, the workflow definitely needs
> fixing.

Yeah, it seems to be sequential, and I did have 2 nodes that failed
to go from enroll->manage, which could slow things down even more.
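
To illustrate the non-sequential behaviour being asked for, here is a
minimal sketch in plain shell. It is not how the TripleO workflow is
implemented; run_per_node_parallel and manage_one are hypothetical names,
and the only real command referenced is the old ironic CLI's
node-set-provision-state. The point is that total time becomes roughly
that of the slowest node rather than the sum over all nodes, and that
failed nodes are reported instead of stalling the rest.

```shell
# Sketch only: run one command per node in parallel instead of one at a time.
# Usage: printf '%s\n' NODE... | run_per_node_parallel COMMAND
# COMMAND is invoked as "COMMAND NODE". Nodes whose command failed are
# printed one per line, and the function returns 1 if any node failed.
run_per_node_parallel() {
    cmd=$1
    results=$(mktemp -d) || return 2
    while IFS= read -r node; do
        [ -n "$node" ] || continue
        # One background subshell per node; an empty marker file records a failure.
        ( "$cmd" "$node" >/dev/null 2>&1 || : > "$results/$node" ) &
    done
    wait    # block until every background job has finished
    failed=$(ls "$results")
    rm -rf "$results"
    [ -z "$failed" ] && return 0
    printf '%s\n' "$failed"
    return 1
}

# Assumed usage against a real undercloud (untested here):
#   manage_one() { ironic node-set-provision-state "$1" manage; }
#   ironic node-list | awk '/enroll/ {print $2}' | run_per_node_parallel manage_one
```

With 30+ nodes it would probably make sense to cap the number of
concurrent jobs so the BMCs and ironic-conductor are not hit all at once;
that is left out to keep the sketch short.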

>
>> - overcloud deploy - when pxe completes I have seen a handful of
>> nodes not reboot, or just get jammed up in the pxe screen. When this
>> occurs I run:
>> $ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F=
>> '{print $2}'`; do if [[ $(ping -c 1 $i | grep "100%") ]]; then ironic
>> node-set-power-state $(ironic node-list | grep $(nova list | grep $i |
>> awk '{print $2}') | awk '{print $2}') off ; fi; done
>> # (192 is the first octet)
>> - Then -
>> $ for i in `nova list | grep -i 192 | awk '{print $12}' | awk -F=
>> '{print $2}'`; do if [[ $(ping -c 1 $i | grep "100%") ]]; then ironic
>> node-set-power-state $(ironic node-list | grep $(nova list | grep $i |
>> awk '{print $2}') | awk '{print $2}') on ; fi; done
>>
>> This typically fixes the deployment so things can continue, however it
>> would be great to have this type of logic added to OOO, where if a
>> node goes from BUILD->ACTIVE, if it isn't reachable in 120 seconds,
>> ironic simply reboots the host.
>
>
> Unfortunately, it's hard to define "reachable". Also, 120 seconds is way too
> little for some servers; it can well take them 5 minutes to boot.

Sure, 120 was just a shot in the dark to start the conversation; we
need to establish some sort of timeout.
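
A rough sketch of the timeout idea, with both the timeout and the meaning
of "reachable" left configurable, since neither is settled in this thread.
The function names (wait_reachable, power_cycle_if_unreachable, ping_probe,
ironic_power) are hypothetical; the only real commands used are ping and
the ironic node-set-power-state call already shown above. This is an
illustration of the logic, not a proposal for where it should live.

```shell
# Sketch only: poll a pluggable "reachable" check until a timeout expires.
# Usage: wait_reachable PROBE HOST TIMEOUT_SECONDS INTERVAL_SECONDS
# PROBE is invoked as "PROBE HOST" and must return 0 once the host is up.
wait_reachable() {
    probe=$1 host=$2 timeout=$3 interval=$4
    waited=0
    while ! "$probe" "$host"; do
        # Give up once the accumulated sleep time reaches the timeout.
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Usage: power_cycle_if_unreachable PROBE HOST NODE TIMEOUT INTERVAL POWER
# POWER is invoked as "POWER NODE off" and then "POWER NODE on".
power_cycle_if_unreachable() {
    node=$3 power=$6
    wait_reachable "$1" "$2" "$4" "$5" && return 0
    "$power" "$node" off && "$power" "$node" on
}

# One possible definition of "reachable": a single ping gets an answer.
ping_probe() { ping -c 1 "$1" >/dev/null 2>&1; }

# Power control through the old ironic CLI used earlier in this thread.
ironic_power() { ironic node-set-power-state "$1" "$2"; }

# Assumed usage with a 5 minute timeout (IP and node UUID are placeholders):
#   power_cycle_if_unreachable ping_probe 192.0.2.10 NODE_UUID 300 10 ironic_power
```

Passing the probe in as a parameter keeps the definition of "reachable"
open: ping could be swapped for an SSH or API check without touching the
timeout logic.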

>
> I would rather figure out why PXE gets stuck on your environment. Maybe you
> need a firmware update.

The issue is that the behavior is inconsistent and shows up across
multiple platforms. I have seen this on Dell, HP and Supermicro, and
when one deployment fails, re-trying the same deployment works.

>
>>
>> Also, I suggest if the second attempt fails, reschedule the host --
>> sometimes I have seen where a raid controller or something goes bad
>> out of our control.
>
>
> We do have reschedule in place, but I suspect the current Ironic timeout (1
> hour?) is too large for Nova.

Possibly? I didn't think it would reschedule if the node goes from
build->active. For example, the user is going through an install, the
PXE went through, the node reboots, and now the raid battery is dead or
something with the raid controller went fubar. The user is afk and
doesn't see this, so his deployment fails.

I am not suggesting we handle every corner case, but the situations
above happen too often to ignore.


>
>>
>> Thanks for listening!
>> rook
>>
>> __________________________________________________________________________
>> OpenStack Development Mailing List (not for usage questions)
>> Unsubscribe: OpenStack-dev-request at lists.openstack.org?subject:unsubscribe
>> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>>