[openstack-dev] [tripleo][ironic] Hardware provisioning testing for Ocata

Justin Kilpatrick jkilpatr at redhat.com
Fri Jun 9 11:28:01 UTC 2017


On Fri, Jun 9, 2017 at 5:25 AM, Dmitry Tantsur <dtantsur at redhat.com> wrote:
> This number of "300", does it come from your testing or from other sources?
> If the former, which driver were you using? What exactly problems have you
> seen approaching this number?

I haven't encountered this issue personally, but from talking to Joe
Talerico and some operators at summit, around this number a single
conductor begins to fall behind on polling the out-of-band
interfaces of the machines it's responsible for. You start to
see what you would expect from polling running behind: incorrect
power states listed for machines and a general inability to perform
machine operations in a timely manner.

Having spent some time at the Ironic operators forum, I gather this is
pretty normal and the correct response is just to scale out conductors.
This is a problem for TripleO because we don't really have a scale-out
option with a single-machine design. Fortunately, just increasing the
time between interface polls acts as a pretty good stopgap and lets
Ironic catch up.
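
To make the reasoning behind that stopgap concrete, here is a rough
back-of-envelope sketch. It is a toy model, not how the conductor is
actually implemented: the per-node poll time, the worker count, and the
headroom factor are all made-up assumptions, and the function names are
hypothetical. The Ironic setting it relates to is
sync_power_state_interval in the [conductor] section of ironic.conf.

```python
import math


def sweep_seconds(nodes, seconds_per_node, workers=1):
    """Estimate one full polling pass over all nodes.

    Assumes (hypothetically) that nodes are split evenly across
    `workers` and that each out-of-band query takes the same time.
    """
    return math.ceil(nodes / workers) * seconds_per_node


def min_interval(nodes, seconds_per_node, workers=1, headroom=1.5):
    """Smallest polling interval that still leaves some slack.

    `headroom` is an arbitrary safety factor, not an Ironic setting.
    """
    return sweep_seconds(nodes, seconds_per_node, workers) * headroom


if __name__ == "__main__":
    interval = 60  # assumed polling interval in seconds
    for nodes in (100, 300, 600):
        sweep = sweep_seconds(nodes, seconds_per_node=0.5)
        status = "falls behind" if sweep > interval else "keeps up"
        print(f"{nodes} nodes: {sweep:.0f}s per pass, {status}")
```

Under this model the conductor falls behind as soon as one pass takes
longer than the interval, so raising the interval buys time without
fixing the underlying lack of scale-out.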

I may get some time on a cloud of that scale in the future, at which
point I will have hard numbers to give you. One of the reasons I made
YODA was the frustrating prevalence of anecdotes instead of hard data
when it came to one of the most important parts of the user
experience. If it doesn't deploy, people don't use it, full stop.

> Could you please elaborate? (a bug could also help). What exactly were you
> doing?

https://bugs.launchpad.net/ironic/+bug/1680725

Describes exactly what I'm experiencing. Essentially the problem is
that nodes can and do fail to PXE boot, then cleaning fails and you
just lose the nodes. Users have to spend time going back and
babysitting these nodes, and there are no good instructions on what to
do with failed nodes anyway. The answer is to move them to manageable
and then to available, at which point they go back into cleaning, and
to repeat that until it finally works.
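
The manual loop described above could be scripted roughly like this. It
is a minimal sketch, not a tested tool: it assumes the `openstack
baremetal node` CLI commands (show, manage, provide) are available,
that the node starts out in "clean failed" or "manageable", and the
helper names, retry count, and timeouts are all hypothetical.

```python
import subprocess
import time


def _openstack(*args):
    """Run an `openstack` CLI command and return its stripped stdout."""
    result = subprocess.run(["openstack", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def provision_state(node, run=_openstack):
    """Return the node's current provision state as a string."""
    return run("baremetal", "node", "show", node,
               "-f", "value", "-c", "provision_state")


def wait_for_state(node, targets, run=_openstack, poll=30,
                   timeout=3600, sleep=time.sleep):
    """Poll until the node reaches one of `targets` or time runs out."""
    waited = 0
    while True:
        state = provision_state(node, run)
        if state in targets:
            return state
        if waited >= timeout:
            raise TimeoutError(f"{node} stuck in state {state!r}")
        sleep(poll)
        waited += poll


def recover_node(node, attempts=3, run=_openstack, sleep=time.sleep):
    """Cycle a node through manageable -> available until cleaning works.

    Returns True if the node ends up available, False if cleaning
    still fails after `attempts` tries.
    """
    state = provision_state(node, run)
    for _ in range(attempts):
        if state == "available":
            return True
        if state == "clean failed":
            # A failed node has to go back to manageable first.
            run("baremetal", "node", "manage", node)
            wait_for_state(node, {"manageable"}, run, sleep=sleep)
        # Providing the node triggers cleaning again.
        run("baremetal", "node", "provide", node)
        state = wait_for_state(node, {"available", "clean failed"},
                               run, sleep=sleep)
    return state == "available"
```

Passing `run` and `sleep` in as parameters is only there so the loop
can be exercised against a fake without a real cloud. It does nothing
about why the node failed to PXE boot in the first place.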

Like introspection a year ago, this is a cavalcade of documentation
problems and software issues. I mean, really everything *works*
technically, but the documentation acts like cleaning will work all the
time and so does the software, leaving the user to figure out how to
accommodate the realities of the situation without so much as a
warning that it might happen.

This comes out as more of a UX issue than a software one, but we can't
just ignore these.

- Justin


