[openstack-dev] [tripleo][ironic] Hardware provisioning testing for Ocata
mark at stackhpc.com
Fri Jun 9 14:16:32 UTC 2017
This is great information, Justin, thanks for sharing. It will prove useful
as we scale up our ironic deployments.
It seems to me that a reference configuration of ironic would be a useful
resource for many people. Some key decisions may at first seem arbitrary
but have a real impact on performance and scalability, such as:
- BIOS vs. UEFI
- PXE vs. iPXE bootloader
- TFTP vs. HTTP for kernel/ramdisk transfer
- iSCSI vs. Swift (or one day standalone HTTP?) for image transfer
- Hardware specific drivers vs. IPMI
- Local boot vs. netboot
- Fat images vs. slim + post-configuration
- Any particularly useful configuration tunables (power state polling
interval, nova build concurrency, others?)
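As a rough illustration of the tunables in the last item, here is a minimal
Python sketch that renders them as config fragments. The option names
([conductor] sync_power_state_interval in ironic.conf and [DEFAULT]
max_concurrent_builds in nova.conf) are the ones I believe correspond to
"power state polling interval" and "nova build concurrency". The values are
placeholders for illustration only, not recommendations, and render() is a
hypothetical helper.

```python
import configparser
import io

# Placeholder values, chosen only to show the shape of the config.
IRONIC_TUNABLES = {"conductor": {"sync_power_state_interval": "120"}}
NOVA_TUNABLES = {"DEFAULT": {"max_concurrent_builds": "5"}}


def render(tunables):
    """Render a {section: {option: value}} dict as INI text."""
    parser = configparser.ConfigParser()
    parser.read_dict(tunables)
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print("# ironic.conf fragment")
    print(render(IRONIC_TUNABLES))
    print("# nova.conf fragment")
    print(render(NOVA_TUNABLES))
```

A reference configuration would presumably record values like these next to
the node count they were validated at.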
I personally use kolla + kolla-ansible, which by default uses PXE + TFTP +
iSCSI. That is arguably not the best combination.
On 9 June 2017 at 12:28, Justin Kilpatrick <jkilpatr at redhat.com> wrote:
> On Fri, Jun 9, 2017 at 5:25 AM, Dmitry Tantsur <dtantsur at redhat.com>
> > This number of "300", does it come from your testing or from other
> > If the former, which driver were you using? What exactly problems have
> > seen approaching this number?
> I haven't encountered this issue personally, but from talking to Joe
> Talerico and some operators at summit, around this number a single
> conductor begins to fall behind polling all of the out of band
> interfaces for the machines that it's responsible for. You start to
> see what you would expect from polling running behind, like incorrect
> power states listed for machines and a general inability to perform
> machine operations in a timely manner.
> Having spent some time at the Ironic operators forum, this is pretty
> normal and the correct response is just to scale out conductors. This
> is a problem with TripleO because we don't really have a scale out
> option with a single machine design. Fortunately just increasing the
> time between interface polling acts as a pretty good stopgap for this
> and lets Ironic catch up.
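To put rough numbers on the polling problem described above, here is a
back-of-envelope sketch in Python. It assumes a conductor checks the power
state of its nodes in batches of `workers` at a time (1 means strictly
sequential), and that each out of band check takes a fixed time. The 0.5
second check time is an invented placeholder, not a measurement, and all
function names are hypothetical.

```python
import math


def sweep_seconds(nodes, seconds_per_check, workers=1):
    """Time for one full power state pass over all nodes."""
    return math.ceil(nodes / workers) * seconds_per_check


def keeps_up(nodes, seconds_per_check, interval, workers=1):
    """True if a full pass finishes before the next one is due."""
    return sweep_seconds(nodes, seconds_per_check, workers) <= interval


def min_interval(nodes, seconds_per_check, workers=1, headroom=1.5):
    """Smallest polling interval that leaves some slack (assumed 1.5x)."""
    return sweep_seconds(nodes, seconds_per_check, workers) * headroom


if __name__ == "__main__":
    # Assumed figures: 300 nodes, 0.5 s per BMC check, 60 s interval.
    print(sweep_seconds(300, 0.5))   # one pass takes 150.0 s
    print(keeps_up(300, 0.5, 60))    # False: polling falls behind
    print(min_interval(300, 0.5))    # 225.0 s would leave headroom
```

Under those assumptions a single pass takes longer than the interval, which
matches the symptom of stale power states, and raising the interval buys
the conductor time to finish each pass.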
> I may get some time on a cloud of that scale in the future, at which
> point I will have hard numbers to give you. One of the reasons I made
> YODA was the frustrating prevalence of anecdotes instead of hard data
> when it came to one of the most important parts of the user
> experience. If it doesn't deploy people don't use it, full stop.
> > Could you please elaborate? (a bug could also help). What exactly were
> > doing?
> Describes exactly what I'm experiencing. Essentially the problem is
> that nodes can and do fail to PXE boot, then cleaning fails and you just
> lose the nodes. Users have to spend time going back and babysitting
> these nodes and there are no good instructions on what to do with failed
> nodes anyways. The answer is to move them to manageable and then to
> available at which point they go back into cleaning until it finally
> Like introspection was a year ago, this is a cavalcade of documentation
> problems and software issues. I mean really everything *works*
> technically but the documentation acts like cleaning will work all the
> time and so does the software, leaving the user to figure out how to
> accommodate the realities of the situation without so much as a
> warning that it might happen.
> This comes out as more of a UX issue than a software one, but we can't
> just ignore these.
> - Justin