<div dir="ltr">This is great information Justin, thanks for sharing. It will prove useful as we scale up our ironic deployments.<div><br></div><div>It seems to me that a reference configuration of ironic would be a useful resource for many people. Some key decisions affecting scalability and performance may at first seem arbitrary but have an impact on performance and scalability, such as:</div><div><br></div><div>- BIOS vs. UEFI</div><div>- PXE vs. iPXE bootloader</div><div>- TFTP vs. HTTP for kernel/ramdisk transfer</div><div>- iSCSI vs. Swift (or one day standalone HTTP?) for image transfer</div><div>- Hardware specific drivers vs. IPMI</div><div>- Local boot vs. netboot</div><div>- Fat images vs. slim + post-configuration</div><div>- Any particularly useful configuration tunables (power state polling interval, nova build concurrency, others?)</div><div><br></div><div>I personally use kolla + kolla-ansible which by default uses PXE + TFTP + iSCSI which is arguably not the best combination.</div><div><br></div><div>Cheers,</div><div>Mark</div></div><div class="gmail_extra"><br><div class="gmail_quote">On 9 June 2017 at 12:28, Justin Kilpatrick <span dir="ltr"><<a href="mailto:jkilpatr@redhat.com" target="_blank">jkilpatr@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Fri, Jun 9, 2017 at 5:25 AM, Dmitry Tantsur <<a href="mailto:dtantsur@redhat.com">dtantsur@redhat.com</a>> wrote:<br>
> This number of "300", does it come from your testing or from other sources?<br>
> If the former, which driver were you using? What exactly problems have you<br>
> seen approaching this number?<br>

I haven't encountered this issue personally, but from talking to Joe
Talerico and some operators at summit, around this number a single
conductor begins to fall behind polling all of the out-of-band
interfaces for the machines it's responsible for. You start to see
what you would expect from polling running behind: incorrect power
states listed for machines and a general inability to perform machine
operations in a timely manner.

Having spent some time at the Ironic operators forum, this is pretty
normal and the correct response is simply to scale out conductors.
That is a problem for TripleO because we don't really have a scale-out
option with a single-machine design. Fortunately, just increasing the
time between interface polls acts as a pretty good stopgap and lets
Ironic catch up.
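If I understand correctly, the knob in question is the conductor's
power state sync interval; a minimal sketch, assuming ironic.conf on
each conductor and a purely illustrative value:

    # ironic.conf on each conductor
    [conductor]
    # Seconds between out-of-band power state polls of each node
    # (default 60). Raising it reduces BMC polling load, at the cost
    # of slower detection of power changes made behind Ironic's back.
    sync_power_state_interval = 180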

I may get some time on a cloud of that scale in the future, at which
point I will have hard numbers to give you. One of the reasons I made
YODA was the frustrating prevalence of anecdotes instead of hard data
when it came to one of the most important parts of the user
experience. If it doesn't deploy, people don't use it, full stop.
<span class=""><br>
> Could you please elaborate? (a bug could also help). What exactly were you<br>
> doing?<br>
<br>
</span><a href="https://bugs.launchpad.net/ironic/+bug/1680725" rel="noreferrer" target="_blank">https://bugs.launchpad.net/<wbr>ironic/+bug/1680725</a><br>

That bug describes exactly what I'm experiencing. Essentially the
problem is that nodes can and do fail to PXE boot, then cleaning fails
and you just lose the nodes. Users have to spend time going back and
babysitting these nodes, and there are no good instructions on what to
do with failed nodes anyway. The answer is to move them to manageable
and then to available, at which point they go back into cleaning until
it finally works (roughly the sequence sketched below).
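For anyone hitting this, a minimal sketch of that recovery sequence,
assuming the openstack baremetal CLI (python-ironicclient) with <node>
as a placeholder for the node name or UUID:

    # If the node was also put into maintenance, clear that first:
    # openstack baremetal node maintenance unset <node>
    openstack baremetal node manage <node>    # clean failed -> manageable
    openstack baremetal node provide <node>   # manageable -> cleaning -> available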

Like introspection was a year ago, this is a cavalcade of
documentation problems and software issues. Really, everything
technically *works*, but the documentation acts as if cleaning will
succeed every time, and so does the software, leaving the user to
figure out how to accommodate the realities of the situation without
so much as a warning that it might happen.

This comes out as more of a UX issue than a software one, but we can't
just ignore these issues.
<div class="HOEnZb"><div class="h5"><br>
- Justin<br>
<br>