<div dir="ltr">This is great information, Justin, thanks for sharing. It will prove useful as we scale up our ironic deployments.<div><br></div><div>It seems to me that a reference configuration of ironic would be a useful resource for many people. Some key decisions may at first seem arbitrary but have a real impact on performance and scalability, such as:</div><div><br></div><div>- BIOS vs. UEFI</div><div>- PXE vs. iPXE bootloader</div><div>- TFTP vs. HTTP for kernel/ramdisk transfer</div><div>- iSCSI vs. Swift (or one day standalone HTTP?) for image transfer</div><div>- Hardware-specific drivers vs. IPMI</div><div>- Local boot vs. netboot</div><div>- Fat images vs. slim + post-configuration</div><div>- Any particularly useful configuration tunables (power state polling interval, nova build concurrency, others?)</div><div><br></div><div>I personally use kolla + kolla-ansible, which by default uses PXE + TFTP + iSCSI, arguably not the best combination.</div><div><br></div><div>Cheers,</div><div>Mark</div></div><div class="gmail_extra"><br><div class="gmail_quote">On 9 June 2017 at 12:28, Justin Kilpatrick <span dir="ltr"><<a href="mailto:jkilpatr@redhat.com" target="_blank">jkilpatr@redhat.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Fri, Jun 9, 2017 at 5:25 AM, Dmitry Tantsur <<a href="mailto:dtantsur@redhat.com">dtantsur@redhat.com</a>> wrote:<br>

> This number of "300", does it come from your testing or from other sources?<br>

> If the former, which driver were you using? What problems exactly have you<br>

> seen approaching this number?<br>

<br>

</span>I haven't encountered this issue personally, but from talking to Joe<br>

Talerico and some operators at summit, around this number a single<br>

conductor begins to fall behind polling all of the out of band<br>

interfaces for the machines that it's responsible for. You start to<br>

see what you would expect from polling running behind, like incorrect<br>

power states listed for machines and a general inability to perform<br>

machine operations in a timely manner.<br>

<br>

Having spent some time at the Ironic operators forum, this is pretty<br>

normal and the correct response is just to scale out conductors. This<br>

is a problem with TripleO because we don't really have a scale-out<br>

option with a single-machine design. Fortunately, just increasing the<br>

time between interface polling acts as a pretty good stopgap for this<br>

and lets Ironic catch up.<br>
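
A rough way to reason about this stopgap: if a conductor polls its nodes<br>
more or less one after another, a full sweep takes about nodes x seconds<br>
per poll, and the polling interval needs to be at least that long. The<br>
sketch below is only a back-of-the-envelope model. The function names, the<br>
0.5 s per-poll figure, and the sequential-polling assumption are all<br>
illustrative, not measurements. To my knowledge the relevant tunable is<br>
sync_power_state_interval in the [conductor] section of ironic.conf.<br>

<pre>
```python
import math


def min_poll_interval(nodes, seconds_per_poll, workers=1):
    """Estimate how long one full power-state sweep takes, in seconds.

    Assumes each worker polls its share of nodes strictly one at a time,
    which is a simplification of what a real conductor does.
    """
    nodes_per_worker = math.ceil(nodes / workers)
    return nodes_per_worker * seconds_per_poll


def keeps_up(nodes, seconds_per_poll, interval, workers=1):
    """True if a sweep finishes before the next one is due to start."""
    return interval >= min_poll_interval(nodes, seconds_per_poll, workers)


if __name__ == "__main__":
    # Hypothetical figures: 300 nodes, 0.5 s per out-of-band poll.
    for interval in (60, 120, 180):
        print(interval, keeps_up(300, 0.5, interval))
```
</pre>

Under these made-up numbers a 60 second interval falls behind and anything<br>
from 150 seconds up keeps pace, which matches the anecdote in spirit only.<br>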

<br>

I may get some time on a cloud of that scale in the future, at which<br>

point I will have hard numbers to give you. One of the reasons I made<br>

YODA was the frustrating prevalence of anecdotes instead of hard data<br>

when it came to one of the most important parts of the user<br>

experience. If it doesn't deploy, people don't use it, full stop.<br>

<span class=""><br>

> Could you please elaborate? (a bug could also help). What exactly were you<br>

> doing?<br>

<br>

</span><a href="https://bugs.launchpad.net/ironic/+bug/1680725" rel="noreferrer" target="_blank">https://bugs.launchpad.net/<wbr>ironic/+bug/1680725</a><br>

<br>

Describes exactly what I'm experiencing. Essentially the problem is<br>

that nodes can and do fail to PXE boot, then cleaning fails and you just<br>

lose the nodes. Users have to spend time going back and babysitting<br>

these nodes and there are no good instructions on what to do with failed<br>

nodes anyway. The answer is to move them to manageable and then to<br>

available at which point they go back into cleaning until it finally<br>

works.<br>
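
The babysitting loop described above could be sketched roughly as follows.<br>
This is a minimal illustration, not a supported tool: recover_node, its<br>
parameters, and the retry limits are hypothetical, and the caller must<br>
supply get_state (for example, something that reads the node's<br>
provision_state). The two CLI verbs are the standard "openstack baremetal<br>
node manage" and "openstack baremetal node provide".<br>

<pre>
```python
import subprocess
import time

# Provision states at which one cleaning attempt is finished.
TERMINAL_STATES = ("available", "clean failed")


def recovery_commands(node):
    """CLI calls that move a node to manageable, then back toward available."""
    return [
        ["openstack", "baremetal", "node", "manage", node],
        ["openstack", "baremetal", "node", "provide", node],
    ]


def recover_node(node, get_state, run=subprocess.check_call,
                 max_attempts=3, max_polls=60, poll_seconds=30,
                 wait=time.sleep):
    """Retry manage + provide until cleaning succeeds.

    get_state is a caller-supplied function returning the node's current
    provision state as a string. Returns the attempt number that worked.
    """
    for attempt in range(1, max_attempts + 1):
        for cmd in recovery_commands(node):
            run(cmd)
        state = None
        for _ in range(max_polls):
            state = get_state(node)
            if state in TERMINAL_STATES:
                break
            wait(poll_seconds)
        if state == "available":
            return attempt
    raise RuntimeError("node %s still not available after %d attempts"
                       % (node, max_attempts))
```
</pre>

Passing run and wait in as parameters keeps the sketch testable without a<br>
real cloud. It does not address why PXE or cleaning failed in the first place.<br>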

<br>

Like introspection was a year ago, this is a cavalcade of documentation<br>

problems and software issues. I mean, really everything *works*<br>

technically, but the documentation acts like cleaning will work all the<br>

time and so does the software, leaving the user to figure out how to<br>

accommodate the realities of the situation without so much as a<br>

warning that it might happen.<br>

<br>

This comes out as more of a UX issue than a software one, but we can't<br>

just ignore these.<br>

<div class="HOEnZb"><div class="h5"><br>

- Justin<br>

<br>


</div></div></blockquote></div><br></div>