<html><head></head><body><div class="ydpf5cbead1yahoo-style-wrap" style="font-family:times new roman, new york, times, serif;font-size:16px;"><div></div>

        <div dir="ltr" data-setdir="false">Hi Arne,</div><div dir="ltr" data-setdir="false"><br></div><div dir="ltr" data-setdir="false">Thanks for responding.</div><div dir="ltr" data-setdir="false">Yes, it is definitely an issue with the hash ring.</div><div dir="ltr" data-setdir="false"><br></div><div dir="ltr" data-setdir="false">With Queens:</div><div dir="ltr" data-setdir="false">With 3 NCs and 3 ICs we are relatively stable.</div><div dir="ltr" data-setdir="false"><br></div><div dir="ltr" data-setdir="false">With 6 NCs/6 ICs, it becomes pretty much unusable. There seems to be a race condition where 2 NCs</div><div dir="ltr" data-setdir="false">are competing with each other to get hold of the provisioning. A few transitions are</div><div dir="ltr" data-setdir="false">handled by one IC, then when another NC takes over some transitions are handled by its IC.</div><div dir="ltr" data-setdir="false">So we end up in scenarios where the image download happens on one IC but, due to the competing</div><div dir="ltr" data-setdir="false">NCs, another IC is entrusted with doing the iSCSI transfer down to the node. And the provisioning </div><div dir="ltr" data-setdir="false">fails because the image cannot be found.</div><div dir="ltr" data-setdir="false"><br></div><div dir="ltr" data-setdir="false">Appreciate your quick response.</div><div dir="ltr" data-setdir="false"><br></div><div dir="ltr" data-setdir="false">Regards,</div><div dir="ltr" data-setdir="false">Fred.</div><div><br></div>

        

        </div><div id="ydp1bda8e09yahoo_quoted_8047352530" class="ydp1bda8e09yahoo_quoted">

            <div style="font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;font-size:13px;color:#26282a;">

                

                <div>

                    On Wednesday, April 22, 2020, 12:41:44 AM PDT, Arne Wiebalck &lt;arne.wiebalck@cern.ch&gt; wrote:

                </div>

                <div><br></div>

                <div><br></div>

                <div><div dir="ltr">Hi Fred,<br clear="none"><br clear="none">For quite a while we ran with 3 ICs and 1 NC to manage ~5000 nodes.<br clear="none"><br clear="none">Since this brings some scaling issues with resource tracking, we have<br clear="none">started to split things into conductor groups. Currently, we are at 6<br clear="none">ICs and 3 NCs, but the plan is to have 10 ICs with 10 NCs managing<br clear="none">groups of ~500 nodes.<br clear="none"><br clear="none">The ICs and the NCs will basically be mapped 1:1, rather than having<br clear="none">all NCs see all ICs. The reason is that in the past we saw issues with<br clear="none">the hash ring when the nodes were visible to all NCs, e.g. multiple<br clear="none">NCs were claiming overlapping sets of nodes ... having multiple ICs per<br clear="none">group is not an issue, though.<br clear="none"><br clear="none">We are currently still on Stein, but it could well be that you hit this<br clear="none">issue, as when you add more NCs, the nodes will be reshuffled.<br clear="none"><br clear="none">Cheers,<br clear="none">  Arne<br clear="none"><div class="ydp1bda8e09yqt1447358853" id="ydp1bda8e09yqtfd73636"><br clear="none">On 22.04.20 00:32, <a shape="rect" href="mailto:fsbiz@yahoo.com" rel="nofollow" target="_blank">fsbiz@yahoo.com</a> wrote:<br clear="none">> Hi folks,<br clear="none">> <br clear="none">> We are seeing some weird issues with multiple compute nodes and would <br clear="none">> appreciate your thoughts.<br clear="none">> <br clear="none">> Background:<br clear="none">> We are on stable Queens.<br clear="none">> As part of an upgrade to accommodate 3X more servers, we decided to add <br clear="none">> three more compute nodes<br clear="none">> + three more ICs for a total of 6 compute nodes and 6 ICs.<br clear="none">> As soon as we added these in preparation for the 3X increase in servers <br clear="none">> I am seeing weird<br clear="none">> behaviour.<br clear="none">> <br 
clear="none">> A general question to everyone:<br clear="none">> How many of you run your baremetal clouds with 5+ computes and ICs?<br clear="none">> Are things stable with the setup?<br clear="none">> <br clear="none">> Logs and Analysis:<br clear="none">> All compute and conductor services are up and running.<br clear="none">> <br clear="none">> 1) Baremetal node  c1bda753-d46c-4379-8d07-7787c2a4a7f2 mapped to <br clear="none">> sc-ironic08<br clear="none">> <a shape="rect" href="mailto:root@stg-cl1-dev-001" rel="nofollow" target="_blank">root@stg-cl1-dev-001</a>:~# openstack hypervisor show  <br clear="none">> c1bda753-d46c-4379-8d07-7787c2a4a7f2 | grep ironic<br clear="none">>           |<br clear="none">> | service_host         | sc-ironic08.nvc.nvidia.com<br clear="none">> <br clear="none">> 2) MAC address is 6c:b3:11:4f:8a:c0<br clear="none">> <a shape="rect" href="mailto:root@stg-cl1-dev-001" rel="nofollow" target="_blank">root@stg-cl1-dev-001</a>:~# openstack baremetal port list --node <br clear="none">> c1bda753-d46c-4379-8d07-7787c2a4a7f2<br clear="none">> +--------------------------------------+-------------------+<br clear="none">> | UUID                                 | Address           |<br clear="none">> +--------------------------------------+-------------------+<br clear="none">> | a517fb41-f977-438d-8c0d-21046e2918d9 | 6c:b3:11:4f:8a:c0 |<br clear="none">> +--------------------------------------+-------------------+<br clear="none">> <br clear="none">> <br clear="none">> <br clear="none">> <br clear="none">> 3) Provisioning starts:<br clear="none">> <br clear="none">> ironic06 receives the VIF update:  WHY?<br clear="none">> 2020-04-21 15:05:47.509 71431 INFO ironic.conductor.manager VIF <br clear="none">> 657fea31-3218-4f10-b6ad-8b6a0fa7bab8 successfully attached to node <br clear="none">> c1bda753-d46c-4379-8d07-7787c2a4a7f2<br clear="none">> <br clear="none">> ironic08 (correct one) also receives updates.<br clear="none">> [<a shape="rect" 
href="mailto:root@sc-ironic08" rel="nofollow" target="_blank">root@sc-ironic08</a> master_images]# tail -f <br clear="none">> /var/log/ironic/ironic-conductor.log | grep <br clear="none">> c1bda753-d46c-4379-8d07-7787c2a4a7f2<br clear="none">> 2020-04-21 15:08:04.943 27542 INFO ironic.conductor.task_manager <br clear="none">> [req-259b0175-65bc-4707-8c88-a65189a29954 - - - - -] Node <br clear="none">> c1bda753-d46c-4379-8d07-7787c2a4a7f2 moved to provision state <br clear="none">> "deploying" from state "wait call-back"; target provision state is "active"<br clear="none">> <br clear="none">> <br clear="none">> For now we have backed down to 3 and are stable again but I would really <br clear="none">> like to overprovision our computes and conductors if possible.<br clear="none">> <br clear="none">> Please let me know your thoughts and if anything rings a bell.<br clear="none">> <br clear="none">> thanks,<br clear="none">> Fred.<br clear="none">> <br clear="none">> <br clear="none">> <br clear="none"><br clear="none"></div></div></div>

            </div>

        </div></body></html>