Hi Arne,

Thanks for responding. Yes, it is definitely an issue with the hash ring.

With Queens: with 3 NCs and 3 ICs we are relatively stable. With 6 NCs / 6 ICs, it becomes pretty much unusable. There seems to be a race condition where 2 NCs are competing with each other to get hold of the provisioning. A few transitions are handled by one IC, then when another NC takes over, some transitions are handled by its IC. So we end up in scenarios where the image download happens on one IC but, due to the competing NCs, another IC is entrusted with doing the iSCSI transfer down to the node. And the provisioning fails because the image cannot be found.

Appreciate your quick response.

Regards,
Fred.

On Wednesday, April 22, 2020, 12:41:44 AM PDT, Arne Wiebalck <arne.wiebalck@cern.ch> wrote:

Hi Fred,

For quite a while we ran with 3 ICs and 1 NC to manage ~5000 nodes. Since this brings some scaling issues with resource tracking, we have started to split things into conductor groups.

Currently, we are at 6 ICs and 3 NCs, but the plan is to have 10 ICs with 10 NCs managing groups of ~500 nodes. The ICs and the NCs will basically be mapped 1:1, rather than having all NCs see all ICs. The reason is that in the past we saw issues with the hash ring when the nodes were visible to all NCs, e.g. multiple NCs were claiming overlapping sets of nodes ... having multiple ICs per group is not an issue, though.

We are currently still on Stein, but it could well be that you hit this issue: when you add more NCs, the nodes will be reshuffled.

Cheers,
Arne

On 22.04.20 00:32, fsbiz@yahoo.com wrote:
Hi folks,
We are seeing some weird issues with multiple compute nodes and would appreciate your thoughts.
Background: We are on stable Queens. As part of an upgrade to accommodate 3X more servers, we decided to add three more compute nodes + three more ICs, for a total of 6 compute nodes and 6 ICs. As soon as we added these in preparation for the 3X increase in servers, I started seeing weird behaviour.
A general question to everyone: How many of you run your baremetal clouds with 5+ computes and ICs? Are things stable with that setup?
Logs and Analysis: all compute and conductor services are up and running.
1) Baremetal node c1bda753-d46c-4379-8d07-7787c2a4a7f2 is mapped to sc-ironic08:

root@stg-cl1-dev-001:~# openstack hypervisor show c1bda753-d46c-4379-8d07-7787c2a4a7f2 | grep ironic
| service_host | sc-ironic08.nvc.nvidia.com |
2) MAC address is 6c:b3:11:4f:8a:c0:

root@stg-cl1-dev-001:~# openstack baremetal port list --node c1bda753-d46c-4379-8d07-7787c2a4a7f2
+--------------------------------------+-------------------+
| UUID                                 | Address           |
+--------------------------------------+-------------------+
| a517fb41-f977-438d-8c0d-21046e2918d9 | 6c:b3:11:4f:8a:c0 |
+--------------------------------------+-------------------+
3) Provisioning starts:
ironic06 receives the VIF update. Why?

2020-04-21 15:05:47.509 71431 INFO ironic.conductor.manager VIF 657fea31-3218-4f10-b6ad-8b6a0fa7bab8 successfully attached to node c1bda753-d46c-4379-8d07-7787c2a4a7f2
ironic08 (the correct one) also receives updates:

[root@sc-ironic08 master_images]# tail -f /var/log/ironic/ironic-conductor.log | grep c1bda753-d46c-4379-8d07-7787c2a4a7f2
2020-04-21 15:08:04.943 27542 INFO ironic.conductor.task_manager [req-259b0175-65bc-4707-8c88-a65189a29954 - - - - -] Node c1bda753-d46c-4379-8d07-7787c2a4a7f2 moved to provision state "deploying" from state "wait call-back"; target provision state is "active"
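For what it's worth, the reshuffling effect is easy to see with a toy consistent-hash model. This is not the actual Ironic/nova hash-ring code, just a minimal sketch I put together to illustrate why adding compute hosts can move ownership of a node; the host names "nc1".."nc6" and the replica count are made up:

import bisect
import hashlib

def build_ring(hosts, replicas=32):
    # Place several hash points on the ring for each host.
    ring = []
    for host in hosts:
        for r in range(replicas):
            point = int(hashlib.md5(f"{host}-{r}".encode()).hexdigest(), 16)
            ring.append((point, host))
    ring.sort()
    return ring

def owner(ring, node_uuid):
    # The node is owned by the first host point clockwise from its own hash
    # (wrapping around the ring).
    point = int(hashlib.md5(node_uuid.encode()).hexdigest(), 16)
    keys = [p for p, _ in ring]
    idx = bisect.bisect(keys, point) % len(ring)
    return ring[idx][1]

node = "c1bda753-d46c-4379-8d07-7787c2a4a7f2"
print(owner(build_ring(["nc1", "nc2", "nc3"]), node))
print(owner(build_ring(["nc1", "nc2", "nc3", "nc4", "nc5", "nc6"]), node))

If the old and new owners' views of the ring are briefly out of sync (e.g. while the extra services register and each host rebuilds its ring), two NCs can both believe they own the node, which would match the competing behaviour in the logs above.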
For now we have backed down to 3 NCs and 3 ICs and are stable again, but I would really like to overprovision our computes and conductors if possible.
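If we do go back up to 6, the 1:1 grouping Arne describes above would look roughly like the following. Caveat: this is a sketch from memory; if I remember right, conductor groups landed after Queens (Rocky on the ironic side, Stein-era support in nova's ironic driver), so it would need an upgrade first, and the group name "rack1" is made up:

# ironic.conf on one conductor (IC)
[conductor]
conductor_group = rack1

# assign nodes to that group
openstack baremetal node set <node-uuid> --conductor-group rack1

# nova.conf on the matching compute (NC)
[ironic]
partition_key = rack1
peer_list = <this compute's hostname>

With that, each NC only sees the nodes in its own group, so the hash ring never has to arbitrate between all NCs for all nodes.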
Please let me know your thoughts and if anything rings a bell.
thanks, Fred.
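For anyone else chasing this: a quick way to snapshot which compute service each node is currently mapped to, so it can be diffed before and after adding NCs. This just loops over the same commands used above (assuming the baremetal OSC plugin is installed; adjust to taste):

for n in $(openstack baremetal node list -f value -c UUID); do
  printf '%s -> ' "$n"
  openstack hypervisor show "$n" -f value -c service_host
done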