Tue Apr 28 04:00:13 UTC 2020

Hi Dmitry,

> I don't think the number of nova-compute instances is related to this problem. NC and IC are
> balanced independently, every request from Nova is re-balanced again in ironic-api.

Yes, you are correct.  Before last week, the only documentation I knew of that talks about the HA of ICs was this note buried in the installation guide:
As of the Newton release, it is possible to have multiple nova-compute services running the ironic virtual driver (in nova) to provide redundancy. Bare metal nodes are mapped to the services via a hash ring. If a service goes down, the available bare metal nodes are remapped to different services.

Once active, a node will stay mapped to the same nova-compute even when it goes down. The node is unable to be managed through the Compute API until the service responsible returns to an active state.

Reading that, I always assumed that the mapping of nodes <-> ICs was handled by one of the two services that run on each IC: openstack-nova-compute or openstack-ironic-conductor.

But the presentation of this bug didn't make sense under that assumption. We would see provisions that would have some steps on the correct IC, and some on a different IC. Completely odd that somehow some steps would happen on a different IC if that mapping is on an IC...

Doing some more digging, I found the following in the Ironic developer documentation (emphasis mine):

Each Conductor registers itself in the database upon start-up, and periodically updates the timestamp of its record. Contained within this registration is a list of the drivers which this Conductor instance supports. This allows all services to maintain a consistent view of which Conductors and which drivers are available at all times.

Based on their respective driver, all nodes are mapped across the set of available Conductors using a consistent hashing algorithm. Node-specific tasks are dispatched from the API tier to the appropriate conductor using conductor-specific RPC channels. As Conductor instances join or leave the cluster, nodes may be remapped to different Conductors, thus triggering various driver actions such as take-over or clean-up.

It turns out that the openstack-api-service that runs on the CPNs is actually in charge of the nodes <-> IC mapping. If you hit an Ironic API service with an IC request (like rebooting a node), the API service places a message into (what it thinks is) the controlling IC's bucket in RabbitMQ.
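To make the mapping idea concrete, here is a minimal sketch of a consistent hash ring in Python. It is a toy under stated assumptions, not Ironic's actual implementation: the class name `HashRing`, the method `conductor_for`, the `replicas` parameter, the host names `ic01`..`ic06` and the node names are all made up for illustration, and MD5 is used only as a convenient stable hash.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable integer hash, so every process computes the same ring position."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent hash ring (hypothetical, for illustration only)."""

    def __init__(self, conductors, replicas=64):
        # Each conductor contributes `replicas` virtual points on the ring.
        self._points = sorted(
            (_hash(f"{host}-{i}"), host)
            for host in conductors
            for i in range(replicas)
        )
        self._keys = [point for point, _ in self._points]

    def conductor_for(self, node_uuid: str) -> str:
        # First ring point clockwise from the node's hash, wrapping at the end.
        idx = bisect.bisect(self._keys, _hash(node_uuid)) % len(self._keys)
        return self._points[idx][1]


# Hypothetical node and conductor names.
nodes = [f"node-{i}" for i in range(1000)]
ring3 = HashRing(["ic01", "ic02", "ic03"])
ring6 = HashRing(["ic01", "ic02", "ic03", "ic04", "ic05", "ic06"])

moved = sum(ring3.conductor_for(n) != ring6.conductor_for(n) for n in nodes)
print(f"{moved} of {len(nodes)} nodes remapped after growing from 3 to 6 ICs")
```

In this sketch, whoever builds the ring decides where a request goes, which is why the view held by each API service matters: two processes agree on the owner of a node only if they build the ring from the same list of conductors.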

> This is certainly concerning. When adding IC, do you wait for nodes to rebalance (it may take minutes)?

I did, but the nodes were not balancing.  What I thought was a race condition was actually the result of the hash ring being corrupted in one or more of the CPNs running ironic-api.
Once I realized how it works and was able to find the bad API server, I just restarted all 3 API services (one by one, disabling the restarting member in the SLB so that it would not receive traffic).
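The symptom (some steps of one provision landing on a different IC) can be reproduced in a small simulation. This is only a sketch of one way it could arise: it assumes one API server is working from an outdated conductor list, which may not be what actually corrupted the ring here. It also swaps in rendezvous hashing as a compact stand-in for the hash ring, and `pick_conductor`, the `api-N` names and the `icNN` host names are hypothetical.

```python
import hashlib


def pick_conductor(node_uuid, conductors):
    """Rendezvous hashing: the conductor with the highest score owns the node."""
    def score(host):
        return hashlib.sha256(f"{host}:{node_uuid}".encode("utf-8")).hexdigest()
    return max(conductors, key=score)


all_ics = [f"ic{i:02d}" for i in range(1, 7)]

# Hypothetical views: two API servers see all six ICs, one still sees three.
api_views = {
    "api-1": all_ics,
    "api-2": all_ics,
    "api-3": all_ics[:3],  # assumed stale view
}

nodes = [f"node-{i}" for i in range(200)]

# Nodes whose owner depends on which API server happens to take the request.
split = [
    n for n in nodes
    if len({pick_conductor(n, view) for view in api_views.values()}) > 1
]
print(f"{len(split)} of {len(nodes)} nodes would be routed to different ICs "
      "depending on which API server handles the request")
```

Behind a load balancer, consecutive requests for the same node can reach different API servers, so under this assumption a single provision would see its steps spread over two ICs.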

Things seem stable again for now.

    On Monday, April 27, 2020, 08:52:13 AM PDT, Dmitry Tantsur <dtantsur at redhat.com> wrote:  

On Wed, Apr 22, 2020 at 5:59 PM fsbiz at yahoo.com <fsbiz at yahoo.com> wrote:

 Hi Arne,
Thanks for responding.  Yes, it is definitely an issue with the hash ring.
With Queens: with 3 NCs and 3 ICs we are relatively stable.
With 6 NCs/6 ICs, it becomes pretty much unusable.  There seems to be a race condition where 2 NCs are competing with each other to get hold of the provisioning.

I don't think the number of nova-compute instances is related to this problem. NC and IC are balanced independently, every request from Nova is re-balanced again in ironic-api.
A few transitions are handled by one IC, then when another NC takes over, some transitions are handled by its IC. So we end up in scenarios where the image download happens on one IC but, due to the competing NCs, another IC is entrusted with doing the iSCSI transfer down to the node.  And the provision fails because the image cannot be found.

This is certainly concerning. When adding IC, do you wait for nodes to rebalance (it may take minutes)?

Appreciate your quick response.
    On Wednesday, April 22, 2020, 12:41:44 AM PDT, Arne Wiebalck <arne.wiebalck at cern.ch> wrote:  
 Hi Fred,

For quite a while we ran with 3 ICs and 1 NC to manage ~5000 nodes.

Since this brings some scaling issues with resource tracking, we have
started to split things into conductor groups. Currently, we are at 6
ICs and 3 NCs, but the plan is to have 10 ICs with 10 NCs managing
groups of ~500 nodes.

The ICs and the NCs will basically be mapped 1:1, rather than having
all NCs see all ICs. The reason is that in the past we saw issues with
the hash ring when the nodes were visible to all NCs, e.g. multiple
NCs were claiming overlapping set of nodes ... having multiple ICs per
group is not an issue, though.

We are currently still on Stein, but it could well be that you hit this
issue as when you add more NCs, the nodes will be reshuffled.


On 22.04.20 00:32, fsbiz at yahoo.com wrote:
> Hi folks,
> We are seeing some weird issues with multiple compute nodes and would 
> appreciate your thoughts.
> Background:
> We are on stable Queens.
> As part of an upgrade to accommodate 3X more servers, we decided to add 
> three more compute nodes
> + three more ICs for a total of 6 compute nodes and 6 ICs.
> As soon as we added these in preparation for the 3X increase in servers 
> I am seeing weird
> behaviour.
> A general question to everyone:
> How many of you run your baremetal clouds with 5+ computes and ICs?
> Are things stable with the setup ?
> Logs and Analysis:
> all compute and conductor services are up and running.
> 1) Baremetal node  c1bda753-d46c-4379-8d07-7787c2a4a7f2 mapped to 
> sc-ironic08
> root@stg-cl1-dev-001:~# openstack hypervisor show
> c1bda753-d46c-4379-8d07-7787c2a4a7f2 | grep ironic
>           |
> | service_host         | sc-ironic08.nvc.nvidia.com
> 2) MAC address is 6c:b3:11:4f:8a:c0
> root@stg-cl1-dev-001:~# openstack baremetal port list --node
> c1bda753-d46c-4379-8d07-7787c2a4a7f2
> +--------------------------------------+-------------------+
> | UUID                                 | Address           |
> +--------------------------------------+-------------------+
> | a517fb41-f977-438d-8c0d-21046e2918d9 | 6c:b3:11:4f:8a:c0 |
> +--------------------------------------+-------------------+
> 3) Provisioning starts:
> ironic06 receives the VIF update:  WHY ?
> 2020-04-21 15:05:47.509 71431 INFO ironic.conductor.manager VIF 
> 657fea31-3218-4f10-b6ad-8b6a0fa7bab8 successfully attached to node 
> c1bda753-d46c-4379-8d07-7787c2a4a7f2
> ironic08 (correct one) also receives updates.
> [root@sc-ironic08 master_images]# tail -f
> /var/log/ironic/ironic-conductor.log | grep 
> c1bda753-d46c-4379-8d07-7787c2a4a7f2
> 2020-04-21 15:08:04.943 27542 INFO ironic.conductor.task_manager 
> [req-259b0175-65bc-4707-8c88-a65189a29954 - - - - -] Node 
> c1bda753-d46c-4379-8d07-7787c2a4a7f2 moved to provision state 
> "deploying" from state "wait call-back"; target provision state is "active"
> For now we have backed down to 3 and are stable again but I would really 
> like to overprovision our computes and conductors if possible.
> Please let me know your thoughts and if anything rings a bell.
> thanks,
> Fred.

