[ironic]: multiple compute nodes

Dmitry Tantsur dtantsur at redhat.com
Mon Apr 27 15:46:27 UTC 2020


Hi,

On Wed, Apr 22, 2020 at 5:59 PM fsbiz at yahoo.com <fsbiz at yahoo.com> wrote:

> Hi Arne,
>
> Thanks for responding.
> Yes, it is definitely an issue with the hash ring.
>
> With Queens:
> With 3 NCs and 3 ICs we are relatively stable.
>
> With 6 NCs/6 ICs, it becomes pretty much unusable. There seems to be a
> race condition where 2 NCs
> are competing with each other to take ownership of the provisioning.
>

I don't think the number of nova-compute instances is related to this
problem: NCs and ICs are balanced independently, and every request from
Nova is re-balanced in ironic-api.
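
To make the reshuffling concrete, here is a minimal, self-contained
consistent-hashing sketch (illustration only, not the actual
ironic.common.hash_ring code; the conductor names and node counts are made
up). Each node UUID hashes onto a ring of conductor host names, so adding
conductors remaps only a fraction of the nodes:

import bisect
import hashlib
import uuid

def _hash(key):
    # Hash a string onto the ring's key space.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def build_ring(conductors, replicas=32):
    # Give each conductor several points on the ring to smooth the balance.
    return sorted((_hash("%s-%d" % (c, i)), c)
                  for c in conductors for i in range(replicas))

def conductor_for(ring, node_uuid):
    # A node is owned by the first ring point at or after its own hash.
    keys = [h for h, _ in ring]
    return ring[bisect.bisect(keys, _hash(node_uuid)) % len(ring)][1]

nodes = [str(uuid.uuid4()) for _ in range(1000)]
before = build_ring(["ic01", "ic02", "ic03"])
after = build_ring(["ic01", "ic02", "ic03", "ic04", "ic05", "ic06"])
moved = sum(conductor_for(before, n) != conductor_for(after, n)
            for n in nodes)
print("%d of %d nodes moved to a different conductor" % (moved, len(nodes)))

The services cache their view of this mapping and refresh it periodically,
which is where the advice below about waiting for a rebalance comes from.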


> A few transitions are
> handled by one IC, then when another NC takes over, some transitions are
> handled by its IC.
> So we end up in scenarios where the image download happens on one IC but,
> due to the competing
> NCs, another IC is entrusted with doing the iSCSI transfer down to the
> node. And the provisioning
> fails because the image cannot be found.
>

This is certainly concerning. When adding ICs, do you wait for the nodes to
rebalance (it may take minutes)?
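
If it helps, you can watch the mapping settle from the client side. A
sketch, assuming a Stein-or-later API and python-ironicclient where the
conductor list endpoint and the node "conductor" field exist (on Queens
this won't be available, so the conductor logs are the fallback):

# Which conductors are up, and which conductor group they serve:
openstack baremetal conductor list

# Which conductor a node is currently mapped to:
openstack --os-baremetal-api-version 1.49 baremetal node show \
    c1bda753-d46c-4379-8d07-7787c2a4a7f2 --fields uuid conductor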

Dmitry


>
> Appreciate your quick response.
>
> Regards,
> Fred.
>
> On Wednesday, April 22, 2020, 12:41:44 AM PDT, Arne Wiebalck <
> arne.wiebalck at cern.ch> wrote:
>
>
> Hi Fred,
>
> For quite a while we ran with 3 ICs and 1 NC to manage ~5000 nodes.
>
> Since this brings some scaling issues with resource tracking, we have
> started to split things into conductor groups. Currently, we are at 6
> ICs and 3 NCs, but the plan is to have 10 ICs with 10 NCs managing
> groups of ~500 nodes.
>
> The ICs and the NCs will basically be mapped 1:1, rather than having
> all NCs see all ICs. The reason is that in the past we saw issues with
> the hash ring when the nodes were visible to all NCs, e.g. multiple
> NCs were claiming overlapping sets of nodes ... having multiple ICs per
> group is not an issue, though.
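
For anyone wanting to reproduce that layout, the wiring looks roughly like
the following (a sketch using Stein-era option names; the group name, node
UUID and hostname are placeholders, so double-check against the docs for
your release):

# ironic.conf on every conductor that should serve the group:
[conductor]
conductor_group = rack42

# Assign nodes to the group (API >= 1.46):
openstack baremetal node set --conductor-group rack42 <node-uuid>

# nova.conf on the single nova-compute meant to own that group:
[ironic]
partition_key = rack42
peer_list = <that-compute-service-hostname>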
>
> We are currently still on Stein, but it could well be that you hit this
> issue: when you add more NCs, the nodes will be reshuffled.
>
> Cheers,
>   Arne
>
> On 22.04.20 00:32, fsbiz at yahoo.com wrote:
> > Hi folks,
> >
> > We are seeing some weird issues with multiple compute nodes and would
> > appreciate your thoughts.
> >
> > Background:
> > We are on stable Queens.
> > As part of an upgrade to accommodate 3X more servers, we decided to add
> > three more compute nodes
> > + three more ICs for a total of 6 compute nodes and 6 ICs.
> > As soon as we added these in preparation for the 3X increase in servers,
> > I started seeing weird
> > behaviour.
> >
> > A general question to everyone:
> > How many of you run your baremetal clouds with 5+ computes and ICs?
> > Are things stable with the setup?
> >
> > Logs and Analysis:
> > All compute and conductor services are up and running.
> >
> > 1) Baremetal node  c1bda753-d46c-4379-8d07-7787c2a4a7f2 mapped to
> > sc-ironic08
> > root at stg-cl1-dev-001:~# openstack hypervisor show
> > c1bda753-d46c-4379-8d07-7787c2a4a7f2 | grep ironic
> > | service_host         | sc-ironic08.nvc.nvidia.com |
> >
> > 2) MAC address is 6c:b3:11:4f:8a:c0
> > root at stg-cl1-dev-001:~# openstack baremetal port list --node
> > c1bda753-d46c-4379-8d07-7787c2a4a7f2
> > +--------------------------------------+-------------------+
> > | UUID                                 | Address           |
> > +--------------------------------------+-------------------+
> > | a517fb41-f977-438d-8c0d-21046e2918d9 | 6c:b3:11:4f:8a:c0 |
> > +--------------------------------------+-------------------+
> >
> >
> >
> >
> > 3) Provisioning starts:
> >
> > ironic06 receives the VIF update: WHY?
> > 2020-04-21 15:05:47.509 71431 INFO ironic.conductor.manager VIF
> > 657fea31-3218-4f10-b6ad-8b6a0fa7bab8 successfully attached to node
> > c1bda753-d46c-4379-8d07-7787c2a4a7f2
> >
> > ironic08 (correct one) also receives updates.
> > [root at sc-ironic08 master_images]# tail -f
> > /var/log/ironic/ironic-conductor.log | grep
> > c1bda753-d46c-4379-8d07-7787c2a4a7f2
> > 2020-04-21 15:08:04.943 27542 INFO ironic.conductor.task_manager
> > [req-259b0175-65bc-4707-8c88-a65189a29954 - - - - -] Node
> > c1bda753-d46c-4379-8d07-7787c2a4a7f2 moved to provision state
> > "deploying" from state "wait call-back"; target provision state is
> "active"
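
As a side note, while chasing this kind of flip-flop it can help to poll
the node during the deploy: the long-standing "reservation" field names the
conductor currently holding the lock on the node (a sketch; field names
should also be present on Queens, but verify with your client):

openstack baremetal node show c1bda753-d46c-4379-8d07-7787c2a4a7f2 \
    --fields reservation provision_state target_provision_state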
> >
> >
> > For now we have backed down to 3 NCs/3 ICs and are stable again, but I
> > would really like to overprovision our computes and conductors if possible.
> >
> > Please let me know your thoughts and if anything rings a bell.
> >
> > thanks,
> > Fred.
> >
> >
> >
>
>