On Mon, Nov 11, 2019 at 4:05 PM Dan Smith <dms@danplanet.com> wrote:
>> Sharding with and/or within cells will help to some degree (and we are
>> actively looking into this as you probably know), but I think that
>> should not stop us from checking if there are algorithmic improvements
>> (e.g. when collecting the data), or if moving to a different locking
>> granularity or even parallelising the update are feasible additional
>> improvements.
>
> All of that code was designed around one node per compute host. In the
> ironic case it was expanded (hacked) to support N where N is not
> huge. Giving it a huge number, and using a driver where nodes go into
> maintenance/cleaning for long periods of time is asking for trouble.
>
> Given there is only one case where N can legitimately be greater than
> one, I'm really hesitant to back a proposal to redesign it for large
> values of N.
>
> Perhaps we as a team just need to document what sane, tested, and
> expected-to-work values for N are?


What we discussed at the PTG was the fact that we only have one global semaphore for this module, while we have N ResourceTracker Python objects (where N is the number of Ironic nodes per compute service).
According to CERN, it looks like this semaphore blocks during the periodic update, so we basically said this could be handled as just a bugfix, given we could create N semaphores (one per node) instead.
That said, since such a change could introduce problems of its own, we want to make sure it can be tested not only in the gate but also directly by CERN.
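
To make the idea concrete, here is a rough sketch of the locking granularity change (this is not the actual nova code; names like _node_locks and update_available_resource_for_node are made up for illustration, and the real module-level semaphore is only represented by a stand-in):

    import threading
    from collections import defaultdict

    # Stand-in for today's situation: a single module-level lock
    # serialises the periodic update and the claim paths for *every*
    # node managed by the compute service.
    _global_lock = threading.Lock()

    # Proposed granularity: one lock per Ironic node, created lazily,
    # so a slow update for one node no longer blocks the others.
    _node_locks = defaultdict(threading.Lock)

    def update_available_resource_for_node(nodename, do_update):
        # Only work on the *same* node serialises against itself;
        # updates for different nodes can proceed independently.
        with _node_locks[nodename]:
            do_update(nodename)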

Another discussion was about having more than one thread for the compute service (i.e. N threads), but my opinion was that we should first look at the above before discussing any other approach.
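
For completeness, the N-threads idea could look roughly like the sketch below (again made-up names, and the real service runs on eventlet greenthreads, so the actual change would not look like this), combined with the per-node locks above so concurrent updates stay safe:

    from concurrent.futures import ThreadPoolExecutor

    def update_all_nodes(nodenames, update_one_node, max_workers=8):
        # Run the per-node updates concurrently instead of one after
        # the other, so slow nodes (e.g. in cleaning) overlap instead
        # of stalling the whole periodic task.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            list(pool.map(update_one_node, nodenames))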

-S

> --Dan