Sharding with and/or within cells will help to some degree (and we are actively looking into this, as you probably know), but I think that should not stop us from checking whether there are algorithmic improvements (e.g. in how the data is collected), or whether moving to a different locking granularity, or even parallelising the update, would be a feasible additional improvement.
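Just to make the locking-granularity/parallelisation idea concrete, here is a minimal, self-contained sketch (not actual Nova code, and all names in it are made up for illustration): instead of serialising the whole periodic update behind one coarse lock, each node gets its own lock so updates for independent ironic nodes can run concurrently.

    # Hypothetical sketch only; update_node_resources, _lock_for, etc. are
    # illustrative names, not Nova APIs.
    import threading
    from collections import defaultdict
    from concurrent.futures import ThreadPoolExecutor

    _node_locks = defaultdict(threading.Lock)   # one lock per node UUID
    _node_locks_guard = threading.Lock()        # protects the dict itself

    def _lock_for(node_uuid):
        with _node_locks_guard:
            return _node_locks[node_uuid]

    def update_node_resources(node_uuid):
        """Update usage/inventory for a single node under its own lock."""
        with _lock_for(node_uuid):
            # ... collect data and write this node's resource usage ...
            pass

    def update_available_resources(node_uuids, max_workers=8):
        """Run the per-node updates in parallel rather than looping
        serially under one big lock."""
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            list(pool.map(update_node_resources, node_uuids))

Whether something along these lines is safe obviously depends on what invariants the current coarse lock is protecting across nodes; this is only meant to show the shape of the change being suggested.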
All of that code was designed around one node per compute host. In the ironic case it was expanded (hacked) to support N nodes, where N is not huge. Giving it a huge number, and using a driver where nodes go into maintenance/cleaning for long periods of time, is asking for trouble.
Given that there is only one case where N can legitimately be greater than one, I'm really hesitant to back a proposal to redesign it for large values of N.
Perhaps we as a team just need to document what sane, tested, and expected-to-work values for N are?
--Dan