On Thu, Dec 21, 2023 at 11:28 AM Thiago De Moraes Teixeira <teixeira.thiago@luizalabs.com> wrote:

do that inside a maintenance window,
to not be concerned about availability issues during the process.

More than that, I would encourage you to disable writes; ideally just turn off all your proxies. You should probably also stop the account/container consistency daemons - or teach them how to abort/skip rings marked for a part power increase like the object services do:

https://github.com/NVIDIA/swift/blob/master/swift/obj/replicator.py#L911-L917

your recommendation is do this increase 1 per 1 part
power, like 8->9;9->10;...19->20?

I mean, going from 8->9 already *doubles* your number of partitions. Exponential growth goes pretty fast. So even if you only have 1 part per device at part power 8, that becomes 2 (9), 4 (10), 8 (11), 16 (12), 32 (13), 64 (14), 128 (15) - you can probably stop once you have >100 parts per device. But I'm only "recommending" it because that's what's supported and what we've seen work in the past.
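To make the doubling concrete, here is a quick sketch of that arithmetic. The function name and the starting point of 1 part per device at part power 8 are just for illustration; plug in your own numbers.

```python
def parts_per_device(start_parts, start_power, target_power):
    """Parts per device at each part power; every +1 step doubles the count."""
    return {
        power: start_parts * 2 ** (power - start_power)
        for power in range(start_power, target_power + 1)
    }


# Illustrative starting point only: 1 part per device at part power 8.
growth = parts_per_device(1, 8, 15)
for power, parts in growth.items():
    print(f"part power {power}: {parts} part(s) per device")
```

With these inputs the count first passes 100 parts per device at part power 15.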

One idea I've just thought of, besides this adaptation on swift ring
builder, is to add an input argument to map new part power value
(thinking if this makes sense...)

I think that makes sense for your use-case. Last time we did a part power increase we did *3* in a row. Each increase would disable replication while relinker/cleanup was running - then we'd let replication run for ~24 hours to address any inconsistencies, swap out failed drives etc. Then repeat the whole process again - it was pretty intensive for ops, but they got good at it and our users didn't notice.

Since account/container doesn't support writes to relinking databases - you're going to be taking the cluster offline. Depending on how your ring management is automated it might be better to do the full relink from part power 8->X in one ring push. Christian is the expert but I *think* however many times you want to split a part you *should* be able to assign those new parts to the device they're actually on.

If you're doing, say, 3 increases at once, then instead of part 1 splitting into 2 and 3, it'd split into [8, 9, 10, 11, 12, 13, 14, 15] - and you'd want to make sure all those parts are assigned to each of the devices currently holding a replica of part 1.

(you should double check my math: https://paste.openstack.org/show/b0TKpjHzXSs7z8xrSVtK/)
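Here is a rough sketch of that split, relying on the fact that a partition number is the top part-power bits of the hash, so each increase appends one low-order bit. This is not what's in the paste, and `child_parts` and `expand_row` are hypothetical helpers working on plain lists, not ring-builder APIs.

```python
def child_parts(part, increase):
    """New partition numbers that one old partition splits into."""
    first = part << increase
    return list(range(first, first + (1 << increase)))


def expand_row(old_row, increase):
    """Expand one replica's part -> device list to the new part power.

    Every new part inherits the device of the old part it was split from,
    so no data has to move between devices.
    """
    return [old_row[new_part >> increase]
            for new_part in range(len(old_row) << increase)]


print(child_parts(1, 1))  # one increase
print(child_parts(1, 3))  # three increases at once
```

You'd apply `expand_row` to each replica's row, which is what keeps all the new parts on the devices that already hold a replica of the old part.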

The second step, after you get your ring figured out, is to traverse the account/container devices' filesystem trees and for each database calculate what the new partition assignment is (each individual database in part 1 will only go into one of the new parts). Then you just move it into the right dir on the same device - since if you did your ring manipulation correctly it should already belong on that device. You can probably use the existing object-relinker as inspiration, but it would require heavy modification to work with account and container databases:

https://github.com/NVIDIA/swift/blob/master/swift/cli/relinker.py
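As a starting point, the path calculation might look roughly like this. It assumes the usual on-disk layout of `<datadir>/<part>/<suffix>/<hash>/<hash>.db`; `partition_for_hash` and `new_db_dir` are hypothetical names written for this sketch, not functions from the relinker, and it only computes the destination without moving anything.

```python
import os


def partition_for_hash(hex_hash, part_power):
    """Partition = top `part_power` bits of the first 4 bytes of the hash."""
    return int(hex_hash[:8], 16) >> (32 - part_power)


def new_db_dir(db_dir, new_part_power):
    """Map .../<datadir>/<part>/<suffix>/<hash> to its dir at the new part power."""
    suffix_dir, hex_hash = os.path.split(db_dir.rstrip(os.sep))
    part_dir, suffix = os.path.split(suffix_dir)
    datadir = os.path.dirname(part_dir)
    new_part = partition_for_hash(hex_hash, new_part_power)
    return os.path.join(datadir, str(new_part), suffix, hex_hash)


# Made-up example path; the hash dir name is all the calculation needs.
old_dir = "/srv/node/sdb1/containers/212/27e/d41d8cd98f00b204e9800998ecf8427e"
print(new_db_dir(old_dir, 11))
```

A real tool would also need to handle locking, pending files, and partially completed moves, which is where most of the heavy modification would go.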

I'd encourage you to share any modifications on gerrit even if you're not interested in polishing them for contribution upstream; others could potentially use your work as a starting point. The benefit to you would mostly be the attention of people who have gone through this process for an object part power increase: they might be able to spot issues with your approach (i.e. bugs) or contribute some tests that help ensure your migration goes smoothly/quickly.

About our cluster, we are gathering information about the production
environment like how many accounts/containers do we have exactly

Totals are good; also the number of databases per disk and per partition (max, min, avg, std-dev).
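If it helps, those summary numbers are easy to get once you have a list of counts. A small sketch, assuming you've already collected one count per disk (or per partition) some other way; `summarize` is just an illustrative name.

```python
import statistics


def summarize(counts):
    """Max, min, average and population std-dev for a list of counts."""
    counts = list(counts)
    return {
        "max": max(counts),
        "min": min(counts),
        "avg": statistics.mean(counts),
        "std_dev": statistics.pstdev(counts),
    }


# Made-up counts of databases per partition, for illustration only.
stats = summarize([2, 4, 4, 4, 5, 5, 7, 9])
print(stats)
```

Keep in mind that partitions with no databases won't show up if you only count directories on disk, so include them as zeros if you want the min and avg to reflect that.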

I can come back
later with useful information when I've it!

Appreciate!

That'll be great fun! I appreciate you taking on such an exciting endeavor and sharing your experience with the community.